Re: rdma kernel tree
Doug, Do you have everything in place such that what you stage for upstream is subject to linux-next auto merge tests and such? Or. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
On 5/27/2015 12:11 AM, Jason Gunthorpe wrote: On Tue, May 26, 2015 at 04:32:58PM -0400, Doug Ledford wrote: - ifcfg/udev/networkmanager: So what happens when I do ip link add link ib0 name ib0.1 type ipoib And get two IPoIB interfaces with the same GUID? I doubt any sane user would want to apply the same config to those two interfaces. No, they probably don't want to apply the same rules to both interfaces. I'm not entirely sure I agree with the argument though. I fully expected this to fail without a pkey argument on the ip command line. Does that matter to the above tools? Are they using PKey,GUID as their key? The net stack doesn't allow users to do the same thing with Ethernet devices, so I'm not sure we shouldn't be disallowing this as opposed to creating duplicate devices that are identical in all ways except name. The netstack doesn't allow it for ethernet because it would create a 2nd identical LLADDR, and LLADDRs must be unique. Because the QPN is part of the LLADDR IB can create two interfaces on the same physical port that are completely separated by hardware. Read Haggi's email, he explains how they plan to use this to create interfaces that can be delegated to namespaces. It is not a bad idea really.. So prepare for a world where each namespace has a child IPoIB interface with a unique QPN, but the same Pkey and GUID as the host. The breakage from assuming GUID == unique will become a problem. Unbreaking it is a UAPI change, not impossible, but do we really care enough about 8 or 20 to push for that? In truth, at least right now, it's all moot. Since we can't set the subnet prefix, the qpn, or the flags, anything above 8 bytes is immutable regardless of how many bytes we pass in. So even if we say we aren't going to change the UAPI and for everything to 20, the real world result is that 8 works exactly the same and has no functional difference. Not quite, in the 20 byte format the 8 bytes of the GUID are in the last 8/20 bytes, so the app would have to place 12 zeros and then the GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID) This is why the question of 'what is ILFA_VF_MAC' is so important, every option presented (MAC,GUID,LLADDR) are incompatible with each other. I agree with Doug that to be practical here, libvirt and Co. would really want to use rtnetlink based provisioning of IB VFs, at least in a similar manner done for Eth VFs. So with this assumption at hand, my vote goes to having user-space to provide the eight bytes of vGUID through the ndo_set_vf_mac call into IPoIB. I don't see the real value of user space providing the four zero bytes (19-16) and the 8 bytes of the subnet prefix provided by the SM. My personal thinking is that the important thing to address is consistency between what the virtualization system provisions on the host (ndo_set_vf_mac) to the DHCP server scheme they build. Do we have a go here? Also few comments on DHCP: If we're talking on different vlans/Eth or pkey/IB - it's totally OK for two entities (== IPoIB instances under IB) on the physical subnet to use the same identifier (IB/GUID, Eth/MAC) if they are on two different L2 broadcast domains. The DHCP server is expected to have a different mapping scheme per such virtual L2 subnet. For SRIOV, we don't expect two VFs on the network to use the same vGUID, so DHCP wise we should be OK. Today the Client-ID works fine for SRIOV schemes which are based on 8byte vGUIDs. Re two IPoIB child devices using the same GUID and the same pkey, we can enhance the system and take advantage of IB Alias GUIDs which today are only used for SRIOV for Para-Virtual and other environments too, thanks for the heads up on the necessity of doing so. What does get return? If we accept 8 or 20, then get must return 20. The get has to return 20 regardless. It's the only accepted means of getting all 20 bytes of the LLADDR. You are conflating IFLA_ADDRESS and IFLA_VF_MAC. IFLA_VF_MAC could be 8 byte and IFLA_ADDRESS could be 20, I think that makes no sense, but it wouldn't break existing stuff. Just to make sure we're on the same page, this thread deals with using rtnetlink's IFLA_VF_MAC(== struct ifla_vf_mac) for provisioning vGUID for IB VFs, through the PF IPoIB interface, not attempting to use IFLA_ADDRESS. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device
On Wed, May 27, 2015 at 1:07 AM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote: Jason, you (U2 BTW) play really, really hard - refusing to say **one** word on your approach towards the built-in udata mechanism for uverbs which I asked you to comment on. And I asked to see the user space side and you have angrily refused every time. AFAIR I never ever refused to show any piece of code which went under my hands towards Linux to any-one. For future reference, when someone asks a question and you go off on an tangental rant and ignore the question, then that process repeats, still without answering the question - most english speakers would call that refusing to answer the question. It is not looked upon kindly. Jason, It's not that you asked to see the code ala hey, do you happen to have a git with the user space code for people to inspect while doing the review on the kernel part, but rather U2 saying in a definitive manner that posting the user space code should be imposed as pre-requirement to acceptance of the kernel parts. In parallel, U2 totally rejected our usage of udata @ on the spot and when I mentioned that it's a feature which was designed for that purpose exactly and from day one, it took me three reminders to get a you know what, maybe that can fly comment from you. So here I started to realize that there's something in the attitude that goes beyond the details, and I made the you're not the boss comment. I'm really confused why you didn't just post the github links last week, the patches are all a month old on there. Was it really so offensive to you that we wanted to review the kernel UAPI patches and verbs patches together? So I guess we are both playing hard. I disagree, you act as sort of being the boss here, stating every now and then your preferences and way of engineering things as the ultimate guidelines for Linux and/or RDMA engineering. Lets be clear Or, I have given you (and others) some very pointed comments and advice, privately and publicly. That is not 'being the boss' that is contributing to fix our community. When it comes to my patch comments, I give direction on what I want to see to provide my Reviewed-By. If you don't like it, then find someone else to review your code. I'm busy, and I don't work for you. If I don't want to review some patches because my questions have been ignored, then that is entirely my perogative. Pinging me *three times this week* on this stupid timestamp thing, is somewhere between annoying and offensive. See above, why I made these pings. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
out of order packets when querying all ports in a switch through OpenSM
Hello, I am trying to send a port info request to all the ports in a switch, one after the other. I am sending all requests to the same switch in sequence through the same directed routing path: Request port number 1 Request port number 2 Request port number 3 Request port number 4 But sometimes I get responses out of order: Response port number 2 Response port number 1 Response port number 3 Response port number 4 Is it possible that the switch is processing port requests out of order (or maybe I am doing something wrong)? Best regards, Jesus -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: out of order packets when querying all ports in a switch through OpenSM
Hi Jesus, On 5/27/2015 8:56 AM, Jesus Camacho Villanueva wrote: Hello, I am trying to send a port info request to all the ports in a switch, one after the other. I am sending all requests to the same switch in sequence through the same directed routing path: Request port number 1 Request port number 2 Request port number 3 Request port number 4 But sometimes I get responses out of order: Response port number 2 Response port number 1 Response port number 3 Response port number 4 Is it possible that the switch is processing port requests out of order (or maybe I am doing something wrong)? One way this could occur is that request to switch for port number 1 either does not make it to switch or response does not make it back and it times out and is retransmitted by kernel. This depends on how umad is setup in terms of retries and timeout. Is it always the same switch or switch type ? What switch is it ? -- Hal Best regards, Jesus -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rdma kernel tree
On Wed, 2015-05-27 at 12:50 +0300, Or Gerlitz wrote: Doug, Do you have everything in place such that what you stage for upstream is subject to linux-next auto merge tests and such? Or. No, I have 0day in place, but not linux-next to my knowledge (unless Stephen already set it up, but I haven't asked him to yet). Stephen, can you set it up so that the linux-next tag from git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git is added to the linux-next testing please? -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: [PATCH 02/14] IB/mad: Create an RMPP Base header
On Wed, May 20, 2015 at 04:03:33PM -0600, Jason Gunthorpe wrote: On Wed, May 20, 2015 at 05:23:45PM -0400, ira.weiny wrote: On Wed, May 20, 2015 at 12:27:35PM -0600, Jason Gunthorpe wrote: On Wed, May 20, 2015 at 04:13:23AM -0400, ira.we...@intel.com wrote: -struct ib_rmpp_mad { +struct ib_rmpp_base { struct ib_mad_hdr mad_hdr; struct ib_rmpp_hdr rmpp_hdr; +} __packed; + +struct ib_rmpp_mad { + struct ib_rmpp_base base; u8 data[IB_MGMT_RMPP_DATA]; }; Why can't we just use: u8 data[]; And replace various sizeof(ib_rmpp_mad) with a rmpp_payload_size() call? I don't think it makes much difference. I think there is just 1 place we use that. There are many lines that are just churning rmpp_mad to rmpp_base, using a flex array avoids that entirely. Ok. Similarly patch 12 doesn't have to introduce a opa specific structure anymore. Ok. The same sort of comment probably applies to the mad_hdr/etc as well, but I didn't look very closely. This would mean changing all the existing drivers which I'm not really keen on doing. Existing drivers using IBTA MADs expect the MADs to be exactly 256 bytes and that is exactly how struct ib_mad is defined. So I think we need to keep the definitions for drivers to use. However, I think we can clean up the use of sizeof(ib_mad) in the MAD code such that it is agnostic to the device/port type once the MAD size patch is in place. I'll look at doing that early in the series. Ira -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
On Wed, May 27, 2015 at 01:53:11PM -0400, Doug Ledford wrote: This basic idea is exactly why I think we should stick with the 20 byte LLADDR for ILFA_VF_MAC. It gives a route for the PF to tell the VF what QPN to use for IPoIB (if we ever see HW support to implement that) If we use 8 bytes then that route is closed off forever. And that's exactly as it should be. If we allow setting all 20 bytes via the VF_MAC calls, then we violate the guests should behave like they are on bare metal as much as possible rule. As a host, we get a GUID and if we want to control the QPNs for IPoIB (and indeed if we want No, that doesn't really work. The PF's QPN pool is shared between all VF guests. The VF HCA does a call to the PF driver to get a free QPN to use. The VM cannot unilaterally select a QPN without global coordination. In other words, if a QPN is to be forced for IPoIB then it must come from the PF, which must get it from the orchestration software, which must globally coordinate to ensure the QPN is available on every PF the VM may want to run on. Similar to how the MAC must be globally unique. Will this ever be done? Probably not. But if it was, this interface would be the sanest way to go. Anyhow - as far as I'm concerned, fixing the IFLA_VF_MAC to have a proper length nullifies my concern. If we really needed we can switch to a 20 byte format with good compatability options in the future once we have a proper length. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device
On Wed, May 27, 2015 at 02:54:12PM +0300, Or Gerlitz wrote: On Wed, May 27, 2015 at 1:07 AM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote: Jason, you (U2 BTW) play really, really hard - refusing to say **one** word on your approach towards the built-in udata mechanism for uverbs which I asked you to comment on. And I asked to see the user space side and you have angrily refused every time. AFAIR I never ever refused to show any piece of code which went under my hands towards Linux to any-one. For future reference, when someone asks a question and you go off on an tangental rant and ignore the question, then that process repeats, still without answering the question - most english speakers would call that refusing to answer the question. It is not looked upon kindly. Jason, It's not that you asked to see the code ala hey, do you happen to have a git with the user space code for people to inspect while doing the review on the kernel part, but rather U2 saying in a definitive manner that posting the user space code should be imposed as pre-requirement to acceptance of the kernel parts. I really didn't Or: First ask: 'We can't really look at the uapi changes here without also seeing the verbs side changes.' (I know others on the list feel the same, so I use 'we') Second ask: 'Lets see the verbs side and I'll let you know.' (.. to your questions based on my review comments ..) Third ask: 'I think we need to have the same policy.' (.. To Yann's point that other kernel communities have a mandatory UAPI policy) Fourth (exasperated) ask: 'Was it not clear? Yann and I asked to see the user space side before reviewing this series further.' I know you are ESL, and I cut you alot of slack, but *come on* - that is incredibly soft language, and certainly not bossing and imposing in a definitive manner a blanket requirement on all patches. Advice: You would be well served to spend a bit more time on your emails. I have no idea what 'but rather U2 saying' means, for instance. Sometimes I just guess at what you are trying to say :| That is the price we pay for an inclusive international community, but everyone needs to be careful before starting a flame war based on percived slight in the text and phrasing of a message. email is hard. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
On Wed, 2015-05-27 at 11:11 -0600, Jason Gunthorpe wrote: On Wed, May 27, 2015 at 10:14:06AM -0400, Doug Ledford wrote: Because the QPN is part of the LLADDR IB can create two interfaces on the same physical port that are completely separated by hardware. Read Haggi's email, he explains how they plan to use this to create interfaces that can be delegated to namespaces. It is not a bad idea really.. Yes, it is actually. The whole reason we went to GUID matching long ago was because of this exact issue. I reflected on this some more last night, and yes, I am leaning toward 'bad idea' direction too. Too much stuff breaks if you create multiple children with the same pkey/guid: - RDMA CM cannot disambiguate CM packets between them - DHCP cannot tell them apart - Net scripts/network manager won't work - IPv6 becomes totally broken That means the namespace stuff will have to create children using GUID aliases.. Glad we agree on that ;-) The *only* way this will ever be a workable item is if we A) reserve a number of queue pairs from the driver specifically for IPoIB use and B) specify which queue pairs go to which IPoIB devices at IPoIB module load This basic idea is exactly why I think we should stick with the 20 byte LLADDR for ILFA_VF_MAC. It gives a route for the PF to tell the VF what QPN to use for IPoIB (if we ever see HW support to implement that) If we use 8 bytes then that route is closed off forever. And that's exactly as it should be. If we allow setting all 20 bytes via the VF_MAC calls, then we violate the guests should behave like they are on bare metal as much as possible rule. As a host, we get a GUID and if we want to control the QPNs for IPoIB (and indeed if we want to control how many IPoIB interfaces and on what P_Keys) then we must create config files in /etc/sysconfig/network-scripts (on Red Hat, similar requirements on other distros) that would instruct the OS to create exactly what we want. But, the key point is that we are only given a GUID, and we must create everything else from our config files. Guests should be the same way. They only get the GUID to start, then the guest disk image with its self contained configuration will take over and control the rest. Not quite, in the 20 byte format the 8 bytes of the GUID are in the last 8/20 bytes, so the app would have to place 12 zeros and then the GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID) This is why the question of 'what is ILFA_VF_MAC' is so important, every option presented (MAC,GUID,LLADDR) are incompatible with each other. For Ethernet devices, it's the MAC. The equivalent of MAC on IB is the GUID. I would leave it at that. Yes, both arguments can be made: - Our netlink end point is targetting an IPoIB interface, and the equivelent to an Ethernet MAC in IPoIB language is the LLADDR. - Our netlink interface is targetting the hardware under the IPoIB interface and that MAC equivilent is the GUID IPoIB devices are constructs on top of the GUID/link, and you can have 10 IPoIB interfaces between the parent and children, but we don't need to specify all of those LLADDRs, we just need to give a unique GUID and allow the guest OS to create their own IPoIB devices on top of that. As I've said, I would like to see netdev review that idea before we merge any patches.. There are pragmatic downsides to the 8 byte choice: Userspace completely looses the ability to size the address without a table based on link type. That is terrible in the context of netlink's design. Well, let's just be clear: netlink/iproute2 screwed the pooch on their implementation of this stuff. Any solution that doesn't include fixing this up in some way is not really a good solution. For instance iproute2 would need IB specific code to format the 'ip link show' (review print_vfinfo in iproute2) and to length check 'ip link set vf mac' If we do use 8, then it would be ideal (and my strong preference) to also fix the IFLA_VF_MAC message to have a working length. I think that could be done compatibly with a bit of work. At least that way iproute2 can be kept clean when it learns to do IB, and we could have the option again of using 20 someday if we need. So to be clear, to go with the 8 byte option I suggest: - Engage netdev/iproute and confirm they are philosophically OK with IFLA_VF_MAC != IFLA_ADDRESS - Make a kernel patch to properly size the IFLA_VF_MAC message - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat) - Drop in the IB patch Sounds like a reasonable plan. Or, this is your patch set, are you going to pick up these action items? -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: out of order packets when querying all ports in a switch through OpenSM
The switches are unmanaged. Is it possible to ensure in-order packets with managed switches? Jesus On Wed, May 27, 2015 at 4:30 PM, Hal Rosenstock h...@dev.mellanox.co.il wrote: On 5/27/2015 10:16 AM, Jesus Camacho Villanueva wrote: Hi Hal, Thanks for your quick response :) What you say here makes sense. I have seen this behaviour in one of the switches, but I don't discard that this problem can arise in other switches. Note that for MADs, transaction ID is looser than traditional transaction ID semantics in that the combination of TID, SGID, and MgmtClass is different from that of any currently executing operation and does not imply ordering. If there is ordering required, it is the responsibility of the manager to enforce that. The switch is: Infiniscale-IV Mellanox Technologies 4xQDR Is it a managed or unmanaged switch ? -- Hal -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support
On Tue, 2015-05-19 at 16:17 +, Liran Liss wrote: From: Hefty, Sean [mailto:sean.he...@intel.com] these remaining resources may be device-specific. The proposed framework first of all allows a provider to indicate whether hot-removal is supported (i.e., the presence of the 'disassociate_ucontext' callback), and if so, allow the provider to perform the proper cleanup so that the corresponding user-space driver will continue to function. The approach seems strange. The driver knows that it is being removed. It was informed up front which open contexts were associated with user space processes. But the driver calls up to indicate that it is being removed, so that the upper layer can call back down to tell the driver to process the removal. I wasn't asking what this series did. I was asking why the uverbs driver just can't delete the underlying resources. That's the natural thing to expect. I suppose that the main issue would be handling existing user memory mappings, which cannot be just invalidated -- the user-space driver may not be aware of the device removal and may access this memory concurrently, and we don't want it to crash. In this case, you are mapping it out of the device BAR space and into a random kernel page, yes? So, if the driver doesn't catch the DRIVER_FATAL event and process that to mean don't bother touching this RDMA device any more, it's going to write to a mailbox that no longer responds and have infinite timeouts, yes? Essentially meaning all mailbox type operations just go into lala land from here on out, right? The meaning of these mappings is device specific: some devices only write them, while others read them, expecting some specific format. That's why we need device-specific code for this. While it is true that the device initiates the removal process, the current flow is to let every ib_client first detach itself before device-specific cleanups. In this regard, ib_uverbs is just another client. In addition, the per-open (fd) state is held in ib_ucontext, which is maintained by ib_uverbs. The device driver doesn't hold a duplicate list of open HCA handles, so it relies on ib_uverbs to iterate the relevant contexts and trigger the device-specific stuff. -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: out of order packets when querying all ports in a switch through OpenSM
On 5/27/2015 11:31 AM, Jesus Camacho Villanueva wrote: The switches are unmanaged. Is it possible to ensure in-order packets with managed switches? No; same issue as I previously mentioned exists with both. Unmanaged switches just use firmware whereas managed switches have CPU and kernel in addition to firmware so the system is more complex. -- Hal Jesus On Wed, May 27, 2015 at 4:30 PM, Hal Rosenstock h...@dev.mellanox.co.il wrote: On 5/27/2015 10:16 AM, Jesus Camacho Villanueva wrote: Hi Hal, Thanks for your quick response :) What you say here makes sense. I have seen this behaviour in one of the switches, but I don't discard that this problem can arise in other switches. Note that for MADs, transaction ID is looser than traditional transaction ID semantics in that the combination of TID, SGID, and MgmtClass is different from that of any currently executing operation and does not imply ordering. If there is ordering required, it is the responsibility of the manager to enforce that. The switch is: Infiniscale-IV Mellanox Technologies 4xQDR Is it a managed or unmanaged switch ? -- Hal -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote: On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote: + struct srcu_struct disassociate_srcu; There is no need for rcu for this, use a rw sem. The rcu was used becuase it's on the hot path I assume. Do we have numbers on whether a rwsem vs. an rcu matters performance wise? If the rcu actually helps performance, then I'm inclined to leave it, but if it doesn't, then I'd agree that a rwsem is simpler and easier to deal with. @@ -1326,6 +1327,13 @@ ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file, return -EFAULT; } + /* Taking ref count on uverbs_file to make sure that file won't be +* freed till that event file is closed. It will enable accessing the +* uverbs_device fields as part of closing the events file and making +* sure that uverbs device is available by that time as well. +* Note: similar is already done for the async event file. + */ + kref_get(file-ref); Is this a bug today? It doesn't look like it, but this stuff does look wrong. Woulnd't this would make more sense for ib_uverbs_alloc_event_file to unconditionally grab the kref and unconditionally release it on release? The existing code for this looks broken, in ib_uverbs_get_context all the error paths between ib_uverbs_alloc_event_file and the kref_get(file-ref) are wrong - the will result in fput() which will call ib_uverbs_event_close, which will try to do kref_put and ib_unregister_event_handler - which are no longer paired. [I recommend moving the kref_get and ib_register_event_handler into ib_uverbs_alloc_event_file, so the 'create' and 'destroy' code paths are clearly paired instead of being partially open coded in call sites] Fix all this in a seperate patch to add the needed change in kref semantics please. Seconded. - if (!try_module_get(dev-ib_dev-owner)) { - ret = -ENODEV; + mutex_lock(dev-disassociate_mutex); + if (dev-disassociated) { + ret = -EIO; goto err; } - file = kmalloc(sizeof *file, GFP_KERNEL); + /* In case IB device supports disassociate ucontext, there is no hard +* dependency between uverbs device and its low level device. +*/ + module_dependent = !(dev-flags UVERBS_FLAG_DISASSOCIATE); + + if (module_dependent) { + if (!try_module_get(dev-ib_dev-owner)) { + ret = -ENODEV; + goto err; Again? Why I do I keep pointing this same basic thing to Mellanox people: If you hold a X then you hold the ref to X as well. So, if the core code is holding function pointers to module code, then the core code holds a module ref. When the core code null's those function pointers, then it can release the module ref. Seconded. This might work today like this (I'm not entirely sure), but it makes no sense at all. I'll look more closely in a few weeks once the rwsem change is done. -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.
On Wed, May 27, 2015 at 11:20:05AM +0530, Aneesh Kumar K.V wrote: j.gli...@gmail.com writes: Noted your grammar fixes. diff --git a/mm/Kconfig b/mm/Kconfig index 52ffb86..189e48f 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -653,3 +653,18 @@ config DEFERRED_STRUCT_PAGE_INIT when kswapd starts. This has a potential performance impact on processes running early in the lifetime of the systemm until kswapd finishes the initialisation. + +if STAGING +config HMM + bool Enable heterogeneous memory management (HMM) + depends on MMU + select MMU_NOTIFIER + select GENERIC_PAGE_TABLE What is GENERIC_PAGE_TABLE ? Let over of when patch 0006 what a seperate feature that was introduced before this patch. I failed to remove that chunk. Just ignore it. Cheers, Jérôme -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: out of order packets when querying all ports in a switch through OpenSM
Hi Hal, Thanks for your quick response :) What you say here makes sense. I have seen this behaviour in one of the switches, but I don't discard that this problem can arise in other switches. The switch is: Infiniscale-IV Mellanox Technologies 4xQDR Jesus On Wed, May 27, 2015 at 3:55 PM, Hal Rosenstock h...@dev.mellanox.co.il wrote: Hi Jesus, On 5/27/2015 8:56 AM, Jesus Camacho Villanueva wrote: Hello, I am trying to send a port info request to all the ports in a switch, one after the other. I am sending all requests to the same switch in sequence through the same directed routing path: Request port number 1 Request port number 2 Request port number 3 Request port number 4 But sometimes I get responses out of order: Response port number 2 Response port number 1 Response port number 3 Response port number 4 Is it possible that the switch is processing port requests out of order (or maybe I am doing something wrong)? One way this could occur is that request to switch for port number 1 either does not make it to switch or response does not make it back and it times out and is retransmitted by kernel. This depends on how umad is setup in terms of retries and timeout. Is it always the same switch or switch type ? What switch is it ? -- Hal Best regards, Jesus -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: out of order packets when querying all ports in a switch through OpenSM
On 5/27/2015 10:16 AM, Jesus Camacho Villanueva wrote: Hi Hal, Thanks for your quick response :) What you say here makes sense. I have seen this behaviour in one of the switches, but I don't discard that this problem can arise in other switches. Note that for MADs, transaction ID is looser than traditional transaction ID semantics in that the combination of TID, SGID, and MgmtClass is different from that of any currently executing operation and does not imply ordering. If there is ordering required, it is the responsibility of the manager to enforce that. The switch is: Infiniscale-IV Mellanox Technologies 4xQDR Is it a managed or unmanaged switch ? -- Hal Jesus On Wed, May 27, 2015 at 3:55 PM, Hal Rosenstock h...@dev.mellanox.co.il wrote: Hi Jesus, On 5/27/2015 8:56 AM, Jesus Camacho Villanueva wrote: Hello, I am trying to send a port info request to all the ports in a switch, one after the other. I am sending all requests to the same switch in sequence through the same directed routing path: Request port number 1 Request port number 2 Request port number 3 Request port number 4 But sometimes I get responses out of order: Response port number 2 Response port number 1 Response port number 3 Response port number 4 Is it possible that the switch is processing port requests out of order (or maybe I am doing something wrong)? One way this could occur is that request to switch for port number 1 either does not make it to switch or response does not make it back and it times out and is retransmitted by kernel. This depends on how umad is setup in terms of retries and timeout. Is it always the same switch or switch type ? What switch is it ? -- Hal Best regards, Jesus -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
On Tue, 2015-05-26 at 15:11 -0600, Jason Gunthorpe wrote: On Tue, May 26, 2015 at 04:32:58PM -0400, Doug Ledford wrote: Not so much ethernet world as netdevice world. The iproute2 program is used to configure any and all netdevices, ethernet or otherwise. Right now, we can abuse it to do the same here, but it uses the netdevice ndo_ ops, not rtnetlink to accomplish what it does, so we are limited in how we do thing if we want to maintain tool usage. Hmm? iproute2 does it over rtnetlink? Yes, it does. Sorry for being imprecise. It uses the specific netlink packet that maps back to the predefined ndo_ entry point and so you can't use that netlink on anything other than a netdevice with a defined ndo_ entry point. We can't use it, for example, to extend the new RDMA netlink operations that Kaike posted without also making modifications to iproute2 to know about the new netlink. The LLADDR for IPoIB *is* 20 bytes. Truncating it down is *broken userspace*: - DHCP: Not sending the full 20 bytes in the client request means the server cannot unicast the reply. This causes all sorts of problems and is discouraged in the RFCs these days. Reference? The RFCs I've read (4390 - 4361 - 3315) list a number of options (three at the moment), but the LLADDR options all call for using I'm talking about this part from RFC 4390: As described above, the link-layer address is unavailable to the DHCP server because the link-layer address is larger than the chaddr field length. As a result, the server cannot unicast its reply to the client. Therefore, a DHCP client MUST request that the server send a broadcast reply by setting the BROADCAST flag when IPoIB Address Resolution Protocol (ARP) is not possible, i.e., in situations where the client does not know its IP address. AFAIK, nobody ever solved this, and it actually does cause real world problems for cloud stuff as there is limited randomness in the TID. This is the network side of DHCP. a LLADDR from a device that is a permanent part of the machine (not common with add in cards), so the option most commonly used in IB is option 2, DUID Assigned by Vendor, aka GUID. According to that, truncating to 8 bytes is precisely what you are supposed to do. And, at least in all current Red Hat products, that's exactly how dhcp client creates the client-id. Using the GUID as the client-id is sort of OK from a policy perspective (ie what IP should I use), but it doesn't help the network side, and it breaks down completely when you create child interfaces. Actually, no, it doesn't. My standard test network inside Red Hat uses child interfaces and dhcp exclusively and it all works as expected. This is because each child interface has its own unique pkey and all of the devices include the pkey in the broadcast address, so one child does not see another child's broadcast reply, not the parent's. It does, however, mean that you only want a single link on a given pkey for a given device, which is why adding a second ipoib link without a unique pkey seems so broken to me. Basically, the dhcp server not having the LLADDR at all is a pretty big hack.. No other network I know of runs DHCP like that. That's how they decided to solve the issue in the RFCs, so that's what we have. - ifcfg/udev/networkmanager: So what happens when I do ip link add link ib0 name ib0.1 type ipoib And get two IPoIB interfaces with the same GUID? I doubt any sane user would want to apply the same config to those two interfaces. No, they probably don't want to apply the same rules to both interfaces. I'm not entirely sure I agree with the argument though. I fully expected this to fail without a pkey argument on the ip command line. Does that matter to the above tools? Are they using PKey,GUID as their key? They are pkey aware, yes. There is the parent device, and all non-default pkey devices are listed as a pkey device and given a pkey number and they share the same GUID. They are considered children of the parent, just like with vlan setups. The net stack doesn't allow users to do the same thing with Ethernet devices, so I'm not sure we shouldn't be disallowing this as opposed to creating duplicate devices that are identical in all ways except name. The netstack doesn't allow it for ethernet because it would create a 2nd identical LLADDR, and LLADDRs must be unique. And as far as the configuration scripts (at least on Red Hat) as well as NetworkManager is concerned, the unique requirements are GUID/P_Key. But, the P_Key doesn't show up in the LLADDR, only in the broadcast address, so only the GUID is checked in the LLADDR. All IPoIB devices are either parent devices (meaning no PKEY field specified in the config file) in which case they are the first started and match the GUID, or they are PKEY devices and are started only after their parent
Re: rdma kernel tree
On Wed, May 27, 2015 at 4:13 PM, Doug Ledford dledf...@redhat.com wrote: On Wed, 2015-05-27 at 12:50 +0300, Or Gerlitz wrote: Doug, Do you have everything in place such that what you stage for upstream is subject to linux-next auto merge tests and such? Or. No, I have 0day in place, but not linux-next to my knowledge (unless Stephen already set it up, but I haven't asked him to yet). Oh, please rush to make sure this is in place, Stephen, can you set it up so that the linux-next tag from git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git is added to the linux-next testing please? -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
On Wed, May 27, 2015 at 12:04:29PM -0400, Doug Ledford wrote: On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote: On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote: + struct srcu_struct disassociate_srcu; There is no need for rcu for this, use a rw sem. The rcu was used becuase it's on the hot path I assume. Perhaps, I looked at that a bit, it was used on syscall paths, but that wasn't even the big reason I made the comment.. The use of RCU is *strange*. There is no pointer, there is no call to rcu_derference. Stuff protected by the write lock is being touched without any read side locking (ie ib_uverbs_event_read sure looks funky). There is no Copy or Update phase as far as I can see. It fails two of the tests in Documentation/RCU/checklist.txt. I'm pretty sure what is intended here is that disassociated is actually an unlocked atomic boolean and the RCU is being used (abused?) to create both exclusion and a rendezvous.. I gave up at that point, if RCU is not being used properly, asking for rwsem is a way to force the author to deal with their locking sanely. Jasno -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
On Wed, May 27, 2015 at 10:14:06AM -0400, Doug Ledford wrote: Because the QPN is part of the LLADDR IB can create two interfaces on the same physical port that are completely separated by hardware. Read Haggi's email, he explains how they plan to use this to create interfaces that can be delegated to namespaces. It is not a bad idea really.. Yes, it is actually. The whole reason we went to GUID matching long ago was because of this exact issue. I reflected on this some more last night, and yes, I am leaning toward 'bad idea' direction too. Too much stuff breaks if you create multiple children with the same pkey/guid: - RDMA CM cannot disambiguate CM packets between them - DHCP cannot tell them apart - Net scripts/network manager won't work - IPv6 becomes totally broken That means the namespace stuff will have to create children using GUID aliases.. The *only* way this will ever be a workable item is if we A) reserve a number of queue pairs from the driver specifically for IPoIB use and B) specify which queue pairs go to which IPoIB devices at IPoIB module load This basic idea is exactly why I think we should stick with the 20 byte LLADDR for ILFA_VF_MAC. It gives a route for the PF to tell the VF what QPN to use for IPoIB (if we ever see HW support to implement that) If we use 8 bytes then that route is closed off forever. Not quite, in the 20 byte format the 8 bytes of the GUID are in the last 8/20 bytes, so the app would have to place 12 zeros and then the GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID) This is why the question of 'what is ILFA_VF_MAC' is so important, every option presented (MAC,GUID,LLADDR) are incompatible with each other. For Ethernet devices, it's the MAC. The equivalent of MAC on IB is the GUID. I would leave it at that. Yes, both arguments can be made: - Our netlink end point is targetting an IPoIB interface, and the equivelent to an Ethernet MAC in IPoIB language is the LLADDR. - Our netlink interface is targetting the hardware under the IPoIB interface and that MAC equivilent is the GUID IPoIB devices are constructs on top of the GUID/link, and you can have 10 IPoIB interfaces between the parent and children, but we don't need to specify all of those LLADDRs, we just need to give a unique GUID and allow the guest OS to create their own IPoIB devices on top of that. As I've said, I would like to see netdev review that idea before we merge any patches.. There are pragmatic downsides to the 8 byte choice: Userspace completely looses the ability to size the address without a table based on link type. That is terrible in the context of netlink's design. For instance iproute2 would need IB specific code to format the 'ip link show' (review print_vfinfo in iproute2) and to length check 'ip link set vf mac' If we do use 8, then it would be ideal (and my strong preference) to also fix the IFLA_VF_MAC message to have a working length. I think that could be done compatibly with a bit of work. At least that way iproute2 can be kept clean when it learns to do IB, and we could have the option again of using 20 someday if we need. So to be clear, to go with the 8 byte option I suggest: - Engage netdev/iproute and confirm they are philosophically OK with IFLA_VF_MAC != IFLA_ADDRESS - Make a kernel patch to properly size the IFLA_VF_MAC message - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat) - Drop in the IB patch Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device
On Wed, May 27, 2015 at 9:48 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: On Wed, May 27, 2015 at 02:54:12PM +0300, Or Gerlitz wrote: On Wed, May 27, 2015 at 1:07 AM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote: Jason, you (U2 BTW) play really, really hard - refusing to say **one** word on your approach towards the built-in udata mechanism for uverbs which I asked you to comment on. And I asked to see the user space side and you have angrily refused every time. AFAIR I never ever refused to show any piece of code which went under my hands towards Linux to any-one. For future reference, when someone asks a question and you go off on an tangental rant and ignore the question, then that process repeats, still without answering the question - most english speakers would call that refusing to answer the question. It is not looked upon kindly. Jason, It's not that you asked to see the code ala hey, do you happen to have a git with the user space code for people to inspect while doing the review on the kernel part, but rather U2 saying in a definitive manner that posting the user space code should be imposed as pre-requirement to acceptance of the kernel parts. I really didn't Or: First ask: 'We can't really look at the uapi changes here without also seeing the verbs side changes.' (I know others on the list feel the same, so I use 'we') Second ask: 'Lets see the verbs side and I'll let you know.' (.. to your questions based on my review comments ..) Third ask: 'I think we need to have the same policy.' (.. To Yann's point that other kernel communities have a mandatory UAPI policy) Fourth (exasperated) ask: 'Was it not clear? Yann and I asked to see the user space side before reviewing this series further.' I know you are ESL, and I cut you alot of slack, but *come on* - that is incredibly soft language, and certainly not bossing and imposing in a definitive manner a blanket requirement on all patches. Jason, ESL indeed am I, and in that respect, this clarification, even if being tedious to set or read, helps. Still, I'd like to further try and get you from where the bossing thing came: (1) show me the user space code prior to acceptance of the kernel part never was a requirement on this community since the day we were born (Q4/2004) (2) instantly rejecting a usage of a mechanism existing just for that use case since the first year of our life (2005) is something perceived by me as two people (that's the U2) that come and say, game's over, the old, non-functioning boss is gone, new boss (== we) in town and forget about everything you knew before. So you say that is was wrong perception, I hope so. Let's see how Doug see your feedback, namely either as community reviewer feedback or as new rules being set overnight, waiting. This series is (1) simple compared to other stuff being reviewed here nowadays, and (2) has very nice value to latency sensitive applications, so two wins, lets get it done. Advice: You would be well served to spend a bit more time on your emails. I have no idea what 'but rather U2 saying' means, for instance. Sometimes I just guess at what you are trying to say :| point taken That is the price we pay for an inclusive international community, but everyone needs to be careful before starting a flame war based on percived slight in the text and phrasing of a message. email is hard. point taken. In this case (as you can see from my response above) I am not convinced yet that this was false positive. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
On 5/27/2015 7:04 PM, Doug Ledford wrote: On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote: On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote: + struct srcu_struct disassociate_srcu; There is no need for rcu for this, use a rw sem. The rcu was used becuase it's on the hot path I assume. Do we have numbers on whether a rwsem vs. an rcu matters performance wise? If the rcu actually helps performance, then I'm inclined to leave it, but if it doesn't, then I'd agree that a rwsem is simpler and easier to deal with. That's correct, it was chosen from performance reasons to enable parallel commands as part of ib_uverbs_write with minimum synchronization overhead comparing the rwsem. Most of its usage is for read, upon low level device removal there is only one point when synchronize_srcu is used. @@ -1326,6 +1327,13 @@ ssize_t ib_uverbs_create_comp_channel(struct ib_uverbs_file *file, return -EFAULT; } + /* Taking ref count on uverbs_file to make sure that file won't be +* freed till that event file is closed. It will enable accessing the +* uverbs_device fields as part of closing the events file and making +* sure that uverbs device is available by that time as well. +* Note: similar is already done for the async event file. + */ + kref_get(file-ref); Is this a bug today? It doesn't look like it, but this stuff does look wrong. Woulnd't this would make more sense for ib_uverbs_alloc_event_file to unconditionally grab the kref and unconditionally release it on release? The existing code for this looks broken, in ib_uverbs_get_context all the error paths between ib_uverbs_alloc_event_file and the kref_get(file-ref) are wrong - the will result in fput() which will call ib_uverbs_event_close, which will try to do kref_put and ib_unregister_event_handler - which are no longer paired. [I recommend moving the kref_get and ib_register_event_handler into ib_uverbs_alloc_event_file, so the 'create' and 'destroy' code paths are clearly paired instead of being partially open coded in call sites] Fix all this in a seperate patch to add the needed change in kref semantics please. Seconded. OK, will add a separate patch to handle that. - if (!try_module_get(dev-ib_dev-owner)) { - ret = -ENODEV; + mutex_lock(dev-disassociate_mutex); + if (dev-disassociated) { + ret = -EIO; goto err; } - file = kmalloc(sizeof *file, GFP_KERNEL); + /* In case IB device supports disassociate ucontext, there is no hard +* dependency between uverbs device and its low level device. +*/ + module_dependent = !(dev-flags UVERBS_FLAG_DISASSOCIATE); + + if (module_dependent) { + if (!try_module_get(dev-ib_dev-owner)) { + ret = -ENODEV; + goto err; Again? Why I do I keep pointing this same basic thing to Mellanox people: If you hold a X then you hold the ref to X as well. So, if the core code is holding function pointers to module code, then the core code holds a module ref. When the core code null's those function pointers, then it can release the module ref. Seconded. The module get/put mechanism was previously used here and wasn't introduced by that patch. In case the low level driver can be unloaded (e.g. rmmod mlx4_ib) unconditionally to active uverbs clients we should prevent the module dependency by ignoring the get/put mechanism. Hope that it clarifies the usage here. This might work today like this (I'm not entirely sure), but it makes no sense at all. I'll look more closely in a few weeks once the rwsem change is done. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications
On Wed, May 27, 2015 at 8:43 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: On Wed, May 27, 2015 at 12:04:29PM -0400, Doug Ledford wrote: On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote: On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote: + struct srcu_struct disassociate_srcu; There is no need for rcu for this, use a rw sem. The rcu was used becuase it's on the hot path I assume. Perhaps, I looked at that a bit, it was used on syscall paths, but that wasn't even the big reason I made the comment.. Doug, what hot path we have in uverbs?! IB's stack hot path goes from user-space to the HW, right? Or. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rdma kernel tree
On Thu, 2015-05-28 at 11:02 +1000, Stephen Rothwell wrote: Hi Doug, On Wed, 27 May 2015 09:13:29 -0400 Doug Ledford dledf...@redhat.com wrote: Stephen, can you set it up so that the linux-next tag from git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git is added to the linux-next testing please? I haven't fetched tags in the past, only branches (it is a little tricky, but I can work on it) and there is no linux-next tag in that tree anyway (I assume that you mean for-next). Sorry, you are correct, for-next is the right tag. Also, how does this relate to Roland's infiniband tree (git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git#for-next) that is already in linux-next? Roland has stepped down as maintainer. The community selected me to try to pick up where he has left off. Added from today. Thanks for adding your subsystem tree as a participant of linux-next. As you may know, this is not a judgment of your code. The purpose of linux-next is for integration testing and to lower the impact of conflicts between subsystems in the next merge window. You will need to ensure that the patches/commits in your tree/series have been: * submitted under GPL v2 (or later) and include the Contributor's Signed-off-by, * posted to the relevant mailing list, * reviewed by you (or another maintainer of your subsystem tree), * successfully unit tested, and * destined for the current or next Linux merge window. Basically, this should be just what you would send to Linus (or ask him to fetch). It is allowed to be rebased if you deem it necessary. -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: rdma kernel tree
On Thu, 2015-05-28 at 11:40 +1000, Stephen Rothwell wrote: Hi Doug, Roland, On Wed, 27 May 2015 21:14:03 -0400 Doug Ledford dledf...@redhat.com wrote: Also, how does this relate to Roland's infiniband tree (git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git#for-next) that is already in linux-next? Roland has stepped down as maintainer. The community selected me to try to pick up where he has left off. So, I assume that I should remove the infiniband tree from linux-next? Here's the link to Roland's announcement: http://marc.info/?l=linux-rdmam=142860130010529w=2 Based upon his comments there, I would say it's a good bet that he doesn't plan to continue hosting his own tree now that the switch over has happened. But, of course, he can speak for himself. -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
RE: [PATCH for-4.1 10/11] RDMA/ocrdma: Prevent returning failures in the resource destroy path
Hi Jason My reply is delayed since I was on vacation. -Original Message- From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com] Sent: Tuesday, May 19, 2015 11:53 PM To: Selvin Xavier Cc: Doug Ledford; linux-rdma@vger.kernel.org Subject: Re: [PATCH for-4.1 10/11] RDMA/ocrdma: Prevent returning failures in the resource destroy path On Mon, May 18, 2015 at 05:16:14PM +0530, Selvin Xavier wrote: The driver is returning success for other destroy routines also and free the corresponding driver resources. As part of the changes suggested by you, we will change all the destroy routines to return error status for all the FW command failures and avoid freeing the driver resources in case of failure. Since this change need more testing, I will drop this patch from the series for 4.1 RC. I will cut a patch for 4.2 after our testing. Didn't we discuss this already? Destroy routines cannot fail. If the card's firmware fails during destroy then that is an unrecoverable error and your driver and firmware are no longer synchronized. Panic the kernel or do some kind of (synchronous) card-reset recovery. Don't expect the caller to handle this, and don't leak IB core memory when it happens. Got it . Will do the same in all destroy APIs. Jason Thanks, Selvin -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rdma kernel tree
Hi Doug, On Wed, 27 May 2015 09:13:29 -0400 Doug Ledford dledf...@redhat.com wrote: Stephen, can you set it up so that the linux-next tag from git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git is added to the linux-next testing please? I haven't fetched tags in the past, only branches (it is a little tricky, but I can work on it) and there is no linux-next tag in that tree anyway (I assume that you mean for-next). Also, how does this relate to Roland's infiniband tree (git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git#for-next) that is already in linux-next? Added from today. Thanks for adding your subsystem tree as a participant of linux-next. As you may know, this is not a judgment of your code. The purpose of linux-next is for integration testing and to lower the impact of conflicts between subsystems in the next merge window. You will need to ensure that the patches/commits in your tree/series have been: * submitted under GPL v2 (or later) and include the Contributor's Signed-off-by, * posted to the relevant mailing list, * reviewed by you (or another maintainer of your subsystem tree), * successfully unit tested, and * destined for the current or next Linux merge window. Basically, this should be just what you would send to Linus (or ask him to fetch). It is allowed to be rebased if you deem it necessary. -- Cheers, Stephen Rothwell s...@canb.auug.org.au pgpwzMf1pUkqU.pgp Description: OpenPGP digital signature
Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
On Thu, May 28, 2015 at 12:58:59AM +0300, Or Gerlitz wrote: On Wed, May 27, 2015 at 8:53 PM, Doug Ledford dledf...@redhat.com wrote: On Wed, 2015-05-27 at 11:11 -0600, Jason Gunthorpe wrote: Well, let's just be clear: netlink/iproute2 screwed the pooch on their implementation of this stuff. Any solution that doesn't include fixing this up in some way is not really a good solution. For instance iproute2 would need IB specific code to format the 'ip link show' (review print_vfinfo in iproute2) and to length check 'ip link set vf mac' If we do use 8, then it would be ideal (and my strong preference) to also fix the IFLA_VF_MAC message to have a working length. I think that could be done compatibly with a bit of work. At least that way iproute2 can be kept clean when it learns to do IB, and we could have the option again of using 20 someday if we need. Sounds like a reasonable plan. Or, this is your patch set, are you going to pick up these action items? Let see So to be clear, to go with the 8 byte option I suggest: - Engage netdev/iproute and confirm they are philosophically OK with IFLA_VF_MAC != IFLA_ADDRESS the last thing netdev are is having philosophical views, they are very practical (and non-perfect and happy and lively community), the one thing before the last netdev are is caring for the rdma subsystem. If something has to change @ their direct part, we should come with patches. Yes, you need to talk in patches, I will suggest you start with this (totally untested, quickly thrown together): diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 8de36824018d..715b59e9925a 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -1172,7 +1172,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev, nla_nest_cancel(skb, vfinfo); goto nla_put_failure; } - if (nla_put(skb, IFLA_VF_MAC, sizeof(vf_mac), vf_mac) || + if (nla_put(skb, IFLA_VF_MAC, sizeof(u32) + dev-addr_len, vf_mac) || nla_put(skb, IFLA_VF_VLAN, sizeof(vf_vlan), vf_vlan) || nla_put(skb, IFLA_VF_RATE, sizeof(vf_rate), vf_rate) || @@ -1292,7 +1292,8 @@ static const struct nla_policy ifla_vfinfo_policy[IFLA_VF_INFO_MAX+1] = { }; static const struct nla_policy ifla_vf_policy[IFLA_VF_MAX+1] = { - [IFLA_VF_MAC] = { .len = sizeof(struct ifla_vf_mac) }, + [IFLA_VF_MAC] = { .type = NLA_BINARY, + .len = sizeof(struct ifla_vf_mac) }, [IFLA_VF_VLAN] = { .len = sizeof(struct ifla_vf_vlan) }, [IFLA_VF_TX_RATE] = { .len = sizeof(struct ifla_vf_tx_rate) }, [IFLA_VF_SPOOFCHK] = { .len = sizeof(struct ifla_vf_spoofchk) }, @@ -1447,6 +1448,21 @@ static int do_setvfinfo(struct net_device *dev, struct nlattr *attr) switch (nla_type(vf)) { case IFLA_VF_MAC: { struct ifla_vf_mac *ivm; + int sz; + + /* Legacy compatability, old iproute2 will pass in 32 +* bytes for everything, assume it is a 6 byte MAC +* because that is all old iproute really supported. +*/ + sz = nla_len(attr); + if (sz == 36) + sz = 10; + + if (sz != (sizeof(u32) + dev-addr_len)) { + err = -EINVAL; + break; + } + ivm = nla_data(vf); err = -EOPNOTSUPP; if (ops-ndo_set_vf_mac) And make the argument why the above is OK. - Reader side only ever read 6 bytes anyhow, so it doesn't matter that ifla_vf_mac is undersized in the RTA_DATA - Confirm any other users of this (openstack?) you know are similarly OK. - Old write side will always send the full 36 byte struct, so assume 6 bytes. - Justify this as needing to add IB support next which does not use a 6 byte addr_len New code that wants to work with IB will have to use correct lengths. Then based on that iproute2 can be fixed in the two areas I pointed out, doing that at the same time will probably earn some rep. Finally you can make a patch that switches addr_len to something else - that would trigger the philosophical question of 'what actually is a IFLA_VF_MAC', if everyone is OK with changing it then golden. - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat) I was thinking to patch iproute to sense the link type: if eth print six bytes, if ipoib print 8 bytes, simple. This
Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs
On Wed, May 27, 2015 at 8:53 PM, Doug Ledford dledf...@redhat.com wrote: On Wed, 2015-05-27 at 11:11 -0600, Jason Gunthorpe wrote: Well, let's just be clear: netlink/iproute2 screwed the pooch on their implementation of this stuff. Any solution that doesn't include fixing this up in some way is not really a good solution. For instance iproute2 would need IB specific code to format the 'ip link show' (review print_vfinfo in iproute2) and to length check 'ip link set vf mac' If we do use 8, then it would be ideal (and my strong preference) to also fix the IFLA_VF_MAC message to have a working length. I think that could be done compatibly with a bit of work. At least that way iproute2 can be kept clean when it learns to do IB, and we could have the option again of using 20 someday if we need. Sounds like a reasonable plan. Or, this is your patch set, are you going to pick up these action items? Let see So to be clear, to go with the 8 byte option I suggest: - Engage netdev/iproute and confirm they are philosophically OK with IFLA_VF_MAC != IFLA_ADDRESS the last thing netdev are is having philosophical views, they are very practical (and non-perfect and happy and lively community), the one thing before the last netdev are is caring for the rdma subsystem. If something has to change @ their direct part, we should come with patches. - Make a kernel patch to properly size the IFLA_VF_MAC message You mean the below structure which is expected by the kernel after they see the IFLA_VF_MAC NL attribute struct ifla_vf_mac { __u32 vf; __u8 mac[32]; /* MAX_ADDR_LEN */ }; How you thought to patch things such that the size of the address provided by user-space will propagate into the kernel w.o breaking the NL ABI here? Why not just take the eight lower bytes and set them NL -- ipoib -- PF driver - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat) I was thinking to patch iproute to sense the link type: if eth print six bytes, if ipoib print 8 bytes, simple. - Drop in the IB patch sounds good. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device
On Thu, May 28, 2015 at 12:33:58AM +0300, Or Gerlitz wrote: (2) instantly rejecting a usage of a mechanism existing just for that use case since the first year of our life (2005) I didn't reject anything, I said I wanted to see the user side to continue my review. This series is (1) simple compared to other stuff being reviewed here nowadays, What? It is 11 patches, long, introduces several UAPI changes, does not implement a standardized feature, adds new uses of latent kernel functions and exists to support a unique feature of a single hardware vendor that few understand the usecase for. I would describe this as one of the toughest to review series on patchworks right now. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html