[PATCH] qed: Fix potential use-after-free in qed_spq_post()
From: Roland Dreier <rol...@purestorage.com> We need to check if p_ent->comp_mode is QED_SPQ_MODE_EBLOCK before calling qed_spq_add_entry(). The test is fine is the mode is EBLOCK, but if it isn't then qed_spq_add_entry() might kfree(p_ent). Signed-off-by: Roland Dreier <rol...@purestorage.com> --- drivers/net/ethernet/qlogic/qed/qed_spq.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/qlogic/qed/qed_spq.c b/drivers/net/ethernet/qlogic/qed/qed_spq.c index be48d9abd001..3588081b2e27 100644 --- a/drivers/net/ethernet/qlogic/qed/qed_spq.c +++ b/drivers/net/ethernet/qlogic/qed/qed_spq.c @@ -776,6 +776,7 @@ int qed_spq_post(struct qed_hwfn *p_hwfn, int rc = 0; struct qed_spq *p_spq = p_hwfn ? p_hwfn->p_spq : NULL; bool b_ret_ent = true; + bool eblock; if (!p_hwfn) return -EINVAL; @@ -794,6 +795,11 @@ int qed_spq_post(struct qed_hwfn *p_hwfn, if (rc) goto spq_post_fail; + /* Check if entry is in block mode before qed_spq_add_entry, +* which might kfree p_ent. +*/ + eblock = (p_ent->comp_mode == QED_SPQ_MODE_EBLOCK); + /* Add the request to the pending queue */ rc = qed_spq_add_entry(p_hwfn, p_ent, p_ent->priority); if (rc) @@ -811,7 +817,7 @@ int qed_spq_post(struct qed_hwfn *p_hwfn, spin_unlock_bh(_spq->lock); - if (p_ent->comp_mode == QED_SPQ_MODE_EBLOCK) { + if (eblock) { /* For entries in QED BLOCK mode, the completion code cannot * perform the necessary cleanup - if it did, we couldn't * access p_ent here to see whether it's successful or not. -- 2.14.1
Re: [PATCH] PCI: Update ACS quirk for more Intel 10G NICs
> I think the conclusion is that a hard-wired ACS capability is a > positive indication of isolation for a multifunction device, the code > is intended to support this and appears to do so, and Roland was going > to investigate the sightings that inspired this patch in more detail. > Dropping for now is appropriate. Thanks, That's right. I confirmed that the issue we found was due to another PCI quirk. It may make sense to add more 82599 variants to the table, but the X540 and X550 work without a quirk. Sorry for the noise. - R.
Re: [PATCH] PCI: Update ACS quirk for more Intel 10G NICs
> Is there a misunderstanding of the code flow here? We're never setting > EC. In the first code block we're just masking out requested > capabilities where unimplemented capabilities is the same as > implemented + enabled. We're not adding EC to the request, we're > just not removing it based on the implemented capabilities because we > don't interpret it the same way. Thus if someone wants to test a > device for EC, it really needs to implement EC, we cannot assume it > based on lack of support for EC in the ACS capability. As you point > out, nobody really cares about EC yet though. I guess I find the semantics confusing. For every other bit, pci_acs_enabled() returns true if the bit is either enabled or not implemented. For EC, it returns false if the bit is not implemented. It's not clear to me what the use case for checking for PCI_ACS_EC enabled would be. Seems like checking for EC in the capabilities register would be more useful. - R.
Re: [PATCH] PCI: Update ACS quirk for more Intel 10G NICs
> I think that the device implementing ACS and not implementing certain > features that are "must be implemented if..." is a positive indication > that the device does not have that behavior and therefore the ability > to control that behavior is unnecessary. pci_acs_flags_enabled() is > coded with this intention: > > static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags) > { > int pos; > u16 cap, ctrl; > > pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS); > if (!pos) > return false; > > /* > * Except for egress control, capabilities are either required > * or only required if controllable. Features missing from the > * capability field can therefore be assumed as hard-wired enabled. > */ > pci_read_config_word(pdev, pos + PCI_ACS_CAP, ); > acs_flags &= (cap | PCI_ACS_EC); > > > > pci_read_config_word(pdev, pos + PCI_ACS_CTRL, ); > return (ctrl & acs_flags) == acs_flags; > > I'm not sure it makes sense to look for the EC bit in the control register if the capability bit says it is not supported. In fact I think it would violate the PCI spec to set the bit in the control register if it is zero in the capability register - the spec says: Default value of this bit is 0b. Must be hardwired to 0b if the ACS P2P Egress Control functionality is not implemented. I don't understand the reasoning behind forcing the EC bit in commit 83db7e0bdb70. I think the only reason this works is that nowhere in the kernel uses this function to check PCI_ACS_EC. > At least that's my interpretation and how I've indicated to folks at > Intel how things should work for a device that does not support p2p. > Of course this all hinges on having an ACS capability. Without an ACS > capability we must assume any sort of p2p is possible unless we have a > quirk to tell us otherwise. With ACS, the absence of capabilities is a > positive indication that those forms of p2p are not possible (depending > on the spec wording for that capability). If the code doesn't behave > that way, let's fix it. Thanks, Thanks - I'm looking more closely at what goes wrong with X550, since from reading the code it looks like it should work, but people have observed that it doesn't on our system. However the issue might be elsewhere. However I'll send a patch removing that "| PCI_ACS_EC" from the check if you agree - maybe I'm misunderstanding the logic but I don't see how it could work if it ever became relevant. - R.
Re: [PATCH] PCI: Update ACS quirk for more Intel 10G NICs
On Thu, Jul 20, 2017 at 3:15 PM, Alex Williamsonwrote: > Most of the ACS capabilities are worded as "Must be implemented by > devices that implement ..." Shouldn't a hard-wired ACS capability > sufficiently describe that, or is there something wrong with how > they're hard wired? Interesting question. I just looked at what the PCI spec says about the various bits for ACS functions in multi-function devices. Many of the functions are "must not be implemented." Of the ones that are "must be implemented if..." the key is that the if is for devices that support peer-to-peer traffic. I think the Intel NICs are compliant with the spec - they don't support any ACS mechanisms for controlling peer-to-peer traffic, but they also don't implement peer-to-peer traffic. This means that the PCI standard way of knowing that it is safe to assign individual functions does not prove it is safe - but with device-specific knowledge we do know it is safe. Hence a quirk to give that device-specific information to the kernel. - R.
[PATCH] PCI: Update ACS quirk for more Intel 10G NICs
From: Roland Dreier <rol...@purestorage.com> Add one more variant of the 82599 plus the device IDs for X540 and X550 variants. Intel has confirmed that none of these devices does peer-to-peer between functions. The X540 and X550 have added ACS capabilities in their PCI config space, but the ACS control register is hard-wired to 0 for both devices, so we still need the quirk for IOMMU grouping to allow assignment of individual SR-IOV functions. Signed-off-by: Roland Dreier <rol...@purestorage.com> --- drivers/pci/quirks.c | 21 + 1 file changed, 21 insertions(+) diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 6967c6b4cf6b..b939db671326 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -4335,12 +4335,33 @@ static const struct pci_dev_acs_enabled { { PCI_VENDOR_ID_INTEL, 0x1507, pci_quirk_mf_endpoint_acs }, { PCI_VENDOR_ID_INTEL, 0x1514, pci_quirk_mf_endpoint_acs }, { PCI_VENDOR_ID_INTEL, 0x151C, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x1528, pci_quirk_mf_endpoint_acs }, { PCI_VENDOR_ID_INTEL, 0x1529, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x154A, pci_quirk_mf_endpoint_acs }, { PCI_VENDOR_ID_INTEL, 0x152A, pci_quirk_mf_endpoint_acs }, { PCI_VENDOR_ID_INTEL, 0x154D, pci_quirk_mf_endpoint_acs }, { PCI_VENDOR_ID_INTEL, 0x154F, pci_quirk_mf_endpoint_acs }, { PCI_VENDOR_ID_INTEL, 0x1551, pci_quirk_mf_endpoint_acs }, { PCI_VENDOR_ID_INTEL, 0x1558, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x1560, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x1563, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15AA, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15AB, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15AC, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15AD, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15AE, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15B0, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15AB, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15C2, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15C3, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15C4, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15C6, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15C7, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15C8, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15CE, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15E4, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15E5, pci_quirk_mf_endpoint_acs }, + { PCI_VENDOR_ID_INTEL, 0x15D1, pci_quirk_mf_endpoint_acs }, /* 82580 */ { PCI_VENDOR_ID_INTEL, 0x1509, pci_quirk_mf_endpoint_acs }, { PCI_VENDOR_ID_INTEL, 0x150E, pci_quirk_mf_endpoint_acs }, -- 2.11.0
Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation
On Fri, Jul 8, 2016 at 9:51 AM, Jason Gunthorpewrote: > So, it appears, the dst and neigh can be used for all performances cases. > > For the non performance dst == null case, can we just burn cycles and > stuff the daddr in front of the packet at hardheader time, even if we > have to copy? OK, sounds interesting. Unfortunately the scope of this work has gotten to the point where I can't take it on right now. My system is running 4.4.y for now (before struct skb_gso_cb grew) so I think shrinking struct skb_gso_cb to 8 bytes plus changing SKB_SGO_CB_OFFSET to 20 will work for now. Hope someone is able to come up with a real fix before I need to upgrade to 4.10.y... - R.
Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation
On Thu, Jul 7, 2016 at 4:14 PM, Jason Gunthorpewrote: > We have neighbour_priv, and ndo_neigh_construct/destruct now .. > > A first blush that would seem to be enough to let ipoib store the AH > and other path information in the neigh and avoid the cb? At least the > example in clip sure looks like what ipoib needs to do. Do you think those new facilities let us go back to using the neigh and still avoid the issues that led to commit b63b70d87741 ("IPoIB: Use a private hash table for path lookup in xmit path")? - R.
Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation
>> struct skb_gso_cb { >> int mac_offset; >> int encap_level; >> __u16 csum_start; >> }; > This is based on an out-dated version of this struct. The 4.7 RC > kernel has a few more fields that were added to support local checksum > offload for encapsulated frames. Thanks for pointing that out. I hit the perf regression on 4.4.y (stable) and looked at the struct there. I see that latest upstream has changed, and I agree that this struct really can't shrink below 10 bytes. Since IP needs 20 bytes, GSO needs 10 bytes and IPoIB needs 20 bytes, we're 2 bytes over the 48 that are available in cb[]. So this is harder to fix than just changing skb_gso_cb and SKB_SGO_CB_OFFSET unfortunately. >> What is the best way to keep the crash fix but not kill IPoIB performance? > > It seems like what would probably need to happen is to move where the > IPoIB address is stored. I'm not sure the control buffer is really > the best place for it since the cb gets overwritten at various levels, > and storing 20 bytes makes it hard to avoid bumping up against the > size restrictions of the buffer. Seeing as how the IPoIB hwaddr is > generated around the same time we generate the L2 header for the > frame, I wonder if you couldn't get away with using a bit of extra skb > headroom to store it and then use a offset from the MAC header to > access it. An added bonus would be that with a few tricks with > SKB_GSO_CB(skb)->mac_offset you might even be able to set things up so > that you copy the hwaddr when you copy the header for each fragment > instead of having to go and copy the hwaddr out of the cb and clone it > for each frame. Can we assume there are 20 bytes of skb headroom available? What if we're forwarding an skb received on an Ethernet device? The reason we moved to the cb storage is that in the past, trying to hide some data in the actual skb buffer that we don't actually send led to some awkward-at-best code. (As I recall GRO was difficult to handle before commit 936d7de3d736 "IPoIB: Stop lying about hard_header_len and use skb->cb to stash LL addresses") But maybe there's a third way to handle this other than the old way and the skb->cb way. - R.
Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation
On Thu, Jan 7, 2016 at 3:00 AM, Konstantin Khlebnikovwrote: > Or just shift GSO CB and add couple checks like > BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb))); Resurrecting this old thread, because the patch that ultimately went upstream (commit 9207f9d45b0a / net: preserve IP control block during GSO segmentation) causes a huge IPoIB performance regression (to the point of being unusable): https://bugzilla.kernel.org/show_bug.cgi?id=111921 I don't think anyone has explained what goes wrong or why IPoIB works the way it does. The underlying difference that IPoIB has from other drivers is that there are two levels of address resolution. First, normal ARP / ND resolves an IP address to a "hardware" address. The difference is that in IPoIB, the hardware address is an IB GID (plus a QPN, but we can ignore that). To actually send data to that GID, the IPoIB driver has to do a second lookup - it needs to ask the IB subnet manager for a path record that tells it how to reach that GID. In particular this means that "destination address" (as the IP / ARP layer understands it) actually isn't in the packet anywhere - there's nothing like an ethernet header as there is for "normal" network drivers. Instead, the driver stashes the address in skb->cb during hard_header_ops->create() and then looks at it in the xmit routine - this was designed way back around when commit a0417fa3a18a / net: Make qdisc_skb_cb upper size bound explicit. was merged. The expectation was that the part of the cb after sizeof (struct qdisc_skb_cb) would be preserved. The problem with commit 9207f9d45b0a is that GSO operations now access cb after SKB_SGO_CB_OFFSET==32, which lands right in the middle of where IPoIB stashes its hwaddr. It seems that the intent of the commit is to preserve the IP control block - struct inet_skb_parm (and presumably struct inet6_skb_parm) - even when using SKB_GSO_CB(). Seems like both inet_skb_parm and inet6_skb_parm are 20 bytes. IPoIB uses the part of cb after 28 bytes, so if we could squeeze struct skb_gso_cb down to 8 bytes and set SKB_SGO_CB_OFFSET to 20, then everything would work. The struct is struct skb_gso_cb { int mac_offset; int encap_level; __u16 csum_start; }; is it feasible to make encap_level a __u16 (which would make the overall struct exactly 8 bytes)? If I understand this correctly, 64K nested encapsulations seems like quite a bit for a packet... Or, earlier in this thread, having the GSO in ip_output and other gso paths save and restore the IP/IP6 control block was suggested as an alternate approach. I don't know if there are performance implications to that. What is the best way to keep the crash fix but not kill IPoIB performance? Thanks! - R.
[PATCH 3.19 and earlier] fib_rules: Fix dump_rules() not to exit early
From: Roland Dreier <rol...@purestorage.com> Backports of 41fc014332d9 ("fib_rules: fix fib rule dumps across multiple skbs") introduced a regression in "ip rule show" - it ends up dumping the first rule over and over and never exiting, because 3.19 and earlier are missing commit 053c095a82cf ("netlink: make nlmsg_end() and genlmsg_end() void"), so fib_nl_fill_rule() ends up returning skb->len (i.e. > 0) in the success case. Fix this by checking the return code for < 0 instead of != 0. Signed-off-by: Roland Dreier <rol...@purestorage.com> --- Hi, this is needed for all stable trees earlier than 4.0 that have picked up 41fc014332d9; so far looks like at least 3.10.y and 3.14.y have made such releases. net/core/fib_rules.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 627e517077e4..84340a2605ed 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -606,7 +606,7 @@ static int dump_rules(struct sk_buff *skb, struct netlink_callback *cb, err = fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, RTM_NEWRULE, NLM_F_MULTI, ops); - if (err) + if (err < 0) break; skip: idx++; -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] fib_rules: fix fib rule dumps across multiple skbs
On Tue, Sep 22, 2015 at 9:40 PM, Roopa Prabhuwrote: > + err = fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid, > + cb->nlh->nlmsg_seq, RTM_NEWRULE, > + NLM_F_MULTI, ops); > + if (err) FWIW I believe this breaks pre-4.0 stable kernels (unfortunately it just showed up in 3.10.90). In kernels without 053c095a82cf ("netlink: make nlmsg_end() and genlmsg_end() void") then fib_nl_fill_rule() returns a positive value (skb->len) on success, so we break out of the loop here immediately. Symptom is "ip rule show" loops forever printing the first rule. After I finish testing a fix (as trivial as changing to "if (err < 0)") here, I'll send it to -stable guys. - R. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] [PATCH 2.6.25] RDMA/cxgb3: Shift calculation wrong for single sge entries.
Thanks, applied, although I assume based on the Signed-off-by line that you left out a From: Bryan Rosenburg [EMAIL PROTECTED] at the top (to get the authorship in git correctly). RDMA/cxgb3: Shift calculation wrong for single sge entries. BTW, there's no need to duplicate the subject line in the message body, but if you are going to do it, please put a Subject: before it. Otherwise I just have to edit it out by hand to avoid git putting the subject in twice. - R. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RESEND] libertas: Remove unused exports
Any chance of getting this applied? It seems the build is still broken on ia64 at least due to the export of static functions. --- The libertas driver exports a number of symbols with no in-tree users; remove these unused exports. lbs_reset_device() is completely unused, with no callers at all, so remove the function completely. A couple of these unused exported symbols are static, which causes the following build error on ia64 with gcc 4.2.3: drivers/net/wireless/libertas/main.c:1375: error: __ksymtab_lbs_remove_mesh causes a section type conflict drivers/net/wireless/libertas/main.c:1354: error: __ksymtab_lbs_add_mesh causes a section type conflict This patch fixes this problem. I don't have hardware, so this is not run-tested, but I tested the build with CONFIG_LIBERTAS=y CONFIG_LIBERTAS_USB=m CONFIG_LIBERTAS_CS=m CONFIG_LIBERTAS_SDIO=m and there were no problems with undefined symbols. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- diff --git a/drivers/net/wireless/libertas/cmd.c b/drivers/net/wireless/libertas/cmd.c index eab0203..b3c1acb 100644 --- a/drivers/net/wireless/libertas/cmd.c +++ b/drivers/net/wireless/libertas/cmd.c @@ -1040,7 +1040,6 @@ int lbs_mesh_access(struct lbs_private *priv, uint16_t cmd_action, lbs_deb_leave(LBS_DEB_CMD); return ret; } -EXPORT_SYMBOL_GPL(lbs_mesh_access); int lbs_mesh_config(struct lbs_private *priv, uint16_t enable, uint16_t chan) { @@ -1576,7 +1575,6 @@ done: lbs_deb_leave_args(LBS_DEB_HOST, ret %d, ret); return ret; } -EXPORT_SYMBOL_GPL(lbs_prepare_and_send_command); /** * @brief This function allocates the command buffer and link diff --git a/drivers/net/wireless/libertas/decl.h b/drivers/net/wireless/libertas/decl.h index aaacd9b..4e22341 100644 --- a/drivers/net/wireless/libertas/decl.h +++ b/drivers/net/wireless/libertas/decl.h @@ -69,7 +69,6 @@ struct lbs_private *lbs_add_card(void *card, struct device *dmdev); int lbs_remove_card(struct lbs_private *priv); int lbs_start_card(struct lbs_private *priv); int lbs_stop_card(struct lbs_private *priv); -int lbs_reset_device(struct lbs_private *priv); void lbs_host_to_card_done(struct lbs_private *priv); int lbs_update_channel(struct lbs_private *priv); diff --git a/drivers/net/wireless/libertas/main.c b/drivers/net/wireless/libertas/main.c index 84fb49c..1eaf6af 100644 --- a/drivers/net/wireless/libertas/main.c +++ b/drivers/net/wireless/libertas/main.c @@ -1351,8 +1350,6 @@ done: lbs_deb_leave_args(LBS_DEB_MESH, ret %d, ret); return ret; } -EXPORT_SYMBOL_GPL(lbs_add_mesh); - static void lbs_remove_mesh(struct lbs_private *priv) { @@ -1372,7 +1369,6 @@ static void lbs_remove_mesh(struct lbs_private *priv) free_netdev(mesh_dev); lbs_deb_leave(LBS_DEB_MESH); } -EXPORT_SYMBOL_GPL(lbs_remove_mesh); /** * @brief This function finds the CFP in @@ -1458,20 +1454,6 @@ void lbs_interrupt(struct lbs_private *priv) } EXPORT_SYMBOL_GPL(lbs_interrupt); -int lbs_reset_device(struct lbs_private *priv) -{ - int ret; - - lbs_deb_enter(LBS_DEB_MAIN); - ret = lbs_prepare_and_send_command(priv, CMD_802_11_RESET, - CMD_ACT_HALT, 0, 0, NULL); - msleep_interruptible(10); - - lbs_deb_leave_args(LBS_DEB_MAIN, ret %d, ret); - return ret; -} -EXPORT_SYMBOL_GPL(lbs_reset_device); - static int __init lbs_init_module(void) { lbs_deb_enter(LBS_DEB_MAIN); -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] [PATCH 2.6.25] RDMA/cxgb3: Fail loopbackconnections.
how can a static void function return 0? good question... I've fixed the patch in my tree. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] [PATCH 2.6.25] RDMA/cxgb3: Fail loopbackconnections.
Steve, I had to update the patch adding an include and fixing the function declaration (as below)... but how much testing have you done with this?? commit 8704e9a8790cc9e394198663c1c9150c899fb9a2 Author: Steve Wise [EMAIL PROTECTED] Date: Tue Feb 12 16:09:29 2008 -0600 RDMA/cxgb3: Fail loopback connections The cxgb3 HW and driver don't support loopback RDMA connections. So fail any connection attempt where the destination address is local. Signed-off-by: Steve Wise [EMAIL PROTECTED] Signed-off-by: Roland Dreier [EMAIL PROTECTED] diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c index e9a08fa..320f2b6 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_cm.c +++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c @@ -35,6 +35,7 @@ #include linux/skbuff.h #include linux/timer.h #include linux/notifier.h +#include linux/inetdevice.h #include net/neighbour.h #include net/netevent.h @@ -1784,6 +1785,17 @@ err: return err; } +static int is_loopback_dst(struct iw_cm_id *cm_id) +{ + struct net_device *dev; + + dev = ip_dev_find(init_net, cm_id-remote_addr.sin_addr.s_addr); + if (!dev) + return 0; + dev_put(dev); + return 1; +} + int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) { int err = 0; @@ -1791,6 +1803,11 @@ int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param) struct iwch_ep *ep; struct rtable *rt; + if (is_loopback_dst(cm_id)) { + err = -ENOSYS; + goto out; + } + ep = alloc_ep(sizeof(*ep), GFP_KERNEL); if (!ep) { printk(KERN_ERR MOD %s - cannot alloc ep.\n, __FUNCTION__); -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] libertas: Remove unused exports
The libertas driver exports a number of symbols with no in-tree users; remove these unused exports. lbs_reset_device() is completely unused, with no callers at all, so remove the function completely. A couple of these unused exported symbols are static, which causes the following build error on ia64 with gcc 4.2.3: drivers/net/wireless/libertas/main.c:1375: error: __ksymtab_lbs_remove_mesh causes a section type conflict drivers/net/wireless/libertas/main.c:1354: error: __ksymtab_lbs_add_mesh causes a section type conflict This patch fixes this problem. I don't have hardware, so this is not run-tested, but I tested the build with CONFIG_LIBERTAS=y CONFIG_LIBERTAS_USB=m CONFIG_LIBERTAS_CS=m CONFIG_LIBERTAS_SDIO=m and there were no problems with undefined symbols. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- Here's the patch that removes the unused exports (and a few more that I found). diff --git a/drivers/net/wireless/libertas/cmd.c b/drivers/net/wireless/libertas/cmd.c index eab0203..b3c1acb 100644 --- a/drivers/net/wireless/libertas/cmd.c +++ b/drivers/net/wireless/libertas/cmd.c @@ -1040,7 +1040,6 @@ int lbs_mesh_access(struct lbs_private *priv, uint16_t cmd_action, lbs_deb_leave(LBS_DEB_CMD); return ret; } -EXPORT_SYMBOL_GPL(lbs_mesh_access); int lbs_mesh_config(struct lbs_private *priv, uint16_t enable, uint16_t chan) { @@ -1576,7 +1575,6 @@ done: lbs_deb_leave_args(LBS_DEB_HOST, ret %d, ret); return ret; } -EXPORT_SYMBOL_GPL(lbs_prepare_and_send_command); /** * @brief This function allocates the command buffer and link diff --git a/drivers/net/wireless/libertas/decl.h b/drivers/net/wireless/libertas/decl.h index aaacd9b..4e22341 100644 --- a/drivers/net/wireless/libertas/decl.h +++ b/drivers/net/wireless/libertas/decl.h @@ -69,7 +69,6 @@ struct lbs_private *lbs_add_card(void *card, struct device *dmdev); int lbs_remove_card(struct lbs_private *priv); int lbs_start_card(struct lbs_private *priv); int lbs_stop_card(struct lbs_private *priv); -int lbs_reset_device(struct lbs_private *priv); void lbs_host_to_card_done(struct lbs_private *priv); int lbs_update_channel(struct lbs_private *priv); diff --git a/drivers/net/wireless/libertas/main.c b/drivers/net/wireless/libertas/main.c index 84fb49c..1eaf6af 100644 --- a/drivers/net/wireless/libertas/main.c +++ b/drivers/net/wireless/libertas/main.c @@ -1351,8 +1350,6 @@ done: lbs_deb_leave_args(LBS_DEB_MESH, ret %d, ret); return ret; } -EXPORT_SYMBOL_GPL(lbs_add_mesh); - static void lbs_remove_mesh(struct lbs_private *priv) { @@ -1372,7 +1369,6 @@ static void lbs_remove_mesh(struct lbs_private *priv) free_netdev(mesh_dev); lbs_deb_leave(LBS_DEB_MESH); } -EXPORT_SYMBOL_GPL(lbs_remove_mesh); /** * @brief This function finds the CFP in @@ -1458,20 +1454,6 @@ void lbs_interrupt(struct lbs_private *priv) } EXPORT_SYMBOL_GPL(lbs_interrupt); -int lbs_reset_device(struct lbs_private *priv) -{ - int ret; - - lbs_deb_enter(LBS_DEB_MAIN); - ret = lbs_prepare_and_send_command(priv, CMD_802_11_RESET, - CMD_ACT_HALT, 0, 0, NULL); - msleep_interruptible(10); - - lbs_deb_leave_args(LBS_DEB_MAIN, ret %d, ret); - return ret; -} -EXPORT_SYMBOL_GPL(lbs_reset_device); - static int __init lbs_init_module(void) { lbs_deb_enter(LBS_DEB_MAIN); -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] [PATCH 2.6.25] RDMA/cxgb3: Fail loopback connections.
applied, although: +static void is_loopback_dst(struct iw_cm_id *cm_id) +{ +struct net_device *dev; + +dev = ip_dev_find(init_net, cm_id-remote_addr.sin_addr.s_addr); +if (!dev) +return 0; +dev_put(dev); +return 1; +} is there any way this could trigger when it should, like if I'm trying to make a connection from one local device to a different local device (which should work fine)? - R. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mlx4: needs vmalloc.h for vmap()
thanks, just merged the same patch from Olof Johansson. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] libertas: Don't mark exported symbols as static
Marking exported symbols as static causes the following build error on ia64 with gcc 4.2.3: drivers/net/wireless/libertas/main.c:1375: error: __ksymtab_lbs_remove_mesh causes a section type conflict drivers/net/wireless/libertas/main.c:1354: error: __ksymtab_lbs_add_mesh causes a section type conflict Therefore, remove the static marking on lbs_remove_mesh and lbs_add_mesh. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- diff --git a/drivers/net/wireless/libertas/main.c b/drivers/net/wireless/libertas/main.c index 84fb49c..a688ce8 100644 --- a/drivers/net/wireless/libertas/main.c +++ b/drivers/net/wireless/libertas/main.c @@ -253,8 +253,8 @@ static ssize_t lbs_anycast_set(struct device *dev, static int lbs_add_rtap(struct lbs_private *priv); static void lbs_remove_rtap(struct lbs_private *priv); -static int lbs_add_mesh(struct lbs_private *priv); -static void lbs_remove_mesh(struct lbs_private *priv); +int lbs_add_mesh(struct lbs_private *priv); +void lbs_remove_mesh(struct lbs_private *priv); /** @@ -1296,7 +1296,7 @@ EXPORT_SYMBOL_GPL(lbs_stop_card); * @param privA pointer to the struct lbs_private structure * @return 0 if successful, -X otherwise */ -static int lbs_add_mesh(struct lbs_private *priv) +int lbs_add_mesh(struct lbs_private *priv) { struct net_device *mesh_dev = NULL; int ret = 0; @@ -1354,7 +1354,7 @@ done: EXPORT_SYMBOL_GPL(lbs_add_mesh); -static void lbs_remove_mesh(struct lbs_private *priv) +void lbs_remove_mesh(struct lbs_private *priv) { struct net_device *mesh_dev; -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libertas: Don't mark exported symbols as static
Why not pull the exports? they aren't used anywhere in the existing kernel. I'm guessing there's some not-(yet-)merged mesh networking stuff that uses the symbols, but it doesn't matter much to me... - R. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [1/2] Skip empty hash buckets faster in /proc/net/tcp
On a 2GB Core2 system here I see a time cat /proc/net/tcp /dev/null constently dropping from 0.44s to 0.4-0.8s system time with this change. Seems like there must be a typo in either the before or after times you report here? -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] cxgb3: Remove incorrect __devinit annotations
When PCI error recovery was added to cxgb3, a function t3_io_slot_reset() was added. This function can call back into t3_prep_adapter() at any time, so t3_prep_adapter() can no longer be marked __devinit. This patch removes the __devinit annotation from t3_prep_adapter() and all the functions that it calls, which fixes WARNING: drivers/net/cxgb3/built-in.o(.text+0x2427): Section mismatch in reference from the function t3_io_slot_reset() to the function .devinit.text:t3_prep_adapter() Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/net/cxgb3/mc5.c |2 +- drivers/net/cxgb3/sge.c |2 +- drivers/net/cxgb3/t3_hw.c | 22 ++ 3 files changed, 12 insertions(+), 14 deletions(-) diff --git a/drivers/net/cxgb3/mc5.c b/drivers/net/cxgb3/mc5.c index 84c1ffa..4c4d6e8 100644 --- a/drivers/net/cxgb3/mc5.c +++ b/drivers/net/cxgb3/mc5.c @@ -452,7 +452,7 @@ void t3_mc5_intr_handler(struct mc5 *mc5) t3_write_reg(adap, A_MC5_DB_INT_CAUSE, cause); } -void __devinit t3_mc5_prep(struct adapter *adapter, struct mc5 *mc5, int mode) +void t3_mc5_prep(struct adapter *adapter, struct mc5 *mc5, int mode) { #define K * 1024 diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c index cb684d3..9ca8c66 100644 --- a/drivers/net/cxgb3/sge.c +++ b/drivers/net/cxgb3/sge.c @@ -2836,7 +2836,7 @@ void t3_sge_init(struct adapter *adap, struct sge_params *p) * defaults for the assorted SGE parameters, which admins can change until * they are used to initialize the SGE. */ -void __devinit t3_sge_prep(struct adapter *adap, struct sge_params *p) +void t3_sge_prep(struct adapter *adap, struct sge_params *p) { int i; diff --git a/drivers/net/cxgb3/t3_hw.c b/drivers/net/cxgb3/t3_hw.c index 7469935..a99496a 100644 --- a/drivers/net/cxgb3/t3_hw.c +++ b/drivers/net/cxgb3/t3_hw.c @@ -2675,7 +2675,7 @@ void t3_tp_set_max_rxsize(struct adapter *adap, unsigned int size) V_PMMAXXFERLEN0(size) | V_PMMAXXFERLEN1(size)); } -static void __devinit init_mtus(unsigned short mtus[]) +static void init_mtus(unsigned short mtus[]) { /* * See draft-mathis-plpmtud-00.txt for the values. The min is 88 so @@ -2703,7 +2703,7 @@ static void __devinit init_mtus(unsigned short mtus[]) /* * Initial congestion control parameters. */ -static void __devinit init_cong_ctrl(unsigned short *a, unsigned short *b) +static void init_cong_ctrl(unsigned short *a, unsigned short *b) { a[0] = a[1] = a[2] = a[3] = a[4] = a[5] = a[6] = a[7] = a[8] = 1; a[9] = 2; @@ -3354,8 +3354,7 @@ out_err: * Determines a card's PCI mode and associated parameters, such as speed * and width. */ -static void __devinit get_pci_mode(struct adapter *adapter, - struct pci_params *p) +static void get_pci_mode(struct adapter *adapter, struct pci_params *p) { static unsigned short speed_map[] = { 33, 66, 100, 133 }; u32 pci_mode, pcie_cap; @@ -3395,8 +3394,7 @@ static void __devinit get_pci_mode(struct adapter *adapter, * capabilities and default speed/duplex/flow-control/autonegotiation * settings. */ -static void __devinit init_link_config(struct link_config *lc, - unsigned int caps) +static void init_link_config(struct link_config *lc, unsigned int caps) { lc-supported = caps; lc-requested_speed = lc-speed = SPEED_INVALID; @@ -3419,7 +3417,7 @@ static void __devinit init_link_config(struct link_config *lc, * Calculates the size of an MC7 memory in bytes from the value of its * configuration register. */ -static unsigned int __devinit mc7_calc_size(u32 cfg) +static unsigned int mc7_calc_size(u32 cfg) { unsigned int width = G_WIDTH(cfg); unsigned int banks = !!(cfg F_BKS) + 1; @@ -3430,8 +3428,8 @@ static unsigned int __devinit mc7_calc_size(u32 cfg) return MBs 20; } -static void __devinit mc7_prep(struct adapter *adapter, struct mc7 *mc7, - unsigned int base_addr, const char *name) +static void mc7_prep(struct adapter *adapter, struct mc7 *mc7, +unsigned int base_addr, const char *name) { u32 cfg; @@ -3517,7 +3515,7 @@ static int t3_reset_adapter(struct adapter *adapter) return 0; } -static int __devinit init_parity(struct adapter *adap) +static int init_parity(struct adapter *adap) { int i, err, addr; @@ -3552,8 +3550,8 @@ static int __devinit init_parity(struct adapter *adap) * for some adapter tunables, take PHYs out of reset, and initialize the MDIO * interface. */ -int __devinit t3_prep_adapter(struct adapter *adapter, - const struct adapter_info *ai, int reset) +int t3_prep_adapter(struct adapter *adapter, const struct adapter_info *ai, + int reset) { int ret; unsigned int i, j = 0; -- To unsubscribe from
Re: [PATCH 2.6.25] RDMA/cxgb3: Fix the T3A workaround checks.
thanks, applied. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RESEND 3/3] RDMA/cxgb3: Mark qp as privileged based on user capabilities.
thanks, applied 1-3 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH] IPoIB: improve IPv4/IPv6 to IB mcast mapping functions
Any objection to merging the following for 2.6.25? [Rolf -- I think it makes more sense to delete the overwriting of the P_Key in ipoib_multicast.c in this patch rather than later in the series; do you agree?] Thanks, Roland From: Rolf Manderscheid [EMAIL PROTECTED] An IPoIB subnet on an IB fabric that spans multiple IB subnets can't use link-local scope in multicast GIDs. The existing routines that map IP/IPv6 multicast addresses into IB link-level addresses hard-code the scope to link-local, and they also leave the partition key field uninitialised. This patch adds a parameter (the link-level broadcast address) to the mapping routines, allowing them to initialise both the scope and the P_Key appropriately, and fixes up the call sites. The next step will be to add a way to configure the scope for an IPoIB interface. Signed-off-by: Rolf Manderscheid [EMAIL PROTECTED] Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/infiniband/core/cma.c |4 +--- drivers/infiniband/ulp/ipoib/ipoib_multicast.c |4 include/net/if_inet6.h | 11 +++ include/net/ip.h | 10 ++ net/ipv4/arp.c |2 +- net/ipv6/ndisc.c |2 +- 6 files changed, 16 insertions(+), 17 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 312ec74..982836e 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -2610,11 +2610,9 @@ static void cma_set_mgid(struct rdma_id_private *id_priv, /* IPv6 address is an SA assigned MGID. */ memcpy(mgid, sin6-sin6_addr, sizeof *mgid); } else { - ip_ib_mc_map(sin-sin_addr.s_addr, mc_map); + ip_ib_mc_map(sin-sin_addr.s_addr, dev_addr-broadcast, mc_map); if (id_priv-id.ps == RDMA_PS_UDP) mc_map[7] = 0x01; /* Use RDMA CM signature */ - mc_map[8] = ib_addr_get_pkey(dev_addr) 8; - mc_map[9] = (unsigned char) ib_addr_get_pkey(dev_addr); *mgid = *(union ib_gid *) (mc_map + 4); } } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 858ada1..2628339 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -788,10 +788,6 @@ void ipoib_mcast_restart_task(struct work_struct *work) memcpy(mgid.raw, mclist-dmi_addr + 4, sizeof mgid); - /* Add in the P_Key */ - mgid.raw[4] = (priv-pkey 8) 0xff; - mgid.raw[5] = priv-pkey 0xff; - mcast = __ipoib_mcast_find(dev, mgid); if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, mcast-flags)) { struct ipoib_mcast *nmcast; diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h index 448eccb..b24508a 100644 --- a/include/net/if_inet6.h +++ b/include/net/if_inet6.h @@ -269,18 +269,21 @@ static inline void ipv6_arcnet_mc_map(const struct in6_addr *addr, char *buf) buf[0] = 0x00; } -static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf) +static inline void ipv6_ib_mc_map(const struct in6_addr *addr, + const unsigned char *broadcast, char *buf) { + unsigned char scope = broadcast[5] 0xF; + buf[0] = 0;/* Reserved */ buf[1] = 0xff; /* Multicast QPN */ buf[2] = 0xff; buf[3] = 0xff; buf[4] = 0xff; - buf[5] = 0x12; /* link local scope */ + buf[5] = 0x10 | scope; /* scope from broadcast address */ buf[6] = 0x60; /* IPv6 signature */ buf[7] = 0x1b; - buf[8] = 0;/* P_Key */ - buf[9] = 0; + buf[8] = broadcast[8]; /* P_Key */ + buf[9] = broadcast[9]; memcpy(buf + 10, addr-s6_addr + 6, 10); } #endif diff --git a/include/net/ip.h b/include/net/ip.h index 840dd91..50c8889 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -266,20 +266,22 @@ static inline void ip_eth_mc_map(__be32 naddr, char *buf) * Leave P_Key as 0 to be filled in by driver. */ -static inline void ip_ib_mc_map(__be32 naddr, char *buf) +static inline void ip_ib_mc_map(__be32 naddr, const unsigned char *broadcast, char *buf) { __u32 addr; + unsigned char scope = broadcast[5] 0xF; + buf[0] = 0;/* Reserved */ buf[1] = 0xff; /* Multicast QPN */ buf[2] = 0xff; buf[3] = 0xff; addr= ntohl(naddr); buf[4] = 0xff; - buf[5] = 0x12; /* link local scope */ + buf[5] = 0x10 | scope; /* scope from broadcast address */ buf[6] = 0x40; /* IPv4 signature */ buf[7] = 0x1b; - buf[8] = 0;/* P_Key
Bogus commit 70eba18b (make bnx2x select ZLIB_INFLATE)?
Commit 70eba18b (make bnx2x select ZLIB_INFLATE) in Linus's tree seems bogus. As far as I can tell, bnx2x is not upstream yet, and the commit in question actually adds select ZLIB_INFLATE to the TEHUTI config, since there is no BNX2X config option (and also I don't see any reference to zlib in any tehuti files). I assume this is some sort of merge problem that got added to the wrong tree and should be reverted upstream. - R. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH]: fix lro_gen_skb() alignment
- skb = netdev_alloc_skb(lro_mgr-dev, hlen); + skb = netdev_alloc_skb(lro_mgr-dev, hlen + NET_IP_ALIGN); NET_IP_ALIGN should only be used if you're DMAing into the skb head. Otherwise you should say 2. It would be nice to have another macro for that I suppose. It is certainly simple enough to say 2. Thank you for pointing this out. I have attached a patch to do that.. Signed off by: Andrew Gallatin [EMAIL PROTECTED] Isn't the value of 2 ethernet-specific (to round the 14-byte header up to 16)? Given that the rest of the lro code is fairly careful to calculate mac_hdr_len etc it seems as if it would be cleaner to make this independent of the specific L2 being used. (And I plan on using the LRO module for IP-over-InfiniBand so this is not completely theoretical) - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2.6.25 2/2] RDMA/cxgb3: Support 5.0 firmware.
OK, applied 1 and 2... Note: this change requires 5.0 firmware. I assume the change to the cxgb3 FW versions is pending in a net driver change for 2.6.25? - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
Agreed. On first glance, I was intrigued but: 1) Why is everyone so concerned that export symbol space is large? - does it cost cpu or running memory? - does it cause bugs? - or are you just worried about evil modules? 2) These aren't real namespaces - all global names still have to be unique - still have to handle the non-modular build namespace conflicts - there isn't a big problem with conflicting symbols today. Perhaps changing the name from namespace to interface would help? Then a module could have something like MODULE_USE_INTERFACE(foo); and I think that makes it clearer what the advantage of this is: it marks symbols as being part of a certain interface, requires modules that use that interface to declare that use explicitly, and allows reviewers to say Hey why is this code using the scsi interface when it's a webcam driver? - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
This patch allows to export symbols only for specific modules by introducing symbol name spaces. A module name space has a white list of modules that are allowed to import symbols for it; all others can't use the symbols. It adds two new macros: MODULE_NAMESPACE_ALLOW(namespace, module); I definitely like the idea of organizing exported symbols into namespaces. However, I feel like it would make more sense to have something like MODULE_NAMESPACE_IMPORT(namespace); inside the modules that use a namespace. This matches the way C preprocessor #includes, C++ namespaces, perl use, etc. works and so it is probably closer to how programmers think. It does weaken the enforcement of review a little bit, since there are no changes to the site where things are exported to catch, but git makes it easy to sneak that kind of change in anyway. The practical benefits are that you don't end up with stupid patch conflicts caused by one patch adding MODULE_NAMESPACE_ALLOW() for module foo and another patch adding it for module bar at the same place, and that it becomes simpler for people to test or deliver, say, a new TCP congestion module without having to rebuild the whole kernel (which is an especially huge pain for distro kernels). In any case I would make use of this in whatever form it gets merged in -- Mellanox ConnectX hardware can operate simultaneously as an InfiniBand adapter and a 10Gb ethernet NIC, and so my driver is split up into a low-level mlx4_core driver that exports symbols for other mlx4 drivers to use; clearly it only makes sense to export them to other parts of the driver, and in this case the difference between ALLOW and IMPORT semantics is not a big deal. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
Yes, and if a symbol is already used by multiple modules, it's generically useful. And if so, why restrict it to in-tree modules? I agree that we shouldn't make things too hard for out-of-tree modules, but I disagree with your first statement: there clearly is a large class of symbols that are used by multiple modules but which are not generically useful -- they are only useful by a certain small class of modules. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
I agree that we shouldn't make things too hard for out-of-tree modules, but I disagree with your first statement: there clearly is a large class of symbols that are used by multiple modules but which are not generically useful -- they are only useful by a certain small class of modules. If it is so clear, you should be able to easily provide examples? Sure -- Andi's example of symbols required only by TCP congestion modules; the SCSI internals that Christoph wants to mark; the symbols exported by my mlx4_core driver (which I admit are currently only used by the mlx4_ib driver, but which will also be used by at least the ethernet NIC driver for the same hardware). I thought this was already covered repeatedly in the thread and indeed in Andi's code so there was no need to repeat it... - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.
Except C doesn't have namespaces and this mechanism doesn't create them. So this is just complete and utter makework; as I said before, noone's going to confuse all those udp_* functions if they're not in the udp namespace. I don't understand why you're so opposed to organizing the kernel's exported symbols in a more self-documenting way. It seems pretty clear to me that having a mechanism that requires modules to make explicit which (semi-)internal APIs makes reviewing easier, makes it easier to communicate please don't use that API to module authors, and takes at least a small step towards bringing the kernel's exported API under control. What's the real downside? - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2.6.24] RDMA/cxgb3: Set the max_qp_init_rd_atom attribute.
thanks, applied. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/14 v2] nes: eeprom and phy routines
+/* TODO: deal with EEPROM endian issues */ This is pretty scary. Is the driver broken on big-endian systems now? +/* +Everything you wanted to know about CRC algorithms, but were afraid to ask + for fear that errors in your understanding might be detected. Version : 3. etc etc... can all this be replaced with what's in lib/crc32.c? (I hope so) - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 11/14 v2] nes: OpenFabrics kernel verbs
+/** + * nes_post_send + */ +static int nes_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr, +struct ib_send_wr **bad_wr) ... +switch (ib_wr-opcode) { ... +if (ib_wr-num_sge nesdev-nesadapter-max_sge) { +err = -EINVAL; +break; +} ... +default: +/* error */ +err = -EINVAL; +break; looks like if you detect an error while posting a work request, you break out of the switch statement but just continue through the while loop going through the list of work reuqests. Which doesn't seem like it will work very well. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/14 v2] nes: module and device initialization
Thanks Roland. Let me know when you have your branch ready. OK, I pushed out a neteffect branch at git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git This has the driver from your git tree plus some compile fixes and cleanups (added as separate commits, so you can see what I did). If it suits you, let's work against that tree to continue cleaning things up -- you can send me patches or git pull requests to pick up new things. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/14 v2] nes: module and device initialization
OK, a couple quick review comments and a process comment too: - First step in the driver is to kill off a lot of the #ifdefs: +#ifdef IRQF_SHARED The upstream driver really shouldn't have compatibility gunk for older kernels... just make it build against the kernel it's in. +#ifdef OFED_1_2 Same... kernel code shouldn't worry about OFED. +#ifdef CONFIG_PCI_MSI +if (nesdev-msi_enabled) { +pci_disable_msi(pcidev); +} +#endif This can be much simpler, because pci_disable_msi() is always available and is a NOP if the config option is off or MSI is not enabled. So you can just unconditionally do pci_disable_msi(pcidev); +#ifdef NES_NAPI I don't see anything that defines NES_NAPI. I think for the final merge we want a NAPI-only driver (ie no ifdef at all)... is there any performance or other reason to ever build a non-NAPI driver (for a modern kernel)? OK, on a process level, my plan is to pull the current driver into a neteffect branch in my git tree with the intention of merging it for 2.6.25. I'll let you know when that's ready (probably early next week). I'll probably do some cleanups there, and you can send me cleanup/fix patches against that branch any time too. We should try to keep the cycle time short: the interval between the first posting of this driver and the current one was pretty long, and there's a lot of cleanup to do to get ready for the next merge window. Does that plan make sense? - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] [PATCH 14/14 v2] nes: kernel build infrastructure
+ +EXTRA_CFLAGS += -DNES_MINICM I don't see anyplace NES_MINICM is used. Delete this line? + +obj-$(CONFIG_INFINIBAND_NES) += iw_nes.o + +iw_nes-objs := nes.o nes_hw.o nes_nic.o nes_utils.o nes_verbs.o nes_cm.o + Also the file has an extra blank line at the beginning and end. Might as well kill them. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/14 v2] nes: module and device initialization
Thanks... I am kind of overloaded trying to handle the last few things for the 2.6.24 merge window, but I will look at this next week, and I expect we should be able to merge the driver for 2.6.25 unless there are unexpected hangups. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/14 v2] nes: device structures and defines
You are starting off on the wrong foot. ??? +if(!(expr)) { \ + printk(KERN_ERR PFX Assertion failed! %s, %s, %s, line %d\n, \ + #expr, __FILE__, __FUNCTION__, __LINE__); \ +} Use BUG_ON I agree that there's no need to invent a driver-private assertion macro, but (to first order at least) drivers should never use BUG_ON. I don't want some glitch in a network driver that the system could probably survive to be turned into a panic by BUG_ON -- WARN_ON seems infinitely preferable. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/8] [MV643XX_ETH] Remove SHARED_REGS register address bias
+static void __iomem *mv643xx_eth_base; +return readl(((void __iomem *)mv643xx_eth_base) + offset); Given the declaration of mv643xx_eth_base as void __iomem * already, I don't understand why you need the cast to the same type here (and elsewhere in the driver). - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Bonding support for eth1394?
The bonding sources have a few occurrences of EOPNOTSUPP. Unless I missed something, they are all related to setting the hardware address of the interface. AFAICS this is impossible with IP over FireWire. If it is crucial to bonding to be able to change the slaves' hardware addresses, then you are out of luck. There are a few changes to the bonding driver pending that will add support for bonding IP-over-InfiniBand interfaces. IPoIB also cannot change its HW address, so the patches address that issue. Once those patches land, bonding eth1394 interfaces may just work. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/ipoib: Bound the net device to the ipoib_neigh structue
It happens only when ib interfaces are slaves of a bonding device. I thought before that the stuck is in napi_disable() but it's almost right. I put prints before and after call to napi_disable and see that it is called twice. I'll try to investigate in this direction. ib0: stopping interface ib0: before napi_disable ib0: after napi_disable ib0: downing ib_dev ib0: All sends and receives done. ib0: stopping interface ib0: before napi_disable Yes, two napi_disable()s in a row without a matching napi_enable() will deadlock. I guess the question is why the ipoib interface is being stopped twice. If you just take the net-2.6.24 tree (without bonding patches), does bonding for ethernet interfaces work OK, or is there a similar problem with double napi_disable()? How about bonding of ethernet after this batch of bonding patches? - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/ipoib: Bound the net device to the ipoib_neigh structue
I also ran a test for the code in the branch of 2.6.24 and found a problem. I see that ifconfig down doesn't return (for IPoIB interfaces) and it's stuck in napi_disable() in the kernel (any idea why?) For what it's worth, I took the upstream 2.6.23 git tree and merged in Dave's latest net-2.6.24 tree and my latest for-2.6.24 tree and tried that. I brought up an IPoIB interface, sent a few pings, and did ifconfig down, and it worked fine. Can you try the same thing without the bonding patches to see if your setup works OK too? Also can you give more details about what you do to get ifconfig down stuck? - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
I'd say we can probably try to get rid of it in 2.6.25, this is assuming we get driver authors to cooperate and do the conversions or alternatively some other motivated person. I can just threaten to do them all and that should get the driver maintainers going :-) I can definitely kill LLTX for IPoIB by 2.6.25 and I just added it to my TODO list so I don't forget. In fact if 2.6.23 drags on long enough I may do it for 2.6.24 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
Before you add new entries to your list, how is that ibm driver NAPI conversion coming along? :-) I still haven't done much. OK, I will try to get my board booting again this week. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
Before you add new entries to your list, how is that ibm driver NAPI conversion coming along? :-) OK, thanks for the kick in the pants, I have a couple of patches for net-2.6.24 coming (including an unrelated trivial warning fix for IPoIB). - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] ibm_emac: Convert to use napi_struct independent of struct net_device
Commit da3dedd9 ([NET]: Make NAPI polling independent of struct net_device objects.) changed the interface to NAPI polling. Fix up the ibm_emac driver so that it works with this new interface. This is actually a nice cleanup because ibm_emac is one of the drivers that wants to have multiple NAPI structures for a single net_device. Tested with the internal MAC of a PowerPC 440SPe SoC with an AMCC 'Yucca' evaluation board. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/net/ibm_emac/ibm_emac_mal.c | 48 -- drivers/net/ibm_emac/ibm_emac_mal.h |2 +- include/linux/netdevice.h | 10 +++ 3 files changed, 28 insertions(+), 32 deletions(-) diff --git a/drivers/net/ibm_emac/ibm_emac_mal.c b/drivers/net/ibm_emac/ibm_emac_mal.c index cabd984..cc3ddc9 100644 --- a/drivers/net/ibm_emac/ibm_emac_mal.c +++ b/drivers/net/ibm_emac/ibm_emac_mal.c @@ -207,10 +207,10 @@ static irqreturn_t mal_serr(int irq, void *dev_instance) static inline void mal_schedule_poll(struct ibm_ocp_mal *mal) { - if (likely(netif_rx_schedule_prep(mal-poll_dev))) { + if (likely(napi_schedule_prep(mal-napi))) { MAL_DBG2(%d: schedule_poll NL, mal-def-index); mal_disable_eob_irq(mal); - __netif_rx_schedule(mal-poll_dev); + __napi_schedule(mal-napi); } else MAL_DBG2(%d: already in poll NL, mal-def-index); } @@ -273,11 +273,11 @@ static irqreturn_t mal_rxde(int irq, void *dev_instance) return IRQ_HANDLED; } -static int mal_poll(struct net_device *ndev, int *budget) +static int mal_poll(struct napi_struct *napi, int budget) { - struct ibm_ocp_mal *mal = ndev-priv; + struct ibm_ocp_mal *mal = container_of(napi, struct ibm_ocp_mal, napi); struct list_head *l; - int rx_work_limit = min(ndev-quota, *budget), received = 0, done; + int received = 0; MAL_DBG2(%d: poll(%d) %d - NL, mal-def-index, *budget, rx_work_limit); @@ -295,38 +295,34 @@ static int mal_poll(struct net_device *ndev, int *budget) list_for_each(l, mal-poll_list) { struct mal_commac *mc = list_entry(l, struct mal_commac, poll_list); - int n = mc-ops-poll_rx(mc-dev, rx_work_limit); + int n = mc-ops-poll_rx(mc-dev, budget); if (n) { received += n; - rx_work_limit -= n; - if (rx_work_limit = 0) { - done = 0; + budget -= n; + if (budget = 0) goto more_work; // XXX What if this is the last one ? - } } } /* We need to disable IRQs to protect from RXDE IRQ here */ local_irq_disable(); - __netif_rx_complete(ndev); + __napi_complete(napi); mal_enable_eob_irq(mal); local_irq_enable(); - done = 1; - /* Check for rotting packet(s) */ list_for_each(l, mal-poll_list) { struct mal_commac *mc = list_entry(l, struct mal_commac, poll_list); if (unlikely(mc-ops-peek_rx(mc-dev) || mc-rx_stopped)) { MAL_DBG2(%d: rotting packet NL, mal-def-index); - if (netif_rx_reschedule(ndev, received)) + if (napi_reschedule(napi)) mal_disable_eob_irq(mal); else MAL_DBG2(%d: already in poll list NL, mal-def-index); - if (rx_work_limit 0) + if (budget 0) goto again; else goto more_work; @@ -335,12 +331,8 @@ static int mal_poll(struct net_device *ndev, int *budget) } more_work: - ndev-quota -= received; - *budget -= received; - - MAL_DBG2(%d: poll() %d - %d NL, mal-def-index, *budget, -done ? 0 : 1); - return done ? 0 : 1; + MAL_DBG2(%d: poll() %d - %d NL, mal-def-index, budget, received); + return received; } static void mal_reset(struct ibm_ocp_mal *mal) @@ -425,11 +417,8 @@ static int __init mal_probe(struct ocp_device *ocpdev) mal-def = ocpdev-def; INIT_LIST_HEAD(mal-poll_list); - set_bit(__LINK_STATE_START, mal-poll_dev.state); - mal-poll_dev.weight = CONFIG_IBM_EMAC_POLL_WEIGHT; - mal-poll_dev.poll = mal_poll; - mal-poll_dev.priv = mal; - atomic_set(mal-poll_dev.refcnt, 1); + mal-napi.weight = CONFIG_IBM_EMAC_POLL_WEIGHT; + mal-napi.poll = mal_poll; INIT_LIST_HEAD(mal-list); @@ -520,11 +509,8 @@ static void __exit mal_remove(struct ocp_device *ocpdev) MAL_DBG(%d: remove NL, mal-def
[PATCH 3/4] ibm_new_emac: Nuke SET_MODULE_OWNER() use
Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/net/ibm_newemac/core.c |1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/drivers/net/ibm_newemac/core.c b/drivers/net/ibm_newemac/core.c index ce127b9..8ea5009 100644 --- a/drivers/net/ibm_newemac/core.c +++ b/drivers/net/ibm_newemac/core.c @@ -2549,7 +2549,6 @@ static int __devinit emac_probe(struct of_device *ofdev, dev-ndev = ndev; dev-ofdev = ofdev; dev-blist = blist; - SET_MODULE_OWNER(ndev); SET_NETDEV_DEV(ndev, ofdev-dev); /* Initialize some embedded data structures */ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] ibm_emac: Convert to use napi_struct independent of struct net_device
Commit da3dedd9 ([NET]: Make NAPI polling independent of struct net_device objects.) changed the interface to NAPI polling. Fix up the ibm_newemac driver so that it works with this new interface. This is actually a nice cleanup because ibm_newemac is one of the drivers that wants to have multiple NAPI structures for a single net_device. Compile-tested only as I don't have a system that uses the ibm_newemac driver. This conversion the conversion for the ibm_emac driver that was tested on real PowerPC 440SPe hardware. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- drivers/net/ibm_newemac/mal.c | 55 ++-- drivers/net/ibm_newemac/mal.h |2 +- 2 files changed, 20 insertions(+), 37 deletions(-) diff --git a/drivers/net/ibm_newemac/mal.c b/drivers/net/ibm_newemac/mal.c index c4335b7..5885411 100644 --- a/drivers/net/ibm_newemac/mal.c +++ b/drivers/net/ibm_newemac/mal.c @@ -235,10 +235,10 @@ static irqreturn_t mal_serr(int irq, void *dev_instance) static inline void mal_schedule_poll(struct mal_instance *mal) { - if (likely(netif_rx_schedule_prep(mal-poll_dev))) { + if (likely(napi_schedule_prep(mal-napi))) { MAL_DBG2(mal, schedule_poll NL); mal_disable_eob_irq(mal); - __netif_rx_schedule(mal-poll_dev); + __napi_schedule(mal-napi); } else MAL_DBG2(mal, already in poll NL); } @@ -318,8 +318,7 @@ void mal_poll_disable(struct mal_instance *mal, struct mal_commac *commac) msleep(1); /* Synchronize with the MAL NAPI poller. */ - while (test_bit(__LINK_STATE_RX_SCHED, mal-poll_dev.state)) - msleep(1); + napi_disable(mal-napi); } void mal_poll_enable(struct mal_instance *mal, struct mal_commac *commac) @@ -330,11 +329,11 @@ void mal_poll_enable(struct mal_instance *mal, struct mal_commac *commac) // XXX might want to kick a poll now... } -static int mal_poll(struct net_device *ndev, int *budget) +static int mal_poll(struct napi_struct *napi, int budget) { - struct mal_instance *mal = netdev_priv(ndev); + struct mal_instance *mal = container_of(napi, struct mal_instance, napi); struct list_head *l; - int rx_work_limit = min(ndev-quota, *budget), received = 0, done; + int received = 0; unsigned long flags; MAL_DBG2(mal, poll(%d) %d - NL, *budget, @@ -358,26 +357,21 @@ static int mal_poll(struct net_device *ndev, int *budget) int n; if (unlikely(test_bit(MAL_COMMAC_POLL_DISABLED, mc-flags))) continue; - n = mc-ops-poll_rx(mc-dev, rx_work_limit); + n = mc-ops-poll_rx(mc-dev, budget); if (n) { received += n; - rx_work_limit -= n; - if (rx_work_limit = 0) { - done = 0; - // XXX What if this is the last one ? - goto more_work; - } + budget -= n; + if (budget = 0) + goto more_work; // XXX What if this is the last one ? } } /* We need to disable IRQs to protect from RXDE IRQ here */ spin_lock_irqsave(mal-lock, flags); - __netif_rx_complete(ndev); + __napi_complete(napi); mal_enable_eob_irq(mal); spin_unlock_irqrestore(mal-lock, flags); - done = 1; - /* Check for rotting packet(s) */ list_for_each(l, mal-poll_list) { struct mal_commac *mc = @@ -387,12 +381,12 @@ static int mal_poll(struct net_device *ndev, int *budget) if (unlikely(mc-ops-peek_rx(mc-dev) || test_bit(MAL_COMMAC_RX_STOPPED, mc-flags))) { MAL_DBG2(mal, rotting packet NL); - if (netif_rx_reschedule(ndev, received)) + if (napi_reschedule(napi)) mal_disable_eob_irq(mal); else MAL_DBG2(mal, already in poll list NL); - if (rx_work_limit 0) + if (budget 0) goto again; else goto more_work; @@ -401,13 +395,8 @@ static int mal_poll(struct net_device *ndev, int *budget) } more_work: - ndev-quota -= received; - *budget -= received; - - MAL_DBG2(mal, poll() %d - %d NL, *budget, -done ? 0 : 1); - - return done ? 0 : 1; + MAL_DBG2(mal, poll() %d - %d NL, budget, received); + return received; } static void mal_reset(struct mal_instance *mal) @@ -538,11 +527,8 @@ static int __devinit mal_probe(struct of_device *ofdev, } INIT_LIST_HEAD(mal
Re: [ofa-general] [PATCH 4/4] ibm_emac: Convert to use napi_struct independent of struct net_device
Sorry... wrong subject here; it should have been ibm_newemac: ... - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
I tested this by simulating a slow passive side responder, and it worked as expected for those tests. Using an MRA does add another MAD to the CM exchange, which is why it is sent only after seeing a duplicate request. Alternatively, we can take the OFED module parameter patch. What the heck, I added this for 2.6.24. If it doesn't work out we can back it out. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Vague maybe ppp-related panic report for 2.6.23-rc9
I'm running 2.6.23-rc9 on my laptop, and when in a coffee shop I use a Verizon EVDO card to get network access. This is a kyocera device that looks like a serial adapter behind an ohci usb controller, and uses the airprime driver (for usb device 0c88:17da). The actual IP networking is ppp over that serial adapter. The connection is slightly flaky (I blame this on the Verizon coverage), so I end up restarting things every so often. The point of this email (at last) is that I've just seen my second panic (only info: blinking caps lock led, since I've been in X) that seems correlated to stopping/restarting pppd. Sorry for the lack of detail -- I've just switched to running in the console so if I can provoke the crash again I'll get a little more info. I just wanted to mention this in case someone has seen something similar or has any good ideas about how to capture debugging output on a laptop (when I'm not home and have no other boxes handy). - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Vague maybe ppp-related panic report for 2.6.23-rc9
I don't want to jump the gun on the analysis but it just might be the packet sharing fixes Herbert put in a short time ago. What you could do is go back to say rc2 and see if you still get the panics, then bisect from there to narrow it down. If rc2 still gives the panic, it's something else, perhaps device specific. OK thanks. I'm still trying to find a good way to reproduce it, so I'm going to try to get it to happen with -rc9 while running in the console, and at least take a picture of the backtrace. If I find a good way to make it happen then I'll try rc2 and let you know. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] mac80211: Fix TX after monitor interface is converted to managed
Programming with assertions (and BUG_ON is a form of that) is generally a good practice. Almost any book or other source on good programming practices will agree. Yes, it can be overdone. But I don't really think that is the case here, since the check is relatively inexpensive and the consequence should it ever *somehow* happen could be a something wierd (crash, corruption, etc) w/o any other indication of what occured. The problem with BUG_ON is that it kills the whole system. So every time you add a BUG_ON into code, you have to weigh whether the problem you detected is so severe that the right response is to panic. For example, I can see panicking on something fundamental like corrupted page tables. However I would submit that the wireless stack should *never* use BUG_ON -- printing a warning and trying to limp on seems preferable to me. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Vague maybe ppp-related panic report for 2.6.23-rc9
Just as a quick update -- I seem to only be able to reproduce this crash when my ppp session drops, which seems associated with marginal signal. And unfortunately I have great coverage at home so I haven't been able to reproduce this again today. Maybe on the train tomorrow I can crash my laptop... - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
OK -- just to make sure I'm understanding what you're saying: have you confirmed that your proposed [CM MRA] patches actually fix the issue? Not directly. I cannot easily test kernel patches on our larger, production clusters. We've seen the issue with specific applications on 512 and 1024 cores, but I've only been able to test the patch on a 48-core cluster. I have verified that it successfully increases the timeout to where it *should* work, but cannot absolutely confirm that it will fix the problem. I'm unlikely to know that until the production clusters move to an OFED release (1.3?) containing this patch. Umm... this is a difficult situation for me to merge the changes then. We're changing the CM retry behavior blind here. How do we know that the MRA changes don't make the scalability issue worse? - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: tcp bw in 2.6
It would be really great to see numbers with a more recent kernel than 2.6.18 FWIW Debian has binaries for 2.6.21 in testing and for 2.6.22 in unstable so it should be very easy for Larry to try at least those. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] IPoIB: Convert to netdevice internal stats
Use the stats member of struct netdevice in IPoIB, so we can save memory by deleting the stats member of struct ipoib_dev_priv, and save code by deleting ipoib_get_stats(). Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- Dave, can you queue this in net-2.6.24 please? I would ordinarily merge IPoIB changes but since this depends on the netdevice internal stats change it becomes a cross-tree dependency if I try to do that. And I'd like to get it queued in git now before the merge window. Thanks... drivers/infiniband/ulp/ipoib/ipoib.h |2 -- drivers/infiniband/ulp/ipoib/ipoib_cm.c| 20 ++-- drivers/infiniband/ulp/ipoib/ipoib_ib.c| 18 +- drivers/infiniband/ulp/ipoib/ipoib_main.c | 22 +++--- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 10 +- 5 files changed, 31 insertions(+), 41 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 3a6ef14..1e627ee 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -300,8 +300,6 @@ struct ipoib_dev_priv { struct ib_event_handler event_handler; - struct net_device_stats stats; - struct net_device *parent; struct list_head child_intfs; struct list_head list; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 08b4676..1afd93c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -430,7 +430,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) ipoib_dbg(priv, cm recv error (status=%d, wrid=%d vend_err %x)\n, wc-status, wr_id, wc-vendor_err); - ++priv-stats.rx_dropped; + ++dev-stats.rx_dropped; goto repost; } @@ -457,7 +457,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) * this packet and reuse the old buffer. */ ipoib_dbg(priv, failed to allocate receive buffer %d\n, wr_id); - ++priv-stats.rx_dropped; + ++dev-stats.rx_dropped; goto repost; } @@ -474,8 +474,8 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) skb_pull(skb, IPOIB_ENCAP_LEN); dev-last_rx = jiffies; - ++priv-stats.rx_packets; - priv-stats.rx_bytes += skb-len; + ++dev-stats.rx_packets; + dev-stats.rx_bytes += skb-len; skb-dev = dev; /* XXX get correct PACKET_ type here */ @@ -512,8 +512,8 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ if (unlikely(skb-len tx-mtu)) { ipoib_warn(priv, packet len %d ( %d) too long to send, dropping\n, skb-len, tx-mtu); - ++priv-stats.tx_dropped; - ++priv-stats.tx_errors; + ++dev-stats.tx_dropped; + ++dev-stats.tx_errors; ipoib_cm_skb_too_long(dev, skb, tx-mtu - IPOIB_ENCAP_LEN); return; } @@ -532,7 +532,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ tx_req-skb = skb; addr = ib_dma_map_single(priv-ca, skb-data, skb-len, DMA_TO_DEVICE); if (unlikely(ib_dma_mapping_error(priv-ca, addr))) { - ++priv-stats.tx_errors; + ++dev-stats.tx_errors; dev_kfree_skb_any(skb); return; } @@ -542,7 +542,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ if (unlikely(post_send(priv, tx, tx-tx_head (ipoib_sendq_size - 1), addr, skb-len))) { ipoib_warn(priv, post_send failed\n); - ++priv-stats.tx_errors; + ++dev-stats.tx_errors; ib_dma_unmap_single(priv-ca, addr, skb-len, DMA_TO_DEVICE); dev_kfree_skb_any(skb); } else { @@ -580,8 +580,8 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx ib_dma_unmap_single(priv-ca, tx_req-mapping, tx_req-skb-len, DMA_TO_DEVICE); /* FIXME: is this right? Shouldn't we only increment on success? */ - ++priv-stats.tx_packets; - priv-stats.tx_bytes += tx_req-skb-len; + ++dev-stats.tx_packets; + dev-stats.tx_bytes += tx_req-skb-len; dev_kfree_skb_any(tx_req-skb); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index b664b98..1a77e79 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -208,7 +208,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) * this packet and reuse the old buffer
Re: [PATCH] IPoIB: Convert to netdevice internal stats
How is that ibm_emac NAPI conversion coming along? :-) Sorry, trying to reduce my backlog first, but it is still on my list of things to work on :) - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver
Roland, are you comfortable with the IB changes in patches 1 and 2? Yes, they look fine to me. Feel free to apply, with Acked-by: Roland Dreier [EMAIL PROTECTED] - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH V5 2/11] IB/ipoib: Notify the world before doing unregister
The action in bonding to a detach of slave is to unregister the master (see patch 10). This can't be done from the context of unregister_netdevice itself (it is protected by rtnl_lock). I'm confused. Your patch has: +ipoib_slave_detach(cpriv-dev); unregister_netdev(cpriv-dev); And ipoib_slave_detach() is: +static inline void ipoib_slave_detach(struct net_device *dev) +{ +rtnl_lock(); +netdev_slave_detach(dev); +rtnl_unlock(); +} so you are calling netdev_slave_detach() with the rtnl lock held. Why can't you make the same call from the start of unregister_netdevice()? Anyway, if the rtnl lock is a problem, can you just add the call to netdev_slave_detach() to unregister_netdev() before it takes the rtnl lock? - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] bnx2: factor out gzip unpacker
Just change the makefiles to always install gzip'ed modules modutils knows how to unzip them on the fly. But that leaves the uncompressed firmware blobs in .data that ends up in unswappable kernel memory. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V5 2/11] IB/ipoib: Notify the world before doing unregister
+ipoib_slave_detach(cpriv-dev); unregister_netdev(cpriv-dev); Maybe you already answered this before, but I'm still not clear why this notifier call can't just be added to the start of unregister_netdevice(), so we can avoid having driver needing to know anything about bonding internals? - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery.
Roland - can you please queue this up for 2.6.24? Done, thanks. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
One other thing I observed is that I can not unload the module as the ref counter of the eth device is too low. I haven't tracked down the source of this problem yet. I suspect that this is because netif_rx_reschedule() was missing a dev_hold() to match the dev_put() in netif_rx_complete(). Dave merged a fix for that problem yesterday, so current net-2.6.24 should be OK. Let us know if it's not OK, because then there's another bug... - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] [PATCH] RDMA/CMA: Implement rdma_resolve_ip retry enhancement.
Thanks for the patch... If an application is calling rdma_resolve_ip() and a status of -ENODATA is returned from addr_resolve_local/remote(), the timeout mechanism waits until the application's timeout occurs before rechecking the address resolution status; the application will wait until it's full timeout occurs. This case is seen when the work thread call to process_req() is made before the arp packet is processed. I'm having a hard time understanding this changelog. Could you please resend with a description that lets me understand: - What the current behavior is and what is wrong with that; - What the behavior should be; - And how your patch changes the behavior to be correct. This patch is in addition to Steve Wise's neigh_event_send patch to initiate neighbour discovery sent on 9/12/2007. Does this mean it depends on Steve's patch being applied first? Also please try to keep lines in the changelog to 72 characters or so... @@ -136,6 +137,7 @@ static void set_timeout(unsigned long ti static void queue_req(struct addr_req *req) { struct addr_req *temp_req; +unsigned long req_timeout = msecs_to_jiffies(MIN_ADDR_TIMEOUT_MS) + jiffies; mutex_lock(lock); list_for_each_entry_reverse(temp_req, req_list, list) { @@ -145,8 +147,10 @@ static void queue_req(struct addr_req *r list_add(req-list, temp_req-list); -if (req_list.next == req-list) +if (req_list.next == req-list) { +req_timeout = min(req_timeout, req-timeout); set_timeout(req-timeout); +} mutex_unlock(lock); } I don't understand this code. It seems you keep track of the minimum timeout, and then ignore the value you computed. What am I missing? Thanks, Roland - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister
I think that if there are no other comments, I will submit the entire 11 patches again (with changes) to make it easier to merge into the kernel. Since the most of the content in the patch series is in bonding I thought it would be right that Jay will push all the patches to the networking git. Is it OK with you Roland? Yes, that's fine. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] IPoIB: Fix unregister_netdev hang
Thanks for testing on ehca... While using IPoIB over EHCA (rc6 bits), unregister_netdev hangs with I don't think you're actually using rc6 bits, since in your patch you have: -poll_more: and I think that is only in Dave's net-2.6.24 tree now, right? The problem is that the poll handler does netif_rx_complete (which does a dev_put) followed by netif_rx_reschedule() to schedule for more receives (which again does a dev_put). This reduces refcount to 0 (depending on how many times netif_rx_complete followed by netif_rx_reschedule was called). Dave, the real problem seems to be that netif_rx_recschedule() calls __napi_schedule() rather than __netif_rx_schedule(), so it misses the call to dev_hold() that is needed to balance the dev_put() in netif_rx_complete(). The current netif_rx_reschedule() looks like it really should be napi_reschedule(), and we need a new function that takes a netdev too. Or am I misunderstanding the refcounting? I'll send a patch once I've had some breakfast and had a chance to at least compile it... Krishna, unfortunately your proposed fix has a race: -netif_rx_complete(dev, napi); -if (unlikely(ib_req_notify_cq(priv-cq, - IB_CQ_NEXT_COMP | - IB_CQ_REPORT_MISSED_EVENTS)) -netif_rx_reschedule(napi)) -goto poll_more; +if (likely(!ib_req_notify_cq(priv-cq, + IB_CQ_NEXT_COMP | + IB_CQ_REPORT_MISSED_EVENTS))) It is possible for an interrupt to happen immediately right here, before the netif_rx_complete(), so that netif_rx_schedule() gets called while we are still on the poll list. +netif_rx_complete(dev, napi); - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: InfiniBand/RDMA merge plans for 2.6.24
Roland, could you merge the common TX CQ patch please? It actually fixes a real problem. Yes, I will, but it collides with the net-2.6.24 NAPI rework I think, so it may not go in until a few days after the merge window. Have you verified that the patch cures the interrupt overload issues? - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister
Maybe this new notification function should be in net/core/dev.c instead of exporting call_netdevice_notifiers()? Or actually, does it work to add the call to the notifiers directly in unregister_netdev() so that device drivers don't have to worry about it? (And is the existing patch missing a call to notifiers in ipoib_vlan_delete()?) - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.24] Fix documentation for dev_put()/dev_hold()
It looks like the comments for dev_put() and dev_hold() got reversed somehow. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- include/linux/netdevice.h |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index be5fe05..239ae68 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1030,7 +1030,7 @@ extern intnetdev_budget; extern void netdev_run_todo(void); /** - * dev_put - get reference to device + * dev_put - release reference to device * @dev: network device * * Hold reference to device to keep it from being freed. @@ -1041,7 +1041,7 @@ static inline void dev_put(struct net_device *dev) } /** - * dev_hold - release reference to device + * dev_hold - get reference to device * @dev: network device * * Release reference to device to allow it to be freed. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.24] Fix refcounting problem with netif_rx_reschedule()
netif_rx_complete() takes a netdev parameter and does dev_put() on that netdev, so netif_rx_reschedule() needs to also take a netdev parameter and do dev_hold() on it to avoid reference counts from getting becoming negative because of unbalanced dev_put()s. This should fix the problem reported by Krishna Kumar [EMAIL PROTECTED] with IPoIB waiting forever for netdev refcounts to become 0 during module unload. Signed-off-by: Roland Dreier [EMAIL PROTECTED] --- Dave, feel free to roll this up into earlier NAPI conversion patches (assuming I'm understanding things correctly and this patch actually makes sense!). BTW, it looks like drivers/net/ibm_emac/ibm_emac_mal.c would not have built in the current net-2.6.24 tree, since its call to netif_rx_reschedule() was left with the netdev parameter. So that file does not need to be touched in this patch. drivers/infiniband/ulp/ipoib/ipoib_ib.c |2 +- drivers/net/arm/ep93xx_eth.c|2 +- drivers/net/ehea/ehea_main.c|2 +- drivers/net/ibmveth.c |2 +- include/linux/netdevice.h | 21 +++-- 5 files changed, 15 insertions(+), 14 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 6a2bff4..481e4b6 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -320,7 +320,7 @@ poll_more: if (unlikely(ib_req_notify_cq(priv-cq, IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)) - netif_rx_reschedule(napi)) + netif_rx_reschedule(dev, napi)) goto poll_more; } diff --git a/drivers/net/arm/ep93xx_eth.c b/drivers/net/arm/ep93xx_eth.c index f3858d1..7f016f3 100644 --- a/drivers/net/arm/ep93xx_eth.c +++ b/drivers/net/arm/ep93xx_eth.c @@ -309,7 +309,7 @@ poll_some_more: } spin_unlock_irq(ep-rx_lock); - if (more netif_rx_reschedule(napi)) + if (more netif_rx_reschedule(dev, napi)) goto poll_some_more; } diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c index 4a5ab4a..9a499f4 100644 --- a/drivers/net/ehea/ehea_main.c +++ b/drivers/net/ehea/ehea_main.c @@ -636,7 +636,7 @@ static int ehea_poll(struct napi_struct *napi, int budget) if (!cqe !cqe_skb) return 0; - if (!netif_rx_reschedule(napi)) + if (!netif_rx_reschedule(dev, napi)) return 0; } diff --git a/drivers/net/ibmveth.c b/drivers/net/ibmveth.c index b8d7cec..b94f266 100644 --- a/drivers/net/ibmveth.c +++ b/drivers/net/ibmveth.c @@ -973,7 +973,7 @@ static int ibmveth_poll(struct napi_struct *napi, int budget) netif_rx_complete(netdev, napi); if (ibmveth_rxq_pending_buffer(adapter) - netif_rx_reschedule(napi)) { + netif_rx_reschedule(netdev, napi)) { lpar_rc = h_vio_signal(adapter-vdev-unit_address, VIO_IRQ_DISABLE); goto restart_poll; diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index be5fe05..0dbf185 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1198,16 +1198,6 @@ static inline u32 netif_msg_init(int debug_value, int default_msg_enable_bits) return (1 debug_value) - 1; } -/* Try to reschedule poll. Called by dev-poll() after netif_rx_complete(). */ -static inline int netif_rx_reschedule(struct napi_struct *n) -{ - if (napi_schedule_prep(n)) { - __napi_schedule(n); - return 1; - } - return 0; -} - /* Test if receive needs to be scheduled but only if up */ static inline int netif_rx_schedule_prep(struct net_device *dev, struct napi_struct *napi) @@ -1234,6 +1224,17 @@ static inline void netif_rx_schedule(struct net_device *dev, __netif_rx_schedule(dev, napi); } +/* Try to reschedule poll. Called by dev-poll() after netif_rx_complete(). */ +static inline int netif_rx_reschedule(struct net_device *dev, + struct napi_struct *napi) +{ + if (napi_schedule_prep(napi)) { + __netif_rx_schedule(dev, napi); + return 1; + } + return 0; +} + /* same as netif_rx_complete, except that local_irq_save(flags) * has already been issued */ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-2.6.24] Fix refcounting problem with netif_rx_reschedule()
Further complicating things is that you need to setup a ppc32 cross-build environment to even build test a conversion, and I'm not comfortable doing the surgery until I can test build the thing. OK, I actually have a system with a ppc 440 SoC that uses this driver, so I'll try to get things to the stage where I can boot net-2.6.24 on it and see if I can get the driver working... - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] IPoIB: Fix unregister_netdev hang
and I think that is only in Dave's net-2.6.24 tree now, right? Nope, that was what I downloaded yesterday: VERSION = 2 PATCHLEVEL = 6 SUBLEVEL = 23 EXTRAVERSION =-rc6 NAME = Pink Farting Weasel Please double check your tree. I just very carefully looked at my trees, and the poll_more: label is added in commit 6b460a71 ([NET]: Make NAPI polling independent of struct net_device objects.) which is only in the net-2.6.24 tree. Of course Dave did not change the version information in the Makefile since he wouldn't want Linus to pick up any extra strange changes when he pulls, so a net-2.6.24 tree will look like 2.6.23-rc6 as you quoted. And the refcounting bug I fixed is only in net-2.6.24. To be clear, netif_rx_schedule while we are still in the poll list will not do any harm as it does nothing since NAPI_STATE_SCHED is still set (cleared by netif_rx_complete which has not yet run). Effectively we lost/delayed processing an interrupt, if I understood the code right. Right, we lose an interrupt, and since the CQ events are one-shot, we never get another one, and the interface is effectively dead. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: InfiniBand/RDMA merge plans for 2.6.24
IPoIB CM handles this properly by gathering together single pages in skbs' fragment lists. Then can we reuse IPoIB CM code here? Yes, if possible, refactoring things so that the rx skb allocation code becomes common between CM and non-CM would definitely make sense. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
The IGMP enabling patch posted by me on September 2nd isn't on your list http://lists.openfabrics.org/pipermail/general/2007-September/040250.html can you add it? Yes, I lost that somehow. I will add it to my list of things to take a look at (no opinion yet). - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/11] IB/ipoib: Export call to call_netdevice_notifiers and add new private flag
I tried to look at the ipoib stuff in this series... this patch looks fine but it doesn't actually touch ipoib, so the subject line is a bit misleading... - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/11] IB/ipoib: Verify address handle validity on send
Looks fine overall, with one minor nitpick: -if (unlikely(memcmp(neigh-dgid.raw, +if (unlikely((memcmp(neigh-dgid.raw, skb-dst-neighbour-ha + 4, -sizeof(union ib_gid { +sizeof(union ib_gid))) || + (neigh-dev != dev))) { the indentation here makes this confusing to read -- I would just do: } else if (neigh-ah) { if (unlikely(memcmp(neigh-dgid.raw, skb-dst-neighbour-ha + 4, - sizeof(union ib_gid { + sizeof(union ib_gid)) || +neigh-dev != dev)) { - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister
Actually, thinking about this some more... would it be cleaner to more the knowledge about bonding out of the ipoib driver? in other words, export something similar to +static int ipoib_slave_detach(struct net_device *dev) +{ +int ret = 0; +if (dev-flags IFF_SLAVE) { +dev-priv_flags |= IFF_SLAVE_DETACH; +rtnl_lock(); +ret = call_netdevice_notifiers(NETDEV_CHANGE, dev); +rtnl_unlock(); +} +return ret; +} for drivers to use, rather than putting use of IFF_SLAVE and IFF_SLAVE_DETACH outside of the bonding driver. Also it seems this function could return void, since both call sites ignore the return value and I don't see anything sensible that IPoIB could do with the notifier chain return value anyway. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister
OK with me. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 03/11] IB/ipoib: Bound the net device to the ipoib_neigh structue
Overall idea looks good... one comment: +if (n-dev-flags IFF_MASTER) { +/* n-dev is not an IPoIB device and we have +to take priv from elsewhere */ +neigh = *to_ipoib_neigh(n); +if (neigh) { +priv = netdev_priv(neigh-dev); +ipoib_dbg(priv, neigh_destructor for bonding device: %s\n, + n-dev-name); +} else +return; +} seems like it would be cleaner not to worry about bonding here and just use neigh-dev all the time rather than having two ways to look up the device. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister
Conceptually, I see your point and I'm ok with doing it either way. My only question is, would this change would make the ipoib module dependent upon having the bonding module loaded (to resolve all of the symbols)? Yes, I guess so, if that function is in bonding. Hmm, that wouldn't be a good change. Maybe this new notification function should be in net/core/dev.c instead of exporting call_netdevice_notifiers()? - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
Maybe you should automatically create an alias each time new interface is added so that admin would not care about proper aliases? I agree that makes much more sense from a user interface point of view. Unfortunately an alias without an address doesn't make sense, so there doesn't seem to be a way to implement that. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: InfiniBand/RDMA merge plans for 2.6.24
The patch is just needed to pick up broadcast MTU size instead of hard coding 2K right now. SKB allocation shouldn't be different with Ethernet Jambo Frame and IPoIB-CM which 64K MTU. I don't understand why it's different. Could you please explain this? It's exactly the same problem as ethernet jumbo frames. A web search for 'order 1 failure e1000' might be interesting. IPoIB CM handles this properly by gathering together single pages in skbs' fragment lists. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
I've been meaning to track down the bnx2 iscsi offload patch to look and see if this issue is addressed, since the same problem seems to exist: it seems an iscsi connection and a main stack tcp connection might share the same 4-tuple unless something is done to avoid that happening. iSCSI does not do passive listens, only active connections to the target. But you're right, the port space is still shared between iSCSI and the main stack. We currently rely on user apps binding to the main stack to reserve certain ephemeral ports, and telling the iSCSI driver which ports to use. Got it... I wasn't thinking that clearly, but it is clear that a full 4-tuple collision with only active connections is quite unlikely. I guess you would have to make both an offloaded and a non-offloaded iSCSI connection to the same target and get really unlucky with ephemeral port allocation. So in practice I guess it's not an issue at all with your driver yet. However, do you have any plans to support iSCSI offload for targets? Also, looking at the first CNIC patch, I can't help but notice that you seem to have at least some support for iWARP there. How does the CNIC look? Does it share the same interface/addresses as the non-offload NIC, or does it create a completely separate netdevice? I want to make sure that whatever solution we come up with for cxgb3 doesn't cause problems for you. And of course if you have a better idea than what Steve has come up with, that would be great :) - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
InfiniBand/RDMA merge plans for 2.6.24
support. Would be nice to see a review from someone at Mellanox. - Arthur's mthca doorbell alignment fixes. I will experiment with a few different approaches and post what I like (and fix mlx4 as well). I hope Arthur can review. - Michael's mlx4 WQE shrinking patch. Not sure yet; I'll reply to the latest patch directly. Here are a few topics that I believe will not be ready in time for the 2.6.24 window and will need to wait for 2.6.25: - Multiple CQ event vector support. I haven't seen any discussions about how ULPs or userspace apps should decide which vector to use, and hence no progress has been made since we deferred this during the 2.6.23 merge window. - XRC. Given the length of the backlog above and the fact that a first draft of this code has not been posted yet, I don't see any way that we could have something this major ready in time. Here is the complete list of patches I have in my for-2.6.24 branch waiting for the merge window so far. Mostly I haven't merged anything big out of my backlog, so this is essentially all Ali Ayoub (1): IB/sa: Error handling thinko fix Anton Blanchard (3): IB/fmr_pool: Clean up some error messages in fmr_pool.c IB/ehca: Make output clearer by removing some debug messages IB/ehca: Export module parameters in sysfs Dotan Barak (1): mlx4_core: Use enum value GO_BIT_TIMEOUT_MSECS Eli Cohen (2): IPoIB: Fix typo to end statement with ';' instead of ',' IPoIB: Fix error path memory leak Michael S. Tsirkin (2): mlx4_core: Enable MSI-X by default IB/mthca: Enable MSI-X by default Peter Oruba (1): IB/mthca: Use PCI-X/PCI-Express read control interfaces Roland Dreier (6): IPoIB: Make sure no receives are handled when stopping device IB: find_first_zero_bit() takes unsigned pointer mlx4_core: Don't free special QPs in QP number bitmap IB/mlx4: Use set_data_seg() in mlx4_ib_post_recv() IB/ehca: Include linux/mutex.h from ehca_classes.h IB/mlx4: Fix up SRQ limit_watermark endianness Steve Wise (1): RDMA/cxgb3: Make the iw_cxgb3 module parameters writable - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: InfiniBand/RDMA merge plans for 2.6.24
Since ehca can support 4K MTU, we would like to see a patch in IPoIB to allow link MTU to be up to 4K instead of current 2K for 2.6.24 kernel. The idea is IPoIB link MTU will pick up a return value from SM's default broadcast MTU. This patch should be a small patch, I hope you are OK with this. It's actually not small, since it turns the skb allocation into a 4100-byte buffer, which ends up being more than 1 page usually, which means it fails if memory is fragmented. Anyway given the backlog anything substantial that hasn't been posted already is almost surely going to have to wait until 2.6.25. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
- My user_mad P_Key index support patch. I'll test the ioctl to change to the new mode and merge this I guess, since Hal and Sean have tested this out. I can give this patch a reviewed-by: too, and I will also try to review a couple of the pending ipoib patches. Thanks! - Sean's QoS changes. These look fine at first glance, and I just plan to understand the backwards compatibility story (ie how this works with an old SM) and merge. Anyone who objects let me know. The new QoS fields fall into fields that are currently reserved, which should be ignored by an older SM. I've only tested this against openSM however. That seems OK -- I'm OK with breaking things if an SM is clearly buggy (and not ignoring fields that are defined to be ignored in the spec would certainly be a clear bug to me). This patch was generated in response to an Intel MPI issue. We've seen MPI take several minutes to respond to a connection request during the middle of large application runs. When this happens, the active side times out the connection. In OFED, we added module parameters to adjust the rdma_cm connection timeout on the active side, but I believe that sending an MRA from the passive side is a better solution. OK -- just to make sure I'm understanding what you're saying: have you confirmed that your proposed patches actually fix the issue? - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
I was about to post v2 of my patch to avoid port space collisions with the native stack. Can we get that 2.6.24? It is high priority IMO. I've tried to solicit review on it, but I think folks are reluctant... ;-) I would like to get this in, but I'm still at least a little reluctant, since we would be committing to a user interface that seems a little awkward at best, so I'd like to try and find something better. Just to summarize my understanding: - your patch requires the administration to configure an ethX:iwY alias address to use iwarp. (By the way is there anything other than don't do that that avoids assigning the same address to the iwarp alias and a non-iwarp interface?) - it would be nicer to create the alias automatically, but an alias without an address doesn't make sense. Creating a whole separate net device causes problems because the iwarp stuff still needs to use the main net device to do ARP etc. - so I'm out of better ideas but I still want to push back a little before we commit to something ugly. I've been meaning to track down the bnx2 iscsi offload patch to look and see if this issue is addressed, since the same problem seems to exist: it seems an iscsi connection and a main stack tcp connection might share the same 4-tuple unless something is done to avoid that happening. Also, I think it behooves us to get some agreement on this approach with NetEffect and Kanoj (NetXen?) at least, since their iwarp drivers seem to be imminent. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
Well, if it involves /sharing/ port space with the native stack, i.e. where port 1234 is IB but 1235 is Linux, pretty much all the networking devs have NAK'd that approach AFAICS. Just to be clear, InfiniBand has no problem; the issue is port collisions involving iWARP connections. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts with the host stack.
What's wrong with my suggestion of having the iwarp driver create an iwX interface to go with the normal ethX interface? It seems simpler to me, and there's a somewhat similar precedent with how mac80211 devices create both wlan0 and wmaster0 interfaces. - R. It seemed much more painful for me to implement. :-) I'll look into this, but I think for this to be done, the changes must be in the cxgb3 driver, not the rdma driver, because the guts of the netdev struct are all private to cxgb3. Remember that this interface needs to still do non TCP traffic (like ARP and UDP)... Maybe you have something in mind here that I'm not thinking about? No, I was just spouting off. But the whole create a magic alias seems kind of unfriendly to the user. Maybe as you said, the cxgb3 net driver could create the alias for the iw_cxgb3 driver? - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts with the host stack.
The sysadmin creates for iwarp use only alias interfaces of the form devname:iw* where devname is the native interface name (eg eth0) for the iwarp netdev device. The alias label can be anything starting with iw. The iw immediately after the ':' is the key used by the iwarp driver. What's wrong with my suggestion of having the iwarp driver create an iwX interface to go with the normal ethX interface? It seems simpler to me, and there's a somewhat similar precedent with how mac80211 devices create both wlan0 and wmaster0 interfaces. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
Sorry for the long latency, I was at the beach all last week. And direct data placement really does give you a factor of two at least, because otherwise you're stuck receiving the data in one buffer, looking at some of the data at least, and then figuring out where to copy it. And memory bandwidth is if anything becoming more valuable; maybe LRO + header splitting + page remapping tricks can get you somewhere but as NCPUS grows then it seems the TLB shootdown cost of page flipping is only going to get worse. As Herbert has said already, people can code for this just like they have to code for RDMA. No argument, you need to change the interface to take advantage of RDMA. There is no fundamental difference from converting an application to sendfile or similar. Yes, on the transmit side, there's not much difference from sendfile or splice, although RDMA may give a slightly nicer interface that also gives basically the equivalent of AIO. The only thing this needs is a recvmsg_I_dont_care_where_the_data_is() call. There are no alignment issues unless you are trying to push this data directly into the page cache. I don't understand how this gives you the same thing as direct data placement (DDP). There are many situations where the sender knows where the data has to go and if there's some way to pass that to the receiver, so that info can be used in the receive path to put the data in the right place, the receiver can save a copy. This is fundamentally the same offload that an FC HBA does -- the SCSI midlayer queues up commands like read block A and put the data at address X and read block B and put the data at address Y and the HBA matches tags on incoming data to put the blocks at the right addresses, even if block B is received before block A. RFC 4297 has some discussion of the various approaches, and while you might not agree with their conclusions, it is interesting reading. Couple this with a card that makes sure that on a per-page basis, only data for a particular flow (or group of flows) will accumulate. It seems that the NIC would also have to look into a TCP stream (and handle out of order segments etc) to find message boundaries for this to be equivalent to what an RDMA NIC does. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2.6.23 RESEND] cxgb3 - Fix dev-priv usage
I take that back. Rejected -- it breaks infiniband build. To be more precise: drivers/infiniband/hw/cxgb3/cxio_hal.c: In function 'cxio_rdev_open': drivers/infiniband/hw/cxgb3/cxio_hal.c:919: error: implicit declaration of function 'T3CDEV' it seems the problem is that T3CDEV() has been deleted and been replaced with the dev2t3cdev() inline function. However a simple replacement s/T3CDEV/dev2t3cdev/ in drivers/infiniband/hw/cxgb3 doesn't work because the function has moved from t3cdev.h to adapter.h; and moving the function back to t3cdev.h doesn't work because it depends on more structure definitions now. And at that point I gave up... - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2.6.23 0/2] cxgb3 - Fix dev-priv usage
Looks OK to me but I would just roll up the second patch into the first patch and let Jeff merge it as one commit. There's no point in creating an intermediate tree that doesn't build -- it just breaks git bisect for no useful purpose. Also as a side note, when trying to test this I got the message could not load TP SRAM: unable to load t3a_protocol_sram-1.0.44.bin and you guys seem to only have t3b protocol sram images on your web site. Could you send me the t3a file (or swap out my T3A boards for T3B boards ;)? Thanks, Roland - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
[TSO / LRO discussion snipped -- it's not the main point so no sense spending energy arguing about it] Just be realistic and accept that RDMA is a point in time solution, and like any other such technology takes flexibility away from users. Horizontal scaling of cpus up to huge arity cores, network devices using large numbers of transmit and receive queues and classification based queue selection, are all going to work to make things like RDMA even more irrelevant than they already are. To me there is a real fundamental difference between RDMA and traditional SOCK_STREAM / SOCK_DATAGRAM networking, namely that messages can carry the address where they're supposed to be delivered (what the IETF calls direct data placement). And on top of that you can build one-sided operations aka put/get aka RDMA. And direct data placement really does give you a factor of two at least, because otherwise you're stuck receiving the data in one buffer, looking at some of the data at least, and then figuring out where to copy it. And memory bandwidth is if anything becoming more valuable; maybe LRO + header splitting + page remapping tricks can get you somewhere but as NCPUS grows then it seems the TLB shootdown cost of page flipping is only going to get worse. Don't get too hung up on the fact that current iWARP (RDMA over IP) implementations are using TCP offload -- to me that is just a side effect of doing enough processing on the NIC side of the PCI bus to be able to do direct data placement. InfiniBand with competely different transport, link and physical layers is one way to implement RDMA without TCP offload and I'm sure there will be others -- eg Intel's IOAT stuff could probably evolve to the point where you could implement iWARP with software TCP and the data placement offloaded to some DMA engine. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
Isn't RDMA _part_ of the software net stack within Linux? It very much is not so. This is just nit-picking. You can draw the boundary of the software net stack wherever you want, but I think Sean's point was just that RDMA drivers already are part of Linux, and we all want them to get better. When using RDMA you lose the capability to do packet shaping, classification, and all the other wonderful networking facilities you've grown to love and use over the years. Same thing with TSO and LRO and who knows what else. I know you're going to make a distinction between stateless and stateful offloads, but really it's just an arbitrary distinction between things you like and things you don't. Imagine if you didn't know any of this, you purchase and begin to deploy a huge piece of RDMA infrastructure, you then get the mandate from IT that you need to add firewalling on the RDMA connections at the host level, and oh shit you can't? It's ironic that you bring up firewalling. I've had vendors of iWARP hardware tell me they would *love* to work with the community to make firewalling work better for RDMA connections. But instead we get the catch-22 of your changing arguments -- first, you won't even consider changes that might help RDMA work better in the name of maintainability; then you have to protect poor, ignorant users from accidentally using RDMA because of some problem or another; and then when someone tries to fix some of the problems you mention, it's back to step one. Obviously some decisions have been prejudged here, so I guess this moves to the realm of politics. I have plenty of interesting technical stuff, so I'll leave it to the people with a horse in the race to find ways to twist your arm. - R. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html