[PATCH] qed: Fix potential use-after-free in qed_spq_post()

2018-01-15 Thread Roland Dreier
From: Roland Dreier <rol...@purestorage.com>

We need to check if p_ent->comp_mode is QED_SPQ_MODE_EBLOCK before
calling qed_spq_add_entry().  The test is fine is the mode is EBLOCK,
but if it isn't then qed_spq_add_entry() might kfree(p_ent).

Signed-off-by: Roland Dreier <rol...@purestorage.com>
---
 drivers/net/ethernet/qlogic/qed/qed_spq.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_spq.c 
b/drivers/net/ethernet/qlogic/qed/qed_spq.c
index be48d9abd001..3588081b2e27 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_spq.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_spq.c
@@ -776,6 +776,7 @@ int qed_spq_post(struct qed_hwfn *p_hwfn,
int rc = 0;
struct qed_spq *p_spq = p_hwfn ? p_hwfn->p_spq : NULL;
bool b_ret_ent = true;
+   bool eblock;
 
if (!p_hwfn)
return -EINVAL;
@@ -794,6 +795,11 @@ int qed_spq_post(struct qed_hwfn *p_hwfn,
if (rc)
goto spq_post_fail;
 
+   /* Check if entry is in block mode before qed_spq_add_entry,
+* which might kfree p_ent.
+*/
+   eblock = (p_ent->comp_mode == QED_SPQ_MODE_EBLOCK);
+
/* Add the request to the pending queue */
rc = qed_spq_add_entry(p_hwfn, p_ent, p_ent->priority);
if (rc)
@@ -811,7 +817,7 @@ int qed_spq_post(struct qed_hwfn *p_hwfn,
 
spin_unlock_bh(_spq->lock);
 
-   if (p_ent->comp_mode == QED_SPQ_MODE_EBLOCK) {
+   if (eblock) {
/* For entries in QED BLOCK mode, the completion code cannot
 * perform the necessary cleanup - if it did, we couldn't
 * access p_ent here to see whether it's successful or not.
-- 
2.14.1



Re: [PATCH] PCI: Update ACS quirk for more Intel 10G NICs

2017-08-04 Thread Roland Dreier
> I think the conclusion is that a hard-wired ACS capability is a
> positive indication of isolation for a multifunction device, the code
> is intended to support this and appears to do so, and Roland was going
> to investigate the sightings that inspired this patch in more detail.
> Dropping for now is appropriate.  Thanks,

That's right.  I confirmed that the issue we found was due to another PCI quirk.

It may make sense to add more 82599 variants to the table, but the X540 and X550
work without a quirk.

Sorry for the noise.

 - R.


Re: [PATCH] PCI: Update ACS quirk for more Intel 10G NICs

2017-07-24 Thread Roland Dreier
> Is there a misunderstanding of the code flow here?  We're never setting
> EC.  In the first code block we're just masking out requested
> capabilities where unimplemented capabilities is the same as
> implemented + enabled.  We're not adding EC to the request, we're
> just not removing it based on the implemented capabilities because we
> don't interpret it the same way.  Thus if someone wants to test a
> device for EC, it really needs to implement EC, we cannot assume it
> based on lack of support for EC in the ACS capability.  As you point
> out, nobody really cares about EC yet though.

I guess I find the semantics confusing.  For every other bit,
pci_acs_enabled() returns true if the bit is either enabled or not
implemented.  For EC, it returns false if the bit is not implemented.

It's not clear to me what the use case for checking for PCI_ACS_EC
enabled would be.  Seems like checking for EC in the capabilities
register would be more useful.

 - R.


Re: [PATCH] PCI: Update ACS quirk for more Intel 10G NICs

2017-07-24 Thread Roland Dreier
> I think that the device implementing ACS and not implementing certain
> features that are "must be implemented if..." is a positive indication
> that the device does not have that behavior and therefore the ability
> to control that behavior is unnecessary.  pci_acs_flags_enabled() is
> coded with this intention:
>
> static bool pci_acs_flags_enabled(struct pci_dev *pdev, u16 acs_flags)
> {
> int pos;
> u16 cap, ctrl;
>
> pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
> if (!pos)
> return false;
>
> /*
>  * Except for egress control, capabilities are either required
>  * or only required if controllable.  Features missing from the
>  * capability field can therefore be assumed as hard-wired enabled.
>  */
> pci_read_config_word(pdev, pos + PCI_ACS_CAP, );
> acs_flags &= (cap | PCI_ACS_EC);
>
> 
>
> pci_read_config_word(pdev, pos + PCI_ACS_CTRL, );
> return (ctrl & acs_flags) == acs_flags;
>
> 

I'm not sure it makes sense to look for the EC bit in the control register if
the capability bit says it is not supported.  In fact I think it would
violate the PCI spec to set the bit in the control register if it is zero in the
capability register - the spec says:

  Default value of this bit is 0b. Must be hardwired to 0b if the ACS P2P
  Egress Control functionality is not implemented.

I don't understand the reasoning behind forcing the EC bit in commit
83db7e0bdb70.

I think the only reason this works is that nowhere in the kernel uses
this function to check PCI_ACS_EC.

> At least that's my interpretation and how I've indicated to folks at
> Intel how things should work for a device that does not support p2p.
> Of course this all hinges on having an ACS capability.  Without an ACS
> capability we must assume any sort of p2p is possible unless we have a
> quirk to tell us otherwise.  With ACS, the absence of capabilities is a
> positive indication that those forms of p2p are not possible (depending
> on the spec wording for that capability).  If the code doesn't behave
> that way, let's fix it. Thanks,

Thanks - I'm looking more closely at what goes wrong with X550, since
from reading the code it looks like it should work, but people have observed
that it doesn't on our system.  However the issue might be elsewhere.

However I'll send a patch removing that "| PCI_ACS_EC" from the check if
you agree - maybe I'm misunderstanding the logic but I don't see how it
could work if it ever became relevant.

 - R.


Re: [PATCH] PCI: Update ACS quirk for more Intel 10G NICs

2017-07-20 Thread Roland Dreier
On Thu, Jul 20, 2017 at 3:15 PM, Alex Williamson
 wrote:

> Most of the ACS capabilities are worded as "Must be implemented by
> devices that implement ..."  Shouldn't a hard-wired ACS capability
> sufficiently describe that, or is there something wrong with how
> they're hard wired?

Interesting question.  I just looked at what the PCI spec says about
the various bits for ACS functions in multi-function devices.  Many of
the functions are "must not be implemented."  Of the ones that are
"must be implemented if..." the key is that the if is for devices that
support peer-to-peer traffic.

I think the Intel NICs are compliant with the spec - they don't
support any ACS mechanisms for controlling peer-to-peer traffic, but
they also don't implement peer-to-peer traffic.  This means that the
PCI standard way of knowing that it is safe to assign individual
functions does not prove it is safe - but with device-specific
knowledge we do know it is safe.  Hence a quirk to give that
device-specific information to the kernel.

 - R.


[PATCH] PCI: Update ACS quirk for more Intel 10G NICs

2017-07-20 Thread Roland Dreier
From: Roland Dreier <rol...@purestorage.com>

Add one more variant of the 82599 plus the device IDs for X540 and X550
variants.  Intel has confirmed that none of these devices does peer-to-peer
between functions.  The X540 and X550 have added ACS capabilities in their
PCI config space, but the ACS control register is hard-wired to 0 for both
devices, so we still need the quirk for IOMMU grouping to allow assignment
of individual SR-IOV functions.

Signed-off-by: Roland Dreier <rol...@purestorage.com>
---
 drivers/pci/quirks.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 6967c6b4cf6b..b939db671326 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4335,12 +4335,33 @@ static const struct pci_dev_acs_enabled {
{ PCI_VENDOR_ID_INTEL, 0x1507, pci_quirk_mf_endpoint_acs },
{ PCI_VENDOR_ID_INTEL, 0x1514, pci_quirk_mf_endpoint_acs },
{ PCI_VENDOR_ID_INTEL, 0x151C, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x1528, pci_quirk_mf_endpoint_acs },
{ PCI_VENDOR_ID_INTEL, 0x1529, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x154A, pci_quirk_mf_endpoint_acs },
{ PCI_VENDOR_ID_INTEL, 0x152A, pci_quirk_mf_endpoint_acs },
{ PCI_VENDOR_ID_INTEL, 0x154D, pci_quirk_mf_endpoint_acs },
{ PCI_VENDOR_ID_INTEL, 0x154F, pci_quirk_mf_endpoint_acs },
{ PCI_VENDOR_ID_INTEL, 0x1551, pci_quirk_mf_endpoint_acs },
{ PCI_VENDOR_ID_INTEL, 0x1558, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x1560, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x1563, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15AA, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15AB, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15AC, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15AD, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15AE, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15B0, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15AB, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15C2, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15C3, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15C4, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15C6, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15C7, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15C8, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15CE, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15E4, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15E5, pci_quirk_mf_endpoint_acs },
+   { PCI_VENDOR_ID_INTEL, 0x15D1, pci_quirk_mf_endpoint_acs },
/* 82580 */
{ PCI_VENDOR_ID_INTEL, 0x1509, pci_quirk_mf_endpoint_acs },
{ PCI_VENDOR_ID_INTEL, 0x150E, pci_quirk_mf_endpoint_acs },
-- 
2.11.0



Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-08 Thread Roland Dreier
On Fri, Jul 8, 2016 at 9:51 AM, Jason Gunthorpe
 wrote:
> So, it appears, the dst and neigh can be used for all performances cases.
>
> For the non performance dst == null case, can we just burn cycles and
> stuff the daddr in front of the packet at hardheader time, even if we
> have to copy?

OK, sounds interesting.

Unfortunately the scope of this work has gotten to the point where I
can't take it on right now.  My system is running 4.4.y for now
(before struct skb_gso_cb grew) so I think shrinking struct skb_gso_cb
to 8 bytes plus changing SKB_SGO_CB_OFFSET to 20 will work for now.
Hope someone is able to come up with a real fix before I need to
upgrade to 4.10.y...

 - R.


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-08 Thread Roland Dreier
On Thu, Jul 7, 2016 at 4:14 PM, Jason Gunthorpe
 wrote:
> We have neighbour_priv, and ndo_neigh_construct/destruct now ..
>
> A first blush that would seem to be enough to let ipoib store the AH
> and other path information in the neigh and avoid the cb? At least the
> example in clip sure looks like what ipoib needs to do.

Do you think those new facilities let us go back to using the neigh
and still avoid the issues that led to commit b63b70d87741 ("IPoIB:
Use a private hash table for path lookup in xmit path")?

 - R.


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-07 Thread Roland Dreier
>> struct skb_gso_cb {
>> int mac_offset;
>> int encap_level;
>> __u16   csum_start;
>> };

> This is based on an out-dated version of this struct.  The 4.7 RC
> kernel has a few more fields that were added to support local checksum
> offload for encapsulated frames.

Thanks for pointing that out.  I hit the perf regression on 4.4.y
(stable) and looked at the struct there.  I see that latest upstream
has changed, and I agree that this struct really can't shrink below 10
bytes.

Since IP needs 20 bytes, GSO needs 10 bytes and IPoIB needs 20 bytes,
we're 2 bytes over the 48 that are available in cb[].  So this is
harder to fix than just changing skb_gso_cb and SKB_SGO_CB_OFFSET
unfortunately.

>> What is the best way to keep the crash fix but not kill IPoIB performance?
>
> It seems like what would probably need to happen is to move where the
> IPoIB address is stored.  I'm not sure the control buffer is really
> the best place for it since the cb gets overwritten at various levels,
> and storing 20 bytes makes it hard to avoid bumping up against the
> size restrictions of the buffer.  Seeing as how the IPoIB hwaddr is
> generated around the same time we generate the L2 header for the
> frame, I wonder if you couldn't get away with using a bit of extra skb
> headroom to store it and then use a offset from the MAC header to
> access it.  An added bonus would be that with a few tricks with
> SKB_GSO_CB(skb)->mac_offset you might even be able to set things up so
> that you copy the hwaddr when you copy the header for each fragment
> instead of having to go and copy the hwaddr out of the cb and clone it
> for each frame.

Can we assume there are 20 bytes of skb headroom available?  What if
we're forwarding an skb received on an Ethernet device?

The reason we moved to the cb storage is that in the past, trying to
hide some data in the actual skb buffer that we don't actually send
led to some awkward-at-best code.  (As I recall GRO was difficult to
handle before commit 936d7de3d736 "IPoIB: Stop lying about
hard_header_len and use skb->cb to stash LL addresses")  But maybe
there's a third way to handle this other than the old way and the
skb->cb way.

 - R.


Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-07 Thread Roland Dreier
On Thu, Jan 7, 2016 at 3:00 AM, Konstantin Khlebnikov  wrote:
> Or just shift GSO CB and add couple checks like
> BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb)));

Resurrecting this old thread, because the patch that ultimately went
upstream (commit 9207f9d45b0a / net: preserve IP control block during
GSO segmentation) causes a huge IPoIB performance regression (to the
point of being unusable):
https://bugzilla.kernel.org/show_bug.cgi?id=111921

I don't think anyone has explained what goes wrong or why IPoIB works
the way it does.  The underlying difference that IPoIB has from other
drivers is that there are two levels of address resolution.  First,
normal ARP / ND resolves an IP address to a "hardware" address.  The
difference is that in IPoIB, the hardware address is an IB GID (plus a
QPN, but we can ignore that).  To actually send data to that GID, the
IPoIB driver has to do a second lookup - it needs to ask the IB subnet
manager for a path record that tells it how to reach that GID.

In particular this means that "destination address" (as the IP / ARP
layer understands it) actually isn't in the packet anywhere - there's
nothing like an ethernet header as there is for "normal" network
drivers.  Instead, the driver stashes the address in skb->cb during
hard_header_ops->create() and then looks at it in the xmit routine -
this was designed way back around when commit a0417fa3a18a / net: Make
qdisc_skb_cb upper size bound explicit. was merged.  The expectation
was that the part of the cb after sizeof (struct qdisc_skb_cb) would
be preserved.

The problem with commit 9207f9d45b0a is that GSO operations now access
cb after SKB_SGO_CB_OFFSET==32, which lands right in the middle of
where IPoIB stashes its hwaddr.

It seems that the intent of the commit is to preserve the IP control
block - struct inet_skb_parm (and presumably struct inet6_skb_parm) -
even when using SKB_GSO_CB().  Seems like both inet_skb_parm and
inet6_skb_parm are 20 bytes.  IPoIB uses the part of cb after 28
bytes, so if we could squeeze struct skb_gso_cb down to 8 bytes and
set SKB_SGO_CB_OFFSET to 20, then everything would work.  The struct
is

struct skb_gso_cb {
int mac_offset;
int encap_level;
__u16   csum_start;
};

is it feasible to make encap_level a __u16 (which would make the
overall struct exactly 8 bytes)?  If I understand this correctly, 64K
nested encapsulations seems like quite a bit for a packet...

Or, earlier in this thread, having the GSO in ip_output and other gso
paths save and restore the IP/IP6 control block was suggested as an
alternate approach.  I don't know if there are performance
implications to that.

What is the best way to keep the crash fix but not kill IPoIB performance?

Thanks!
 - R.


[PATCH 3.19 and earlier] fib_rules: Fix dump_rules() not to exit early

2015-10-05 Thread Roland Dreier
From: Roland Dreier <rol...@purestorage.com>

Backports of 41fc014332d9 ("fib_rules: fix fib rule dumps across
multiple skbs") introduced a regression in "ip rule show" - it ends up
dumping the first rule over and over and never exiting, because 3.19
and earlier are missing commit 053c095a82cf ("netlink: make
nlmsg_end() and genlmsg_end() void"), so fib_nl_fill_rule() ends up
returning skb->len (i.e. > 0) in the success case.

Fix this by checking the return code for < 0 instead of != 0.

Signed-off-by: Roland Dreier <rol...@purestorage.com>
---
Hi, this is needed for all stable trees earlier than 4.0 that have
picked up 41fc014332d9; so far looks like at least 3.10.y and 3.14.y
have made such releases.

 net/core/fib_rules.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 627e517077e4..84340a2605ed 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -606,7 +606,7 @@ static int dump_rules(struct sk_buff *skb, struct 
netlink_callback *cb,
err = fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
   cb->nlh->nlmsg_seq, RTM_NEWRULE,
   NLM_F_MULTI, ops);
-   if (err)
+   if (err < 0)
break;
 skip:
idx++;
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] fib_rules: fix fib rule dumps across multiple skbs

2015-10-02 Thread Roland Dreier
On Tue, Sep 22, 2015 at 9:40 PM, Roopa Prabhu  wrote:
> +   err = fib_nl_fill_rule(skb, rule, NETLINK_CB(cb->skb).portid,
> +  cb->nlh->nlmsg_seq, RTM_NEWRULE,
> +  NLM_F_MULTI, ops);
> +   if (err)

FWIW I believe this breaks pre-4.0 stable kernels (unfortunately it
just showed up in 3.10.90).  In kernels without 053c095a82cf
("netlink: make nlmsg_end() and genlmsg_end() void") then
fib_nl_fill_rule() returns a positive value (skb->len) on success, so
we break out of the loop here immediately.  Symptom is "ip rule show"
loops forever printing the first rule.

After I finish testing a fix (as trivial as changing to "if (err <
0)") here, I'll send it to -stable guys.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] [PATCH 2.6.25] RDMA/cxgb3: Shift calculation wrong for single sge entries.

2008-02-25 Thread Roland Dreier
Thanks, applied, although I assume based on the Signed-off-by line
that you left out a

  From: Bryan Rosenburg [EMAIL PROTECTED]

at the top (to get the authorship in git correctly).

  RDMA/cxgb3: Shift calculation wrong for single sge entries.

BTW, there's no need to duplicate the subject line in the message
body, but if you are going to do it, please put a Subject: before
it.  Otherwise I just have to edit it out by hand to avoid git putting
the subject in twice.

 - R.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RESEND] libertas: Remove unused exports

2008-02-23 Thread Roland Dreier
Any chance of getting this applied?  It seems the build is still
broken on ia64 at least due to the export of static functions.
---


The libertas driver exports a number of symbols with no in-tree
users; remove these unused exports.  lbs_reset_device() is completely
unused, with no callers at all, so remove the function completely.

A couple of these unused exported symbols are static, which causes the
following build error on ia64 with gcc 4.2.3:

drivers/net/wireless/libertas/main.c:1375: error: __ksymtab_lbs_remove_mesh 
causes a section type conflict
drivers/net/wireless/libertas/main.c:1354: error: __ksymtab_lbs_add_mesh 
causes a section type conflict

This patch fixes this problem.  I don't have hardware, so this is not
run-tested, but I tested the build with

CONFIG_LIBERTAS=y
CONFIG_LIBERTAS_USB=m
CONFIG_LIBERTAS_CS=m
CONFIG_LIBERTAS_SDIO=m

and there were no problems with undefined symbols.

Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
diff --git a/drivers/net/wireless/libertas/cmd.c 
b/drivers/net/wireless/libertas/cmd.c
index eab0203..b3c1acb 100644
--- a/drivers/net/wireless/libertas/cmd.c
+++ b/drivers/net/wireless/libertas/cmd.c
@@ -1040,7 +1040,6 @@ int lbs_mesh_access(struct lbs_private *priv, uint16_t 
cmd_action,
lbs_deb_leave(LBS_DEB_CMD);
return ret;
 }
-EXPORT_SYMBOL_GPL(lbs_mesh_access);
 
 int lbs_mesh_config(struct lbs_private *priv, uint16_t enable, uint16_t chan)
 {
@@ -1576,7 +1575,6 @@ done:
lbs_deb_leave_args(LBS_DEB_HOST, ret %d, ret);
return ret;
 }
-EXPORT_SYMBOL_GPL(lbs_prepare_and_send_command);
 
 /**
  *  @brief This function allocates the command buffer and link
diff --git a/drivers/net/wireless/libertas/decl.h 
b/drivers/net/wireless/libertas/decl.h
index aaacd9b..4e22341 100644
--- a/drivers/net/wireless/libertas/decl.h
+++ b/drivers/net/wireless/libertas/decl.h
@@ -69,7 +69,6 @@ struct lbs_private *lbs_add_card(void *card, struct device 
*dmdev);
 int lbs_remove_card(struct lbs_private *priv);
 int lbs_start_card(struct lbs_private *priv);
 int lbs_stop_card(struct lbs_private *priv);
-int lbs_reset_device(struct lbs_private *priv);
 void lbs_host_to_card_done(struct lbs_private *priv);
 
 int lbs_update_channel(struct lbs_private *priv);
diff --git a/drivers/net/wireless/libertas/main.c 
b/drivers/net/wireless/libertas/main.c
index 84fb49c..1eaf6af 100644
--- a/drivers/net/wireless/libertas/main.c
+++ b/drivers/net/wireless/libertas/main.c
@@ -1351,8 +1350,6 @@ done:
lbs_deb_leave_args(LBS_DEB_MESH, ret %d, ret);
return ret;
 }
-EXPORT_SYMBOL_GPL(lbs_add_mesh);
-
 
 static void lbs_remove_mesh(struct lbs_private *priv)
 {
@@ -1372,7 +1369,6 @@ static void lbs_remove_mesh(struct lbs_private *priv)
free_netdev(mesh_dev);
lbs_deb_leave(LBS_DEB_MESH);
 }
-EXPORT_SYMBOL_GPL(lbs_remove_mesh);
 
 /**
  *  @brief This function finds the CFP in
@@ -1458,20 +1454,6 @@ void lbs_interrupt(struct lbs_private *priv)
 }
 EXPORT_SYMBOL_GPL(lbs_interrupt);
 
-int lbs_reset_device(struct lbs_private *priv)
-{
-   int ret;
-
-   lbs_deb_enter(LBS_DEB_MAIN);
-   ret = lbs_prepare_and_send_command(priv, CMD_802_11_RESET,
-   CMD_ACT_HALT, 0, 0, NULL);
-   msleep_interruptible(10);
-
-   lbs_deb_leave_args(LBS_DEB_MAIN, ret %d, ret);
-   return ret;
-}
-EXPORT_SYMBOL_GPL(lbs_reset_device);
-
 static int __init lbs_init_module(void)
 {
lbs_deb_enter(LBS_DEB_MAIN);
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] [PATCH 2.6.25] RDMA/cxgb3: Fail loopbackconnections.

2008-02-13 Thread Roland Dreier
   how can a static void function return 0?

good question... I've fixed the patch in my tree.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] [PATCH 2.6.25] RDMA/cxgb3: Fail loopbackconnections.

2008-02-13 Thread Roland Dreier
Steve, I had to update the patch adding an include and fixing the
function declaration (as below)... but how much testing have you done
with this??

commit 8704e9a8790cc9e394198663c1c9150c899fb9a2
Author: Steve Wise [EMAIL PROTECTED]
Date:   Tue Feb 12 16:09:29 2008 -0600

RDMA/cxgb3: Fail loopback connections

The cxgb3 HW and driver don't support loopback RDMA connections.  So
fail any connection attempt where the destination address is local.

Signed-off-by: Steve Wise [EMAIL PROTECTED]
Signed-off-by: Roland Dreier [EMAIL PROTECTED]

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c 
b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index e9a08fa..320f2b6 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -35,6 +35,7 @@
 #include linux/skbuff.h
 #include linux/timer.h
 #include linux/notifier.h
+#include linux/inetdevice.h
 
 #include net/neighbour.h
 #include net/netevent.h
@@ -1784,6 +1785,17 @@ err:
return err;
 }
 
+static int is_loopback_dst(struct iw_cm_id *cm_id)
+{
+   struct net_device *dev;
+
+   dev = ip_dev_find(init_net, cm_id-remote_addr.sin_addr.s_addr);
+   if (!dev)
+   return 0;
+   dev_put(dev);
+   return 1;
+}
+
 int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
 {
int err = 0;
@@ -1791,6 +1803,11 @@ int iwch_connect(struct iw_cm_id *cm_id, struct 
iw_cm_conn_param *conn_param)
struct iwch_ep *ep;
struct rtable *rt;
 
+   if (is_loopback_dst(cm_id)) {
+   err = -ENOSYS;
+   goto out;
+   }
+
ep = alloc_ep(sizeof(*ep), GFP_KERNEL);
if (!ep) {
printk(KERN_ERR MOD %s - cannot alloc ep.\n, __FUNCTION__);
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] libertas: Remove unused exports

2008-02-12 Thread Roland Dreier
The libertas driver exports a number of symbols with no in-tree
users; remove these unused exports.  lbs_reset_device() is completely
unused, with no callers at all, so remove the function completely.

A couple of these unused exported symbols are static, which causes the
following build error on ia64 with gcc 4.2.3:

drivers/net/wireless/libertas/main.c:1375: error: __ksymtab_lbs_remove_mesh 
causes a section type conflict
drivers/net/wireless/libertas/main.c:1354: error: __ksymtab_lbs_add_mesh 
causes a section type conflict

This patch fixes this problem.  I don't have hardware, so this is not
run-tested, but I tested the build with

CONFIG_LIBERTAS=y
CONFIG_LIBERTAS_USB=m
CONFIG_LIBERTAS_CS=m
CONFIG_LIBERTAS_SDIO=m

and there were no problems with undefined symbols.

Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
Here's the patch that removes the unused exports (and a few more that
I found).

diff --git a/drivers/net/wireless/libertas/cmd.c 
b/drivers/net/wireless/libertas/cmd.c
index eab0203..b3c1acb 100644
--- a/drivers/net/wireless/libertas/cmd.c
+++ b/drivers/net/wireless/libertas/cmd.c
@@ -1040,7 +1040,6 @@ int lbs_mesh_access(struct lbs_private *priv, uint16_t 
cmd_action,
lbs_deb_leave(LBS_DEB_CMD);
return ret;
 }
-EXPORT_SYMBOL_GPL(lbs_mesh_access);
 
 int lbs_mesh_config(struct lbs_private *priv, uint16_t enable, uint16_t chan)
 {
@@ -1576,7 +1575,6 @@ done:
lbs_deb_leave_args(LBS_DEB_HOST, ret %d, ret);
return ret;
 }
-EXPORT_SYMBOL_GPL(lbs_prepare_and_send_command);
 
 /**
  *  @brief This function allocates the command buffer and link
diff --git a/drivers/net/wireless/libertas/decl.h 
b/drivers/net/wireless/libertas/decl.h
index aaacd9b..4e22341 100644
--- a/drivers/net/wireless/libertas/decl.h
+++ b/drivers/net/wireless/libertas/decl.h
@@ -69,7 +69,6 @@ struct lbs_private *lbs_add_card(void *card, struct device 
*dmdev);
 int lbs_remove_card(struct lbs_private *priv);
 int lbs_start_card(struct lbs_private *priv);
 int lbs_stop_card(struct lbs_private *priv);
-int lbs_reset_device(struct lbs_private *priv);
 void lbs_host_to_card_done(struct lbs_private *priv);
 
 int lbs_update_channel(struct lbs_private *priv);
diff --git a/drivers/net/wireless/libertas/main.c 
b/drivers/net/wireless/libertas/main.c
index 84fb49c..1eaf6af 100644
--- a/drivers/net/wireless/libertas/main.c
+++ b/drivers/net/wireless/libertas/main.c
@@ -1351,8 +1350,6 @@ done:
lbs_deb_leave_args(LBS_DEB_MESH, ret %d, ret);
return ret;
 }
-EXPORT_SYMBOL_GPL(lbs_add_mesh);
-
 
 static void lbs_remove_mesh(struct lbs_private *priv)
 {
@@ -1372,7 +1369,6 @@ static void lbs_remove_mesh(struct lbs_private *priv)
free_netdev(mesh_dev);
lbs_deb_leave(LBS_DEB_MESH);
 }
-EXPORT_SYMBOL_GPL(lbs_remove_mesh);
 
 /**
  *  @brief This function finds the CFP in
@@ -1458,20 +1454,6 @@ void lbs_interrupt(struct lbs_private *priv)
 }
 EXPORT_SYMBOL_GPL(lbs_interrupt);
 
-int lbs_reset_device(struct lbs_private *priv)
-{
-   int ret;
-
-   lbs_deb_enter(LBS_DEB_MAIN);
-   ret = lbs_prepare_and_send_command(priv, CMD_802_11_RESET,
-   CMD_ACT_HALT, 0, 0, NULL);
-   msleep_interruptible(10);
-
-   lbs_deb_leave_args(LBS_DEB_MAIN, ret %d, ret);
-   return ret;
-}
-EXPORT_SYMBOL_GPL(lbs_reset_device);
-
 static int __init lbs_init_module(void)
 {
lbs_deb_enter(LBS_DEB_MAIN);
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] [PATCH 2.6.25] RDMA/cxgb3: Fail loopback connections.

2008-02-12 Thread Roland Dreier
applied, although:

  +static void is_loopback_dst(struct iw_cm_id *cm_id)
  +{
  +struct net_device *dev;
  +
  +dev = ip_dev_find(init_net, cm_id-remote_addr.sin_addr.s_addr);
  +if (!dev)
  +return 0;
  +dev_put(dev);
  +return 1;
  +}

is there any way this could trigger when it should, like if I'm trying
to make a connection from one local device to a different local device
(which should work fine)?

 - R.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mlx4: needs vmalloc.h for vmap()

2008-02-11 Thread Roland Dreier
thanks, just merged the same patch from Olof Johansson.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] libertas: Don't mark exported symbols as static

2008-02-08 Thread Roland Dreier
Marking exported symbols as static causes the following build error on
ia64 with gcc 4.2.3:

drivers/net/wireless/libertas/main.c:1375: error: __ksymtab_lbs_remove_mesh 
causes a section type conflict
drivers/net/wireless/libertas/main.c:1354: error: __ksymtab_lbs_add_mesh causes 
a section type conflict

Therefore, remove the static marking on lbs_remove_mesh and lbs_add_mesh.

Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
diff --git a/drivers/net/wireless/libertas/main.c 
b/drivers/net/wireless/libertas/main.c
index 84fb49c..a688ce8 100644
--- a/drivers/net/wireless/libertas/main.c
+++ b/drivers/net/wireless/libertas/main.c
@@ -253,8 +253,8 @@ static ssize_t lbs_anycast_set(struct device *dev,
 
 static int lbs_add_rtap(struct lbs_private *priv);
 static void lbs_remove_rtap(struct lbs_private *priv);
-static int lbs_add_mesh(struct lbs_private *priv);
-static void lbs_remove_mesh(struct lbs_private *priv);
+int lbs_add_mesh(struct lbs_private *priv);
+void lbs_remove_mesh(struct lbs_private *priv);
 
 
 /**
@@ -1296,7 +1296,7 @@ EXPORT_SYMBOL_GPL(lbs_stop_card);
  *  @param privA pointer to the struct lbs_private structure
  *  @return   0 if successful, -X otherwise
  */
-static int lbs_add_mesh(struct lbs_private *priv)
+int lbs_add_mesh(struct lbs_private *priv)
 {
struct net_device *mesh_dev = NULL;
int ret = 0;
@@ -1354,7 +1354,7 @@ done:
 EXPORT_SYMBOL_GPL(lbs_add_mesh);
 
 
-static void lbs_remove_mesh(struct lbs_private *priv)
+void lbs_remove_mesh(struct lbs_private *priv)
 {
struct net_device *mesh_dev;
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libertas: Don't mark exported symbols as static

2008-02-08 Thread Roland Dreier
  Why not pull the exports? they aren't used anywhere in the existing kernel.

I'm guessing there's some not-(yet-)merged mesh networking stuff that
uses the symbols, but it doesn't matter much to me...

 - R.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [1/2] Skip empty hash buckets faster in /proc/net/tcp

2008-01-30 Thread Roland Dreier
  On a 2GB Core2 system here I see a time cat /proc/net/tcp  /dev/null
  constently dropping from 0.44s to 0.4-0.8s system time with this change.

Seems like there must be a typo in either the before or after times
you report here?
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] cxgb3: Remove incorrect __devinit annotations

2008-01-29 Thread Roland Dreier
When PCI error recovery was added to cxgb3, a function t3_io_slot_reset()
was added.  This function can call back into t3_prep_adapter() at any
time, so t3_prep_adapter() can no longer be marked __devinit.
This patch removes the __devinit annotation from t3_prep_adapter() and
all the functions that it calls, which fixes

WARNING: drivers/net/cxgb3/built-in.o(.text+0x2427): Section mismatch in 
reference from the function t3_io_slot_reset() to the function 
.devinit.text:t3_prep_adapter()

Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
 drivers/net/cxgb3/mc5.c   |2 +-
 drivers/net/cxgb3/sge.c   |2 +-
 drivers/net/cxgb3/t3_hw.c |   22 ++
 3 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/drivers/net/cxgb3/mc5.c b/drivers/net/cxgb3/mc5.c
index 84c1ffa..4c4d6e8 100644
--- a/drivers/net/cxgb3/mc5.c
+++ b/drivers/net/cxgb3/mc5.c
@@ -452,7 +452,7 @@ void t3_mc5_intr_handler(struct mc5 *mc5)
t3_write_reg(adap, A_MC5_DB_INT_CAUSE, cause);
 }
 
-void __devinit t3_mc5_prep(struct adapter *adapter, struct mc5 *mc5, int mode)
+void t3_mc5_prep(struct adapter *adapter, struct mc5 *mc5, int mode)
 {
 #define K * 1024
 
diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
index cb684d3..9ca8c66 100644
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -2836,7 +2836,7 @@ void t3_sge_init(struct adapter *adap, struct sge_params 
*p)
  * defaults for the assorted SGE parameters, which admins can change until
  * they are used to initialize the SGE.
  */
-void __devinit t3_sge_prep(struct adapter *adap, struct sge_params *p)
+void t3_sge_prep(struct adapter *adap, struct sge_params *p)
 {
int i;
 
diff --git a/drivers/net/cxgb3/t3_hw.c b/drivers/net/cxgb3/t3_hw.c
index 7469935..a99496a 100644
--- a/drivers/net/cxgb3/t3_hw.c
+++ b/drivers/net/cxgb3/t3_hw.c
@@ -2675,7 +2675,7 @@ void t3_tp_set_max_rxsize(struct adapter *adap, unsigned 
int size)
 V_PMMAXXFERLEN0(size) | V_PMMAXXFERLEN1(size));
 }
 
-static void __devinit init_mtus(unsigned short mtus[])
+static void init_mtus(unsigned short mtus[])
 {
/*
 * See draft-mathis-plpmtud-00.txt for the values.  The min is 88 so
@@ -2703,7 +2703,7 @@ static void __devinit init_mtus(unsigned short mtus[])
 /*
  * Initial congestion control parameters.
  */
-static void __devinit init_cong_ctrl(unsigned short *a, unsigned short *b)
+static void init_cong_ctrl(unsigned short *a, unsigned short *b)
 {
a[0] = a[1] = a[2] = a[3] = a[4] = a[5] = a[6] = a[7] = a[8] = 1;
a[9] = 2;
@@ -3354,8 +3354,7 @@ out_err:
  * Determines a card's PCI mode and associated parameters, such as speed
  * and width.
  */
-static void __devinit get_pci_mode(struct adapter *adapter,
-  struct pci_params *p)
+static void get_pci_mode(struct adapter *adapter, struct pci_params *p)
 {
static unsigned short speed_map[] = { 33, 66, 100, 133 };
u32 pci_mode, pcie_cap;
@@ -3395,8 +3394,7 @@ static void __devinit get_pci_mode(struct adapter 
*adapter,
  * capabilities and default speed/duplex/flow-control/autonegotiation
  * settings.
  */
-static void __devinit init_link_config(struct link_config *lc,
-  unsigned int caps)
+static void init_link_config(struct link_config *lc, unsigned int caps)
 {
lc-supported = caps;
lc-requested_speed = lc-speed = SPEED_INVALID;
@@ -3419,7 +3417,7 @@ static void __devinit init_link_config(struct link_config 
*lc,
  * Calculates the size of an MC7 memory in bytes from the value of its
  * configuration register.
  */
-static unsigned int __devinit mc7_calc_size(u32 cfg)
+static unsigned int mc7_calc_size(u32 cfg)
 {
unsigned int width = G_WIDTH(cfg);
unsigned int banks = !!(cfg  F_BKS) + 1;
@@ -3430,8 +3428,8 @@ static unsigned int __devinit mc7_calc_size(u32 cfg)
return MBs  20;
 }
 
-static void __devinit mc7_prep(struct adapter *adapter, struct mc7 *mc7,
-  unsigned int base_addr, const char *name)
+static void mc7_prep(struct adapter *adapter, struct mc7 *mc7,
+unsigned int base_addr, const char *name)
 {
u32 cfg;
 
@@ -3517,7 +3515,7 @@ static int t3_reset_adapter(struct adapter *adapter)
return 0;
 }
 
-static int __devinit init_parity(struct adapter *adap)
+static int init_parity(struct adapter *adap)
 {
int i, err, addr;
 
@@ -3552,8 +3550,8 @@ static int __devinit init_parity(struct adapter *adap)
  * for some adapter tunables, take PHYs out of reset, and initialize the MDIO
  * interface.
  */
-int __devinit t3_prep_adapter(struct adapter *adapter,
- const struct adapter_info *ai, int reset)
+int t3_prep_adapter(struct adapter *adapter, const struct adapter_info *ai,
+   int reset)
 {
int ret;
unsigned int i, j = 0;
--
To unsubscribe from

Re: [PATCH 2.6.25] RDMA/cxgb3: Fix the T3A workaround checks.

2008-01-24 Thread Roland Dreier
thanks, applied.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RESEND 3/3] RDMA/cxgb3: Mark qp as privileged based on user capabilities.

2008-01-22 Thread Roland Dreier
thanks, applied 1-3
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH] IPoIB: improve IPv4/IPv6 to IB mcast mapping functions

2008-01-04 Thread Roland Dreier
Any objection to merging the following for 2.6.25?

[Rolf -- I think it makes more sense to delete the overwriting of the
P_Key in ipoib_multicast.c in this patch rather than later in the
series; do you agree?]

Thanks,
  Roland


From: Rolf Manderscheid [EMAIL PROTECTED]

An IPoIB subnet on an IB fabric that spans multiple IB subnets can't
use link-local scope in multicast GIDs.  The existing routines that
map IP/IPv6 multicast addresses into IB link-level addresses hard-code
the scope to link-local, and they also leave the partition key field
uninitialised.  This patch adds a parameter (the link-level broadcast
address) to the mapping routines, allowing them to initialise both the
scope and the P_Key appropriately, and fixes up the call sites.

The next step will be to add a way to configure the scope for an IPoIB
interface.

Signed-off-by: Rolf Manderscheid [EMAIL PROTECTED]
Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
 drivers/infiniband/core/cma.c  |4 +---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |4 
 include/net/if_inet6.h |   11 +++
 include/net/ip.h   |   10 ++
 net/ipv4/arp.c |2 +-
 net/ipv6/ndisc.c   |2 +-
 6 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 312ec74..982836e 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2610,11 +2610,9 @@ static void cma_set_mgid(struct rdma_id_private *id_priv,
/* IPv6 address is an SA assigned MGID. */
memcpy(mgid, sin6-sin6_addr, sizeof *mgid);
} else {
-   ip_ib_mc_map(sin-sin_addr.s_addr, mc_map);
+   ip_ib_mc_map(sin-sin_addr.s_addr, dev_addr-broadcast, mc_map);
if (id_priv-id.ps == RDMA_PS_UDP)
mc_map[7] = 0x01;   /* Use RDMA CM signature */
-   mc_map[8] = ib_addr_get_pkey(dev_addr)  8;
-   mc_map[9] = (unsigned char) ib_addr_get_pkey(dev_addr);
*mgid = *(union ib_gid *) (mc_map + 4);
}
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 858ada1..2628339 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -788,10 +788,6 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 
memcpy(mgid.raw, mclist-dmi_addr + 4, sizeof mgid);
 
-   /* Add in the P_Key */
-   mgid.raw[4] = (priv-pkey  8)  0xff;
-   mgid.raw[5] = priv-pkey  0xff;
-
mcast = __ipoib_mcast_find(dev, mgid);
if (!mcast || test_bit(IPOIB_MCAST_FLAG_SENDONLY, 
mcast-flags)) {
struct ipoib_mcast *nmcast;
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index 448eccb..b24508a 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -269,18 +269,21 @@ static inline void ipv6_arcnet_mc_map(const struct 
in6_addr *addr, char *buf)
buf[0] = 0x00;
 }
 
-static inline void ipv6_ib_mc_map(struct in6_addr *addr, char *buf)
+static inline void ipv6_ib_mc_map(const struct in6_addr *addr,
+ const unsigned char *broadcast, char *buf)
 {
+   unsigned char scope = broadcast[5]  0xF;
+
buf[0]  = 0;/* Reserved */
buf[1]  = 0xff; /* Multicast QPN */
buf[2]  = 0xff;
buf[3]  = 0xff;
buf[4]  = 0xff;
-   buf[5]  = 0x12; /* link local scope */
+   buf[5]  = 0x10 | scope; /* scope from broadcast address */
buf[6]  = 0x60; /* IPv6 signature */
buf[7]  = 0x1b;
-   buf[8]  = 0;/* P_Key */
-   buf[9]  = 0;
+   buf[8]  = broadcast[8]; /* P_Key */
+   buf[9]  = broadcast[9];
memcpy(buf + 10, addr-s6_addr + 6, 10);
 }
 #endif
diff --git a/include/net/ip.h b/include/net/ip.h
index 840dd91..50c8889 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -266,20 +266,22 @@ static inline void ip_eth_mc_map(__be32 naddr, char *buf)
  * Leave P_Key as 0 to be filled in by driver.
  */
 
-static inline void ip_ib_mc_map(__be32 naddr, char *buf)
+static inline void ip_ib_mc_map(__be32 naddr, const unsigned char *broadcast, 
char *buf)
 {
__u32 addr;
+   unsigned char scope = broadcast[5]  0xF;
+
buf[0]  = 0;/* Reserved */
buf[1]  = 0xff; /* Multicast QPN */
buf[2]  = 0xff;
buf[3]  = 0xff;
addr= ntohl(naddr);
buf[4]  = 0xff;
-   buf[5]  = 0x12; /* link local scope */
+   buf[5]  = 0x10 | scope; /* scope from broadcast address */
buf[6]  = 0x40; /* IPv4 signature */
buf[7]  = 0x1b;
-   buf[8]  = 0;/* P_Key

Bogus commit 70eba18b (make bnx2x select ZLIB_INFLATE)?

2007-12-08 Thread Roland Dreier
Commit 70eba18b (make bnx2x select ZLIB_INFLATE) in Linus's tree
seems bogus.  As far as I can tell, bnx2x is not upstream yet, and the
commit in question actually adds select ZLIB_INFLATE to the TEHUTI
config, since there is no BNX2X config option (and also I don't see
any reference to zlib in any tehuti files).

I assume this is some sort of merge problem that got added to the
wrong tree and should be reverted upstream.

 - R.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH]: fix lro_gen_skb() alignment

2007-11-30 Thread Roland Dreier
   -   skb = netdev_alloc_skb(lro_mgr-dev, hlen);
   +   skb = netdev_alloc_skb(lro_mgr-dev, hlen + NET_IP_ALIGN);
   NET_IP_ALIGN should only be used if you're DMAing into the skb head.
   Otherwise you should say 2.  It would be nice to have another macro
   for that I suppose.
  
  It is certainly simple enough to say 2.  Thank you for pointing
  this out.  I have attached a patch to do that..
  
  Signed off by: Andrew Gallatin [EMAIL PROTECTED]

Isn't the value of 2 ethernet-specific (to round the 14-byte header up
to 16)?  Given that the rest of the lro code is fairly careful to
calculate mac_hdr_len etc it seems as if it would be cleaner to make
this independent of the specific L2 being used.

(And I plan on using the LRO module for IP-over-InfiniBand so this is
not completely theoretical)

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2.6.25 2/2] RDMA/cxgb3: Support 5.0 firmware.

2007-11-28 Thread Roland Dreier
OK, applied 1 and 2...

  Note: this change requires 5.0 firmware.

I assume the change to the cxgb3 FW versions is pending in a net
driver change for 2.6.25?
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-26 Thread Roland Dreier
  Agreed. On first glance, I was intrigued but:
  
  1) Why is everyone so concerned that export symbol space is large?
   - does it cost cpu or running memory?
   - does it cause bugs?
   - or are you just worried about evil modules?
  
  2) These aren't real namespaces
   - all global names still have to be unique
   - still have to handle the non-modular build namespace conflicts
   - there isn't a big problem with conflicting symbols today.

Perhaps changing the name from namespace to interface would help?
Then a module could have something like

MODULE_USE_INTERFACE(foo);

and I think that makes it clearer what the advantage of this is: it
marks symbols as being part of a certain interface, requires modules
that use that interface to declare that use explicitly, and allows
reviewers to say Hey why is this code using the scsi interface when
it's a webcam driver?

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-25 Thread Roland Dreier
  This patch allows to export symbols only for specific modules by 
  introducing symbol name spaces. A module name space has a white
  list of modules that are allowed to import symbols for it; all others
  can't use the symbols.
  
  It adds two new macros: 
  
  MODULE_NAMESPACE_ALLOW(namespace, module);

I definitely like the idea of organizing exported symbols into
namespaces.  However, I feel like it would make more sense to have
something like

MODULE_NAMESPACE_IMPORT(namespace);

inside the modules that use a namespace.  This matches the way C
preprocessor #includes, C++ namespaces, perl use, etc. works and so
it is probably closer to how programmers think.  It does weaken the
enforcement of review a little bit, since there are no changes to the
site where things are exported to catch, but git makes it easy to
sneak that kind of change in anyway.

The practical benefits are that you don't end up with stupid patch
conflicts caused by one patch adding MODULE_NAMESPACE_ALLOW() for
module foo and another patch adding it for module bar at the same
place, and that it becomes simpler for people to test or deliver, say,
a new TCP congestion module without having to rebuild the whole kernel
(which is an especially huge pain for distro kernels).

In any case I would make use of this in whatever form it gets merged
in -- Mellanox ConnectX hardware can operate simultaneously as an
InfiniBand adapter and a 10Gb ethernet NIC, and so my driver is split
up into a low-level mlx4_core driver that exports symbols for other
mlx4 drivers to use; clearly it only makes sense to export them to
other parts of the driver, and in this case the difference between
ALLOW and IMPORT semantics is not a big deal.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-25 Thread Roland Dreier
  Yes, and if a symbol is already used by multiple modules, it's generically 
  useful.  And if so, why restrict it to in-tree modules?

I agree that we shouldn't make things too hard for out-of-tree
modules, but I disagree with your first statement: there clearly is a
large class of symbols that are used by multiple modules but which are
not generically useful -- they are only useful by a certain small class
of modules.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-25 Thread Roland Dreier
   I agree that we shouldn't make things too hard for out-of-tree
   modules, but I disagree with your first statement: there clearly is a
   large class of symbols that are used by multiple modules but which are
   not generically useful -- they are only useful by a certain small class
   of modules.
  
  If it is so clear, you should be able to easily provide examples?

Sure -- Andi's example of symbols required only by TCP congestion
modules; the SCSI internals that Christoph wants to mark; the symbols
exported by my mlx4_core driver (which I admit are currently only used
by the mlx4_ib driver, but which will also be used by at least the
ethernet NIC driver for the same hardware).  I thought this was
already covered repeatedly in the thread and indeed in Andi's code so
there was no need to repeat it...
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-25 Thread Roland Dreier
  Except C doesn't have namespaces and this mechanism doesn't create them.  So 
  this is just complete and utter makework; as I said before, noone's going to 
  confuse all those udp_* functions if they're not in the udp namespace.

I don't understand why you're so opposed to organizing the kernel's
exported symbols in a more self-documenting way.  It seems pretty
clear to me that having a mechanism that requires modules to make
explicit which (semi-)internal APIs makes reviewing easier, makes it
easier to communicate please don't use that API to module authors,
and takes at least a small step towards bringing the kernel's exported
API under control.  What's the real downside?

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2.6.24] RDMA/cxgb3: Set the max_qp_init_rd_atom attribute.

2007-11-09 Thread Roland Dreier
thanks, applied.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/14 v2] nes: eeprom and phy routines

2007-10-31 Thread Roland Dreier
  +/* TODO: deal with EEPROM endian issues */

This is pretty scary.  Is the driver broken on big-endian systems now?

  +/*
  +Everything you wanted to know about CRC algorithms, but were afraid to ask
  + for fear that errors in your understanding might be detected. Version  : 
  3.

etc etc... can all this be replaced with what's in lib/crc32.c?  (I
hope so)

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/14 v2] nes: OpenFabrics kernel verbs

2007-10-31 Thread Roland Dreier
  +/**
  + * nes_post_send
  + */
  +static int nes_post_send(struct ib_qp *ibqp, struct ib_send_wr *ib_wr,
  +struct ib_send_wr **bad_wr)

  ...

  +switch (ib_wr-opcode) {

  ...

  +if (ib_wr-num_sge  
  nesdev-nesadapter-max_sge) {
  +err = -EINVAL;
  +break;
  +}

  ...

  +default:
  +/* error */
  +err = -EINVAL;
  +break;

looks like if you detect an error while posting a work request, you
break out of the switch statement but just continue through the while
loop going through the list of work reuqests.  Which doesn't seem like
it will work very well.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/14 v2] nes: module and device initialization

2007-10-30 Thread Roland Dreier
  Thanks Roland.  Let me know when you have your branch ready.

OK, I pushed out a neteffect branch at

git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git

This has the driver from your git tree plus some compile fixes and
cleanups (added as separate commits, so you can see what I did).  If
it suits you, let's work against that tree to continue cleaning things
up -- you can send me patches or git pull requests to pick up new things.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/14 v2] nes: module and device initialization

2007-10-26 Thread Roland Dreier
OK, a couple quick review comments and a process comment too:

 - First step in the driver is to kill off a lot of the #ifdefs:

  +#ifdef IRQF_SHARED

The upstream driver really shouldn't have compatibility gunk for older
kernels... just make it build against the kernel it's in.

  +#ifdef OFED_1_2

Same... kernel code shouldn't worry about OFED.

  +#ifdef CONFIG_PCI_MSI
  +if (nesdev-msi_enabled) {
  +pci_disable_msi(pcidev);
  +}
  +#endif

This can be much simpler, because pci_disable_msi() is always
available and is a NOP if the config option is off or MSI is not
enabled.  So you can just unconditionally do

pci_disable_msi(pcidev);

  +#ifdef NES_NAPI

I don't see anything that defines NES_NAPI.  I think for the final
merge we want a NAPI-only driver (ie no ifdef at all)... is there any
performance or other reason to ever build a non-NAPI driver (for a
modern kernel)?

OK, on a process level, my plan is to pull the current driver into a
neteffect branch in my git tree with the intention of merging it for
2.6.25.  I'll let you know when that's ready (probably early next
week).  I'll probably do some cleanups there, and you can send me
cleanup/fix patches against that branch any time too.  We should try
to keep the cycle time short: the interval between the first posting
of this driver and the current one was pretty long, and there's a lot
of cleanup to do to get ready for the next merge window.  Does that
plan make sense?

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] [PATCH 14/14 v2] nes: kernel build infrastructure

2007-10-21 Thread Roland Dreier
  +
  +EXTRA_CFLAGS += -DNES_MINICM

I don't see anyplace NES_MINICM is used.  Delete this line?

  +
  +obj-$(CONFIG_INFINIBAND_NES) += iw_nes.o
  +
  +iw_nes-objs := nes.o nes_hw.o nes_nic.o nes_utils.o nes_verbs.o nes_cm.o
  +

Also the file has an extra blank line at the beginning and end.  Might
as well kill them.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/14 v2] nes: module and device initialization

2007-10-19 Thread Roland Dreier
Thanks... I am kind of overloaded trying to handle the last few things
for the 2.6.24 merge window, but I will look at this next week, and I
expect we should be able to merge the driver for 2.6.25 unless there
are unexpected hangups.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/14 v2] nes: device structures and defines

2007-10-19 Thread Roland Dreier
  You are starting off on the wrong foot.

???

   +if(!(expr)) { 
   \
   +  printk(KERN_ERR PFX Assertion failed! %s, %s, %s, line %d\n,  \
   + #expr, __FILE__, __FUNCTION__, __LINE__);
   \
   +}
  
  Use BUG_ON

I agree that there's no need to invent a driver-private assertion
macro, but (to first order at least) drivers should never use BUG_ON.
I don't want some glitch in a network driver that the system could
probably survive to be turned into a panic by BUG_ON -- WARN_ON seems
infinitely preferable.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/8] [MV643XX_ETH] Remove SHARED_REGS register address bias

2007-10-18 Thread Roland Dreier
  +static void __iomem *mv643xx_eth_base;

  +return readl(((void __iomem *)mv643xx_eth_base) + offset);

Given the declaration of mv643xx_eth_base as void __iomem * already, I
don't understand why you need the cast to the same type here (and
elsewhere in the driver).

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bonding support for eth1394?

2007-10-13 Thread Roland Dreier
  The bonding sources have a few occurrences of EOPNOTSUPP.  Unless I
  missed something, they are all related to setting the hardware address
  of the interface.  AFAICS this is impossible with IP over FireWire.  If
  it is crucial to bonding to be able to change the slaves' hardware
  addresses, then you are out of luck.

There are a few changes to the bonding driver pending that will add
support for bonding IP-over-InfiniBand interfaces.  IPoIB also cannot
change its HW address, so the patches address that issue.

Once those patches land, bonding eth1394 interfaces may just work.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/ipoib: Bound the net device to the ipoib_neigh structue

2007-10-11 Thread Roland Dreier
  It happens only when ib interfaces are slaves of a bonding device.
  I thought before that the stuck is in napi_disable() but it's almost right.
  I put prints before and after call to napi_disable and see that it is called 
  twice.
  I'll try to investigate in this direction.
  
  ib0: stopping interface
  ib0: before napi_disable
  ib0: after napi_disable
  ib0: downing ib_dev
  ib0: All sends and receives done.
  ib0: stopping interface
  ib0: before napi_disable

Yes, two napi_disable()s in a row without a matching napi_enable()
will deadlock.  I guess the question is why the ipoib interface is
being stopped twice.

If you just take the net-2.6.24 tree (without bonding patches), does
bonding for ethernet interfaces work OK, or is there a similar problem
with double napi_disable()?  How about bonding of ethernet after this
batch of bonding patches?

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/ipoib: Bound the net device to the ipoib_neigh structue

2007-10-10 Thread Roland Dreier
  I also ran a test for the code in the branch of 2.6.24 and found a problem.
  I see that ifconfig down doesn't return (for IPoIB interfaces) and it's 
  stuck in napi_disable() in the kernel (any idea why?)

For what it's worth, I took the upstream 2.6.23 git tree and merged in
Dave's latest net-2.6.24 tree and my latest for-2.6.24 tree and tried
that.  I brought up an IPoIB interface, sent a few pings, and did
ifconfig down, and it worked fine.

Can you try the same thing without the bonding patches to see if your
setup works OK too?

Also can you give more details about what you do to get ifconfig down stuck?

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Roland Dreier
  I'd say we can probably try to get rid of it in 2.6.25, this is
  assuming we get driver authors to cooperate and do the conversions
  or alternatively some other motivated person.
  
  I can just threaten to do them all and that should get the driver
  maintainers going :-)

I can definitely kill LLTX for IPoIB by 2.6.25 and I just added it to
my TODO list so I don't forget.

In fact if 2.6.23 drags on long enough I may do it for 2.6.24
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Roland Dreier
  Before you add new entries to your list, how is that ibm driver NAPI
  conversion coming along? :-)

I still haven't done much.  OK, I will try to get my board booting
again this week.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Roland Dreier
  Before you add new entries to your list, how is that ibm driver NAPI
  conversion coming along? :-)

OK, thanks for the kick in the pants, I have a couple of patches for
net-2.6.24 coming (including an unrelated trivial warning fix for
IPoIB).

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] ibm_emac: Convert to use napi_struct independent of struct net_device

2007-10-09 Thread Roland Dreier
Commit da3dedd9 ([NET]: Make NAPI polling independent of struct
net_device objects.) changed the interface to NAPI polling.  Fix up
the ibm_emac driver so that it works with this new interface.  This is
actually a nice cleanup because ibm_emac is one of the drivers that
wants to have multiple NAPI structures for a single net_device.

Tested with the internal MAC of a PowerPC 440SPe SoC with an AMCC
'Yucca' evaluation board.

Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
 drivers/net/ibm_emac/ibm_emac_mal.c |   48 --
 drivers/net/ibm_emac/ibm_emac_mal.h |2 +-
 include/linux/netdevice.h   |   10 +++
 3 files changed, 28 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ibm_emac/ibm_emac_mal.c 
b/drivers/net/ibm_emac/ibm_emac_mal.c
index cabd984..cc3ddc9 100644
--- a/drivers/net/ibm_emac/ibm_emac_mal.c
+++ b/drivers/net/ibm_emac/ibm_emac_mal.c
@@ -207,10 +207,10 @@ static irqreturn_t mal_serr(int irq, void *dev_instance)
 
 static inline void mal_schedule_poll(struct ibm_ocp_mal *mal)
 {
-   if (likely(netif_rx_schedule_prep(mal-poll_dev))) {
+   if (likely(napi_schedule_prep(mal-napi))) {
MAL_DBG2(%d: schedule_poll NL, mal-def-index);
mal_disable_eob_irq(mal);
-   __netif_rx_schedule(mal-poll_dev);
+   __napi_schedule(mal-napi);
} else
MAL_DBG2(%d: already in poll NL, mal-def-index);
 }
@@ -273,11 +273,11 @@ static irqreturn_t mal_rxde(int irq, void *dev_instance)
return IRQ_HANDLED;
 }
 
-static int mal_poll(struct net_device *ndev, int *budget)
+static int mal_poll(struct napi_struct *napi, int budget)
 {
-   struct ibm_ocp_mal *mal = ndev-priv;
+   struct ibm_ocp_mal *mal = container_of(napi, struct ibm_ocp_mal, napi);
struct list_head *l;
-   int rx_work_limit = min(ndev-quota, *budget), received = 0, done;
+   int received = 0;
 
MAL_DBG2(%d: poll(%d) %d - NL, mal-def-index, *budget,
 rx_work_limit);
@@ -295,38 +295,34 @@ static int mal_poll(struct net_device *ndev, int *budget)
list_for_each(l, mal-poll_list) {
struct mal_commac *mc =
list_entry(l, struct mal_commac, poll_list);
-   int n = mc-ops-poll_rx(mc-dev, rx_work_limit);
+   int n = mc-ops-poll_rx(mc-dev, budget);
if (n) {
received += n;
-   rx_work_limit -= n;
-   if (rx_work_limit = 0) {
-   done = 0;
+   budget -= n;
+   if (budget = 0)
goto more_work; // XXX What if this is the last 
one ?
-   }
}
}
 
/* We need to disable IRQs to protect from RXDE IRQ here */
local_irq_disable();
-   __netif_rx_complete(ndev);
+   __napi_complete(napi);
mal_enable_eob_irq(mal);
local_irq_enable();
 
-   done = 1;
-
/* Check for rotting packet(s) */
list_for_each(l, mal-poll_list) {
struct mal_commac *mc =
list_entry(l, struct mal_commac, poll_list);
if (unlikely(mc-ops-peek_rx(mc-dev) || mc-rx_stopped)) {
MAL_DBG2(%d: rotting packet NL, mal-def-index);
-   if (netif_rx_reschedule(ndev, received))
+   if (napi_reschedule(napi))
mal_disable_eob_irq(mal);
else
MAL_DBG2(%d: already in poll list NL,
 mal-def-index);
 
-   if (rx_work_limit  0)
+   if (budget  0)
goto again;
else
goto more_work;
@@ -335,12 +331,8 @@ static int mal_poll(struct net_device *ndev, int *budget)
}
 
   more_work:
-   ndev-quota -= received;
-   *budget -= received;
-
-   MAL_DBG2(%d: poll() %d - %d NL, mal-def-index, *budget,
-done ? 0 : 1);
-   return done ? 0 : 1;
+   MAL_DBG2(%d: poll() %d - %d NL, mal-def-index, budget, received);
+   return received;
 }
 
 static void mal_reset(struct ibm_ocp_mal *mal)
@@ -425,11 +417,8 @@ static int __init mal_probe(struct ocp_device *ocpdev)
mal-def = ocpdev-def;
 
INIT_LIST_HEAD(mal-poll_list);
-   set_bit(__LINK_STATE_START, mal-poll_dev.state);
-   mal-poll_dev.weight = CONFIG_IBM_EMAC_POLL_WEIGHT;
-   mal-poll_dev.poll = mal_poll;
-   mal-poll_dev.priv = mal;
-   atomic_set(mal-poll_dev.refcnt, 1);
+   mal-napi.weight = CONFIG_IBM_EMAC_POLL_WEIGHT;
+   mal-napi.poll = mal_poll;
 
INIT_LIST_HEAD(mal-list);
 
@@ -520,11 +509,8 @@ static void __exit mal_remove(struct ocp_device *ocpdev)
 
MAL_DBG(%d: remove NL, mal-def

[PATCH 3/4] ibm_new_emac: Nuke SET_MODULE_OWNER() use

2007-10-09 Thread Roland Dreier
Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
 drivers/net/ibm_newemac/core.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ibm_newemac/core.c b/drivers/net/ibm_newemac/core.c
index ce127b9..8ea5009 100644
--- a/drivers/net/ibm_newemac/core.c
+++ b/drivers/net/ibm_newemac/core.c
@@ -2549,7 +2549,6 @@ static int __devinit emac_probe(struct of_device *ofdev,
dev-ndev = ndev;
dev-ofdev = ofdev;
dev-blist = blist;
-   SET_MODULE_OWNER(ndev);
SET_NETDEV_DEV(ndev, ofdev-dev);
 
/* Initialize some embedded data structures */
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] ibm_emac: Convert to use napi_struct independent of struct net_device

2007-10-09 Thread Roland Dreier
Commit da3dedd9 ([NET]: Make NAPI polling independent of struct
net_device objects.) changed the interface to NAPI polling.  Fix up
the ibm_newemac driver so that it works with this new interface.  This
is actually a nice cleanup because ibm_newemac is one of the drivers
that wants to have multiple NAPI structures for a single net_device.

Compile-tested only as I don't have a system that uses the ibm_newemac
driver.  This conversion the conversion for the ibm_emac driver that
was tested on real PowerPC 440SPe hardware.

Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
 drivers/net/ibm_newemac/mal.c |   55 ++--
 drivers/net/ibm_newemac/mal.h |2 +-
 2 files changed, 20 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ibm_newemac/mal.c b/drivers/net/ibm_newemac/mal.c
index c4335b7..5885411 100644
--- a/drivers/net/ibm_newemac/mal.c
+++ b/drivers/net/ibm_newemac/mal.c
@@ -235,10 +235,10 @@ static irqreturn_t mal_serr(int irq, void *dev_instance)
 
 static inline void mal_schedule_poll(struct mal_instance *mal)
 {
-   if (likely(netif_rx_schedule_prep(mal-poll_dev))) {
+   if (likely(napi_schedule_prep(mal-napi))) {
MAL_DBG2(mal, schedule_poll NL);
mal_disable_eob_irq(mal);
-   __netif_rx_schedule(mal-poll_dev);
+   __napi_schedule(mal-napi);
} else
MAL_DBG2(mal, already in poll NL);
 }
@@ -318,8 +318,7 @@ void mal_poll_disable(struct mal_instance *mal, struct 
mal_commac *commac)
msleep(1);
 
/* Synchronize with the MAL NAPI poller. */
-   while (test_bit(__LINK_STATE_RX_SCHED, mal-poll_dev.state))
-   msleep(1);
+   napi_disable(mal-napi);
 }
 
 void mal_poll_enable(struct mal_instance *mal, struct mal_commac *commac)
@@ -330,11 +329,11 @@ void mal_poll_enable(struct mal_instance *mal, struct 
mal_commac *commac)
// XXX might want to kick a poll now...
 }
 
-static int mal_poll(struct net_device *ndev, int *budget)
+static int mal_poll(struct napi_struct *napi, int budget)
 {
-   struct mal_instance *mal = netdev_priv(ndev);
+   struct mal_instance *mal = container_of(napi, struct mal_instance, 
napi);
struct list_head *l;
-   int rx_work_limit = min(ndev-quota, *budget), received = 0, done;
+   int received = 0;
unsigned long flags;
 
MAL_DBG2(mal, poll(%d) %d - NL, *budget,
@@ -358,26 +357,21 @@ static int mal_poll(struct net_device *ndev, int *budget)
int n;
if (unlikely(test_bit(MAL_COMMAC_POLL_DISABLED, mc-flags)))
continue;
-   n = mc-ops-poll_rx(mc-dev, rx_work_limit);
+   n = mc-ops-poll_rx(mc-dev, budget);
if (n) {
received += n;
-   rx_work_limit -= n;
-   if (rx_work_limit = 0) {
-   done = 0;
-   // XXX What if this is the last one ?
-   goto more_work;
-   }
+   budget -= n;
+   if (budget = 0)
+   goto more_work; // XXX What if this is the last 
one ?
}
}
 
/* We need to disable IRQs to protect from RXDE IRQ here */
spin_lock_irqsave(mal-lock, flags);
-   __netif_rx_complete(ndev);
+   __napi_complete(napi);
mal_enable_eob_irq(mal);
spin_unlock_irqrestore(mal-lock, flags);
 
-   done = 1;
-
/* Check for rotting packet(s) */
list_for_each(l, mal-poll_list) {
struct mal_commac *mc =
@@ -387,12 +381,12 @@ static int mal_poll(struct net_device *ndev, int *budget)
if (unlikely(mc-ops-peek_rx(mc-dev) ||
 test_bit(MAL_COMMAC_RX_STOPPED, mc-flags))) {
MAL_DBG2(mal, rotting packet NL);
-   if (netif_rx_reschedule(ndev, received))
+   if (napi_reschedule(napi))
mal_disable_eob_irq(mal);
else
MAL_DBG2(mal, already in poll list NL);
 
-   if (rx_work_limit  0)
+   if (budget  0)
goto again;
else
goto more_work;
@@ -401,13 +395,8 @@ static int mal_poll(struct net_device *ndev, int *budget)
}
 
  more_work:
-   ndev-quota -= received;
-   *budget -= received;
-
-   MAL_DBG2(mal, poll() %d - %d NL, *budget,
-done ? 0 : 1);
-
-   return done ? 0 : 1;
+   MAL_DBG2(mal, poll() %d - %d NL, budget, received);
+   return received;
 }
 
 static void mal_reset(struct mal_instance *mal)
@@ -538,11 +527,8 @@ static int __devinit mal_probe(struct of_device *ofdev,
}
 
INIT_LIST_HEAD(mal

Re: [ofa-general] [PATCH 4/4] ibm_emac: Convert to use napi_struct independent of struct net_device

2007-10-09 Thread Roland Dreier
Sorry... wrong subject here; it should have been ibm_newemac: ...

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

2007-10-05 Thread Roland Dreier
  I tested this by simulating a slow passive side responder, and it worked as
  expected for those tests.  Using an MRA does add another MAD to the CM 
  exchange,
  which is why it is sent only after seeing a duplicate request.  
  Alternatively,
  we can take the OFED module parameter patch.

What the heck, I added this for 2.6.24.  If it doesn't work out we can
back it out.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Vague maybe ppp-related panic report for 2.6.23-rc9

2007-10-04 Thread Roland Dreier
I'm running 2.6.23-rc9 on my laptop, and when in a coffee shop I use
a Verizon EVDO card to get network access.  This is a kyocera device
that looks like a serial adapter behind an ohci usb controller, and
uses the airprime driver (for usb device 0c88:17da).  The actual IP
networking is ppp over that serial adapter.

The connection is slightly flaky (I blame this on the Verizon
coverage), so I end up restarting things every so often.  The point of
this email (at last) is that I've just seen my second panic (only
info: blinking caps lock led, since I've been in X) that seems
correlated to stopping/restarting pppd.

Sorry for the lack of detail -- I've just switched to running in the
console so if I can provoke the crash again I'll get a little more
info.  I just wanted to mention this in case someone has seen
something similar or has any good ideas about how to capture debugging
output on a laptop (when I'm not home and have no other boxes handy).

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Vague maybe ppp-related panic report for 2.6.23-rc9

2007-10-04 Thread Roland Dreier
  I don't want to jump the gun on the analysis but it just might
  be the packet sharing fixes Herbert put in a short time ago.
  
  What you could do is go back to say rc2 and see if you still get
  the panics, then bisect from there to narrow it down.
  
  If rc2 still gives the panic, it's something else, perhaps device
  specific.

OK thanks.  I'm still trying to find a good way to reproduce it, so
I'm going to try to get it to happen with -rc9 while running in the
console, and at least take a picture of the backtrace.  If I find a
good way to make it happen then I'll try rc2 and let you know.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mac80211: Fix TX after monitor interface is converted to managed

2007-10-04 Thread Roland Dreier
  Programming with assertions (and BUG_ON is a form of that) is
  generally a good practice.  Almost any book or other source on
  good programming practices will agree.  Yes, it can be overdone.
  But I don't really think that is the case here, since the check is
  relatively inexpensive and the consequence should it ever *somehow*
  happen could be a something wierd (crash, corruption, etc) w/o any
  other indication of what occured.

The problem with BUG_ON is that it kills the whole system.  So every
time you add a BUG_ON into code, you have to weigh whether the problem
you detected is so severe that the right response is to panic.  For
example, I can see panicking on something fundamental like corrupted
page tables.  However I would submit that the wireless stack should
*never* use BUG_ON -- printing a warning and trying to limp on seems
preferable to me.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Vague maybe ppp-related panic report for 2.6.23-rc9

2007-10-04 Thread Roland Dreier
Just as a quick update -- I seem to only be able to reproduce this
crash when my ppp session drops, which seems associated with marginal
signal.  And unfortunately I have great coverage at home so I haven't
been able to reproduce this again today.  Maybe on the train tomorrow
I can crash my laptop...

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

2007-10-02 Thread Roland Dreier
  OK -- just to make sure I'm understanding what you're saying: have you
  confirmed that your proposed [CM MRA] patches actually fix the issue?
  
  Not directly.  I cannot easily test kernel patches on our larger, production
  clusters.  We've seen the issue with specific applications on 512 and 1024
  cores, but I've only been able to test the patch on a 48-core cluster.  I 
  have
  verified that it successfully increases the timeout to where it *should* 
  work,
  but cannot absolutely confirm that it will fix the problem.  I'm unlikely to
  know that until the production clusters move to an OFED release (1.3?)
  containing this patch.

Umm... this is a difficult situation for me to merge the changes then.
We're changing the CM retry behavior blind here.  How do we know that
the MRA changes don't make the scalability issue worse?

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: tcp bw in 2.6

2007-10-02 Thread Roland Dreier
  It would be really great to see numbers with a more recent kernel
  than 2.6.18

FWIW Debian has binaries for 2.6.21 in testing and for 2.6.22 in
unstable so it should be very easy for Larry to try at least those.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IPoIB: Convert to netdevice internal stats

2007-09-28 Thread Roland Dreier
Use the stats member of struct netdevice in IPoIB, so we can save
memory by deleting the stats member of struct ipoib_dev_priv, and save
code by deleting ipoib_get_stats().

Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
Dave, can you queue this in net-2.6.24 please?  I would ordinarily
merge IPoIB changes but since this depends on the netdevice internal
stats change it becomes a cross-tree dependency if I try to do that.
And I'd like to get it queued in git now before the merge window.

Thanks...

 drivers/infiniband/ulp/ipoib/ipoib.h   |2 --
 drivers/infiniband/ulp/ipoib/ipoib_cm.c|   20 ++--
 drivers/infiniband/ulp/ipoib/ipoib_ib.c|   18 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |   22 +++---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   10 +-
 5 files changed, 31 insertions(+), 41 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 3a6ef14..1e627ee 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -300,8 +300,6 @@ struct ipoib_dev_priv {
 
struct ib_event_handler event_handler;
 
-   struct net_device_stats stats;
-
struct net_device *parent;
struct list_head child_intfs;
struct list_head list;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 08b4676..1afd93c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -430,7 +430,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct 
ib_wc *wc)
ipoib_dbg(priv, cm recv error 
   (status=%d, wrid=%d vend_err %x)\n,
   wc-status, wr_id, wc-vendor_err);
-   ++priv-stats.rx_dropped;
+   ++dev-stats.rx_dropped;
goto repost;
}
 
@@ -457,7 +457,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct 
ib_wc *wc)
 * this packet and reuse the old buffer.
 */
ipoib_dbg(priv, failed to allocate receive buffer %d\n, 
wr_id);
-   ++priv-stats.rx_dropped;
+   ++dev-stats.rx_dropped;
goto repost;
}
 
@@ -474,8 +474,8 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct 
ib_wc *wc)
skb_pull(skb, IPOIB_ENCAP_LEN);
 
dev-last_rx = jiffies;
-   ++priv-stats.rx_packets;
-   priv-stats.rx_bytes += skb-len;
+   ++dev-stats.rx_packets;
+   dev-stats.rx_bytes += skb-len;
 
skb-dev = dev;
/* XXX get correct PACKET_ type here */
@@ -512,8 +512,8 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff 
*skb, struct ipoib_cm_
if (unlikely(skb-len  tx-mtu)) {
ipoib_warn(priv, packet len %d ( %d) too long to send, 
dropping\n,
   skb-len, tx-mtu);
-   ++priv-stats.tx_dropped;
-   ++priv-stats.tx_errors;
+   ++dev-stats.tx_dropped;
+   ++dev-stats.tx_errors;
ipoib_cm_skb_too_long(dev, skb, tx-mtu - IPOIB_ENCAP_LEN);
return;
}
@@ -532,7 +532,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff 
*skb, struct ipoib_cm_
tx_req-skb = skb;
addr = ib_dma_map_single(priv-ca, skb-data, skb-len, DMA_TO_DEVICE);
if (unlikely(ib_dma_mapping_error(priv-ca, addr))) {
-   ++priv-stats.tx_errors;
+   ++dev-stats.tx_errors;
dev_kfree_skb_any(skb);
return;
}
@@ -542,7 +542,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff 
*skb, struct ipoib_cm_
if (unlikely(post_send(priv, tx, tx-tx_head  (ipoib_sendq_size - 1),
addr, skb-len))) {
ipoib_warn(priv, post_send failed\n);
-   ++priv-stats.tx_errors;
+   ++dev-stats.tx_errors;
ib_dma_unmap_single(priv-ca, addr, skb-len, DMA_TO_DEVICE);
dev_kfree_skb_any(skb);
} else {
@@ -580,8 +580,8 @@ static void ipoib_cm_handle_tx_wc(struct net_device *dev, 
struct ipoib_cm_tx *tx
ib_dma_unmap_single(priv-ca, tx_req-mapping, tx_req-skb-len, 
DMA_TO_DEVICE);
 
/* FIXME: is this right? Shouldn't we only increment on success? */
-   ++priv-stats.tx_packets;
-   priv-stats.tx_bytes += tx_req-skb-len;
+   ++dev-stats.tx_packets;
+   dev-stats.tx_bytes += tx_req-skb-len;
 
dev_kfree_skb_any(tx_req-skb);
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index b664b98..1a77e79 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -208,7 +208,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, 
struct ib_wc *wc)
 * this packet and reuse the old buffer

Re: [PATCH] IPoIB: Convert to netdevice internal stats

2007-09-28 Thread Roland Dreier
  How is that ibm_emac NAPI conversion coming along? :-)

Sorry, trying to reduce my backlog first, but it is still on my list
of things to work on :)

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver

2007-09-25 Thread Roland Dreier
   Roland, are you comfortable with the IB changes in patches 1 and 2?

Yes, they look fine to me.  Feel free to apply, with

Acked-by: Roland Dreier [EMAIL PROTECTED]

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH V5 2/11] IB/ipoib: Notify the world before doing unregister

2007-09-23 Thread Roland Dreier
  The action in bonding to a detach of slave is to unregister the master (see 
  patch 10).
  This can't be done from the context of unregister_netdevice itself (it is 
  protected by rtnl_lock).

I'm confused.  Your patch has:

  +ipoib_slave_detach(cpriv-dev);
   unregister_netdev(cpriv-dev);

And ipoib_slave_detach() is:

  +static inline void ipoib_slave_detach(struct net_device *dev)
  +{
  +rtnl_lock();
  +netdev_slave_detach(dev);
  +rtnl_unlock();
  +}

so you are calling netdev_slave_detach() with the rtnl lock held.
Why can't you make the same call from the start of unregister_netdevice()?

Anyway, if the rtnl lock is a problem, can you just add the call to
netdev_slave_detach() to unregister_netdev() before it takes the rtnl lock?

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] bnx2: factor out gzip unpacker

2007-09-21 Thread Roland Dreier
  Just change the makefiles to always install gzip'ed modules
  modutils knows how to unzip them on the fly.

But that leaves the uncompressed firmware blobs in .data that ends up
in unswappable kernel memory.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V5 2/11] IB/ipoib: Notify the world before doing unregister

2007-09-20 Thread Roland Dreier
  +ipoib_slave_detach(cpriv-dev);
   unregister_netdev(cpriv-dev);

Maybe you already answered this before, but I'm still not clear why
this notifier call can't just be added to the start of
unregister_netdevice(), so we can avoid having driver needing to know
anything about bonding internals?

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery.

2007-09-20 Thread Roland Dreier
  Roland - can you please queue this up for 2.6.24?

Done, thanks.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-09-19 Thread Roland Dreier
  One other thing I observed is that I can not unload the module as the
  ref counter of the eth device is too low. I haven't tracked down the 
  source of this problem yet.

I suspect that this is because netif_rx_reschedule() was missing a
dev_hold() to match the dev_put() in netif_rx_complete().  Dave merged
a fix for that problem yesterday, so current net-2.6.24 should be OK.
Let us know if it's not OK, because then there's another bug...

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] [PATCH] RDMA/CMA: Implement rdma_resolve_ip retry enhancement.

2007-09-19 Thread Roland Dreier
Thanks for the patch...

  If an application is calling rdma_resolve_ip() and a status of -ENODATA is 
  returned from addr_resolve_local/remote(), the timeout mechanism waits until 
  the application's timeout occurs before rechecking the address resolution 
  status; the application will wait until it's full timeout occurs.  This case 
  is seen when the work thread call to process_req() is made before the arp 
  packet is processed.

I'm having a hard time understanding this changelog.  Could you please
resend with a description that lets me understand:

 - What the current behavior is and what is wrong with that;
 - What the behavior should be;
 - And how your patch changes the behavior to be correct.

  This patch is in addition to Steve Wise's neigh_event_send patch to initiate 
  neighbour discovery sent on 9/12/2007.

Does this mean it depends on Steve's patch being applied first?

Also please try to keep lines in the changelog to 72 characters or so...

  @@ -136,6 +137,7 @@ static void set_timeout(unsigned long ti
   static void queue_req(struct addr_req *req)
   {
   struct addr_req *temp_req;
  +unsigned long req_timeout = msecs_to_jiffies(MIN_ADDR_TIMEOUT_MS) + 
  jiffies;
   
   mutex_lock(lock);
   list_for_each_entry_reverse(temp_req, req_list, list) {
  @@ -145,8 +147,10 @@ static void queue_req(struct addr_req *r
   
   list_add(req-list, temp_req-list);
   
  -if (req_list.next == req-list)
  +if (req_list.next == req-list) {
  +req_timeout = min(req_timeout, req-timeout);
   set_timeout(req-timeout);
  +}
   mutex_unlock(lock);
   }

I don't understand this code.  It seems you keep track of the minimum
timeout, and then ignore the value you computed.  What am I missing?

Thanks,
  Roland
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister

2007-09-19 Thread Roland Dreier
  I think that if there are no other comments, I will submit the entire 11 
  patches 
  again (with changes) to make it easier to merge into the kernel. Since the 
  most of the 
  content in the patch series is in bonding I thought it would be right that 
  Jay will 
  push all the patches to the networking git. Is it OK with you Roland?

Yes, that's fine.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] IPoIB: Fix unregister_netdev hang

2007-09-18 Thread Roland Dreier
Thanks for testing on ehca...

  While using IPoIB over EHCA (rc6 bits), unregister_netdev hangs with

I don't think you're actually using rc6 bits, since in your patch you have:

  -poll_more:

and I think that is only in Dave's net-2.6.24 tree now, right?

  The problem is that the poll handler does netif_rx_complete (which
  does a dev_put) followed by netif_rx_reschedule() to schedule for
  more receives (which again does a dev_put). This reduces refcount to
   0 (depending on how many times netif_rx_complete followed by
  netif_rx_reschedule was called).

Dave, the real problem seems to be that netif_rx_recschedule() calls
__napi_schedule() rather than __netif_rx_schedule(), so it misses the
call to dev_hold() that is needed to balance the dev_put() in
netif_rx_complete().  The current netif_rx_reschedule() looks like it
really should be napi_reschedule(), and we need a new function that
takes a netdev too.  Or am I misunderstanding the refcounting?

I'll send a patch once I've had some breakfast and had a chance to at
least compile it...

Krishna, unfortunately your proposed fix has a race:

  -netif_rx_complete(dev, napi);
  -if (unlikely(ib_req_notify_cq(priv-cq,
  -  IB_CQ_NEXT_COMP |
  -  IB_CQ_REPORT_MISSED_EVENTS)) 
  -netif_rx_reschedule(napi))
  -goto poll_more;
  +if (likely(!ib_req_notify_cq(priv-cq,
  + IB_CQ_NEXT_COMP |
  + IB_CQ_REPORT_MISSED_EVENTS)))

It is possible for an interrupt to happen immediately right here,
before the netif_rx_complete(), so that netif_rx_schedule() gets
called while we are still on the poll list.

  +netif_rx_complete(dev, napi);

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: InfiniBand/RDMA merge plans for 2.6.24

2007-09-18 Thread Roland Dreier
  Roland, could you merge the common TX CQ patch please?
  It actually fixes a real problem.

Yes, I will, but it collides with the net-2.6.24 NAPI rework I think,
so it may not go in until a few days after the merge window.

Have you verified that the patch cures the interrupt overload issues?
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister

2007-09-18 Thread Roland Dreier
  Maybe this new notification function should be in net/core/dev.c
  instead of exporting call_netdevice_notifiers()?

Or actually, does it work to add the call to the notifiers directly in
unregister_netdev() so that device drivers don't have to worry about it?

(And is the existing patch missing a call to notifiers in ipoib_vlan_delete()?)

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-2.6.24] Fix documentation for dev_put()/dev_hold()

2007-09-18 Thread Roland Dreier
It looks like the comments for dev_put() and dev_hold() got reversed somehow.

Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
 include/linux/netdevice.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index be5fe05..239ae68 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1030,7 +1030,7 @@ extern intnetdev_budget;
 extern void netdev_run_todo(void);
 
 /**
- * dev_put - get reference to device
+ * dev_put - release reference to device
  * @dev: network device
  *
  * Hold reference to device to keep it from being freed.
@@ -1041,7 +1041,7 @@ static inline void dev_put(struct net_device *dev)
 }
 
 /**
- * dev_hold - release reference to device
+ * dev_hold - get reference to device
  * @dev: network device
  *
  * Release reference to device to allow it to be freed.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-2.6.24] Fix refcounting problem with netif_rx_reschedule()

2007-09-18 Thread Roland Dreier
netif_rx_complete() takes a netdev parameter and does dev_put() on
that netdev, so netif_rx_reschedule() needs to also take a netdev
parameter and do dev_hold() on it to avoid reference counts from
getting becoming negative because of unbalanced dev_put()s.

This should fix the problem reported by Krishna Kumar
[EMAIL PROTECTED] with IPoIB waiting forever for netdev refcounts
to become 0 during module unload.

Signed-off-by: Roland Dreier [EMAIL PROTECTED]
---
Dave, feel free to roll this up into earlier NAPI conversion patches
(assuming I'm understanding things correctly and this patch actually
makes sense!).

BTW, it looks like drivers/net/ibm_emac/ibm_emac_mal.c would not have 
built in the current net-2.6.24 tree, since its call to
netif_rx_reschedule() was left with the netdev parameter.  So that
file does not need to be touched in this patch.

 drivers/infiniband/ulp/ipoib/ipoib_ib.c |2 +-
 drivers/net/arm/ep93xx_eth.c|2 +-
 drivers/net/ehea/ehea_main.c|2 +-
 drivers/net/ibmveth.c   |2 +-
 include/linux/netdevice.h   |   21 +++--
 5 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 6a2bff4..481e4b6 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -320,7 +320,7 @@ poll_more:
if (unlikely(ib_req_notify_cq(priv-cq,
  IB_CQ_NEXT_COMP |
  IB_CQ_REPORT_MISSED_EVENTS)) 
-   netif_rx_reschedule(napi))
+   netif_rx_reschedule(dev, napi))
goto poll_more;
}
 
diff --git a/drivers/net/arm/ep93xx_eth.c b/drivers/net/arm/ep93xx_eth.c
index f3858d1..7f016f3 100644
--- a/drivers/net/arm/ep93xx_eth.c
+++ b/drivers/net/arm/ep93xx_eth.c
@@ -309,7 +309,7 @@ poll_some_more:
}
spin_unlock_irq(ep-rx_lock);
 
-   if (more  netif_rx_reschedule(napi))
+   if (more  netif_rx_reschedule(dev, napi))
goto poll_some_more;
}
 
diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 4a5ab4a..9a499f4 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -636,7 +636,7 @@ static int ehea_poll(struct napi_struct *napi, int budget)
if (!cqe  !cqe_skb)
return 0;
 
-   if (!netif_rx_reschedule(napi))
+   if (!netif_rx_reschedule(dev, napi))
return 0;
}
 
diff --git a/drivers/net/ibmveth.c b/drivers/net/ibmveth.c
index b8d7cec..b94f266 100644
--- a/drivers/net/ibmveth.c
+++ b/drivers/net/ibmveth.c
@@ -973,7 +973,7 @@ static int ibmveth_poll(struct napi_struct *napi, int 
budget)
netif_rx_complete(netdev, napi);
 
if (ibmveth_rxq_pending_buffer(adapter) 
-   netif_rx_reschedule(napi)) {
+   netif_rx_reschedule(netdev, napi)) {
lpar_rc = h_vio_signal(adapter-vdev-unit_address,
   VIO_IRQ_DISABLE);
goto restart_poll;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index be5fe05..0dbf185 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1198,16 +1198,6 @@ static inline u32 netif_msg_init(int debug_value, int 
default_msg_enable_bits)
return (1  debug_value) - 1;
 }
 
-/* Try to reschedule poll. Called by dev-poll() after netif_rx_complete().  */
-static inline int netif_rx_reschedule(struct napi_struct *n)
-{
-   if (napi_schedule_prep(n)) {
-   __napi_schedule(n);
-   return 1;
-   }
-   return 0;
-}
-
 /* Test if receive needs to be scheduled but only if up */
 static inline int netif_rx_schedule_prep(struct net_device *dev,
 struct napi_struct *napi)
@@ -1234,6 +1224,17 @@ static inline void netif_rx_schedule(struct net_device 
*dev,
__netif_rx_schedule(dev, napi);
 }
 
+/* Try to reschedule poll. Called by dev-poll() after netif_rx_complete().  */
+static inline int netif_rx_reschedule(struct net_device *dev,
+ struct napi_struct *napi)
+{
+   if (napi_schedule_prep(napi)) {
+   __netif_rx_schedule(dev, napi);
+   return 1;
+   }
+   return 0;
+}
+
 /* same as netif_rx_complete, except that local_irq_save(flags)
  * has already been issued
  */
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-2.6.24] Fix refcounting problem with netif_rx_reschedule()

2007-09-18 Thread Roland Dreier
  Further complicating things is that you need to setup a ppc32
  cross-build environment to even build test a conversion, and I'm not
  comfortable doing the surgery until I can test build the thing.

OK, I actually have a system with a ppc 440 SoC that uses this driver,
so I'll try to get things to the stage where I can boot net-2.6.24 on
it and see if I can get the driver working...
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] IPoIB: Fix unregister_netdev hang

2007-09-18 Thread Roland Dreier
   and I think that is only in Dave's net-2.6.24 tree now, right?
  
  Nope, that was what I downloaded yesterday:
  
  VERSION = 2
  PATCHLEVEL = 6
  SUBLEVEL = 23
  EXTRAVERSION =-rc6
  NAME = Pink Farting Weasel

Please double check your tree.  I just very carefully looked at my
trees, and the poll_more: label is added in commit 6b460a71 ([NET]:
Make NAPI polling independent of struct net_device objects.) which is
only in the net-2.6.24 tree.  Of course Dave did not change the
version information in the Makefile since he wouldn't want Linus to
pick up any extra strange changes when he pulls, so a net-2.6.24 tree
will look like 2.6.23-rc6 as you quoted.

And the refcounting bug I fixed is only in net-2.6.24.

  To be clear, netif_rx_schedule while we are still in the poll list will not
  do any harm as it does nothing since NAPI_STATE_SCHED is still set (cleared
  by netif_rx_complete which has not yet run). Effectively we lost/delayed
  processing an interrupt, if I understood the code right.

Right, we lose an interrupt, and since the CQ events are one-shot, we
never get another one, and the interface is effectively dead.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: InfiniBand/RDMA merge plans for 2.6.24

2007-09-17 Thread Roland Dreier
   IPoIB CM handles this properly by gathering together single pages in
   skbs' fragment lists.

  Then can we reuse IPoIB CM code here?

Yes, if possible, refactoring things so that the rx skb allocation
code becomes common between CM and non-CM would definitely make sense.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

2007-09-17 Thread Roland Dreier
  The IGMP enabling patch posted by me on September 2nd isn't on your list
  http://lists.openfabrics.org/pipermail/general/2007-September/040250.html
  can you add it?

Yes, I lost that somehow.  I will add it to my list of things to take
a look at (no opinion yet).

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/11] IB/ipoib: Export call to call_netdevice_notifiers and add new private flag

2007-09-17 Thread Roland Dreier
I tried to look at the ipoib stuff in this series... this patch looks
fine but it doesn't actually touch ipoib, so the subject line is a bit
misleading...
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/11] IB/ipoib: Verify address handle validity on send

2007-09-17 Thread Roland Dreier
Looks fine overall, with one minor nitpick:

  -if (unlikely(memcmp(neigh-dgid.raw,
  +if (unlikely((memcmp(neigh-dgid.raw,
   skb-dst-neighbour-ha + 4,
  -sizeof(union ib_gid {
  +sizeof(union ib_gid))) ||
  + (neigh-dev != dev))) {

the indentation here makes this confusing to read -- I would just do:

} else if (neigh-ah) {
if (unlikely(memcmp(neigh-dgid.raw,
skb-dst-neighbour-ha + 4,
-   sizeof(union ib_gid {
+   sizeof(union ib_gid)) ||
+neigh-dev != dev)) {
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister

2007-09-17 Thread Roland Dreier
Actually, thinking about this some more... would it be cleaner to more
the knowledge about bonding out of the ipoib driver?  in other words,
export something similar to

  +static int ipoib_slave_detach(struct net_device *dev)
  +{
  +int ret = 0;
  +if (dev-flags  IFF_SLAVE) {
  +dev-priv_flags |= IFF_SLAVE_DETACH;
  +rtnl_lock();
  +ret = call_netdevice_notifiers(NETDEV_CHANGE, dev);
  +rtnl_unlock();
  +}
  +return ret;
  +}

for drivers to use, rather than putting use of IFF_SLAVE and
IFF_SLAVE_DETACH outside of the bonding driver.

Also it seems this function could return void, since both call sites
ignore the return value and I don't see anything sensible that IPoIB
could do with the notifier chain return value anyway.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister

2007-09-17 Thread Roland Dreier
OK with me.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/11] IB/ipoib: Bound the net device to the ipoib_neigh structue

2007-09-17 Thread Roland Dreier
Overall idea looks good... one comment:

  +if (n-dev-flags  IFF_MASTER) {
  +/* n-dev is not an IPoIB device and we have
  +to take priv from elsewhere */
  +neigh = *to_ipoib_neigh(n);
  +if (neigh) {
  +priv = netdev_priv(neigh-dev);
  +ipoib_dbg(priv, neigh_destructor for bonding device: 
  %s\n,
  +  n-dev-name);
  +} else
  +return;
  +}

seems like it would be cleaner not to worry about bonding here and
just use neigh-dev all the time rather than having two ways to look
up the device.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] IB/ipoib: Notify the world before doing unregister

2007-09-17 Thread Roland Dreier
   Conceptually, I see your point and I'm ok with doing it either
  way.  My only question is, would this change would make the ipoib module
  dependent upon having the bonding module loaded (to resolve all of the
  symbols)?

Yes, I guess so, if that function is in bonding.  Hmm, that wouldn't
be a good change.

Maybe this new notification function should be in net/core/dev.c
instead of exporting call_netdevice_notifiers()?

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

2007-09-14 Thread Roland Dreier
  Maybe you should automatically create an alias each time new interface
  is added so that admin would not care about proper aliases?

I agree that makes much more sense from a user interface point of
view.  Unfortunately an alias without an address doesn't make sense,
so there doesn't seem to be a way to implement that.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: InfiniBand/RDMA merge plans for 2.6.24

2007-09-14 Thread Roland Dreier
  The patch is just needed to pick up broadcast MTU size instead of hard
  coding 2K right now. SKB allocation shouldn't be different with Ethernet
  Jambo Frame and IPoIB-CM which 64K MTU. I don't understand why it's
  different. Could you please explain this?

It's exactly the same problem as ethernet jumbo frames.  A web search
for 'order 1 failure e1000' might be interesting.

IPoIB CM handles this properly by gathering together single pages in
skbs' fragment lists.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

2007-09-14 Thread Roland Dreier
   I've been meaning to track down the bnx2 iscsi offload patch to look
   and see if this issue is addressed, since the same problem seems to
   exist: it seems an iscsi connection and a main stack tcp connection
   might share the same 4-tuple unless something is done to avoid that
   happening.

  iSCSI does not do passive listens, only active connections to the
  target.  But you're right, the port space is still shared between iSCSI
  and the main stack.  We currently rely on user apps binding to the main
  stack to reserve certain ephemeral ports, and telling the iSCSI driver
  which ports to use.

Got it... I wasn't thinking that clearly, but it is clear that a full
4-tuple collision with only active connections is quite unlikely.  I
guess you would have to make both an offloaded and a non-offloaded
iSCSI connection to the same target and get really unlucky with
ephemeral port allocation.  So in practice I guess it's not an issue
at all with your driver yet.

However, do you have any plans to support iSCSI offload for targets?
Also, looking at the first CNIC patch, I can't help but notice that
you seem to have at least some support for iWARP there.  How does the
CNIC look?  Does it share the same interface/addresses as the
non-offload NIC, or does it create a completely separate netdevice?

I want to make sure that whatever solution we come up with for cxgb3
doesn't cause problems for you.  And of course if you have a better
idea than what Steve has come up with, that would be great :)

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


InfiniBand/RDMA merge plans for 2.6.24

2007-09-13 Thread Roland Dreier
 support.  Would be nice to see a review
   from someone at Mellanox.

 - Arthur's mthca doorbell alignment fixes.  I will experiment with a
   few different approaches and post what I like (and fix mlx4 as
   well).  I hope Arthur can review.

 - Michael's mlx4 WQE shrinking patch.  Not sure yet; I'll reply to
   the latest patch directly.

Here are a few topics that I believe will not be ready in time for the
2.6.24 window and will need to wait for 2.6.25:

 - Multiple CQ event vector support.  I haven't seen any discussions
   about how ULPs or userspace apps should decide which vector to use,
   and hence no progress has been made since we deferred this during
   the 2.6.23 merge window.

 - XRC.  Given the length of the backlog above and the fact that a
   first draft of this code has not been posted yet, I don't see any
   way that we could have something this major ready in time.

Here is the complete list of patches I have in my for-2.6.24 branch
waiting for the merge window so far.  Mostly I haven't merged anything
big out of my backlog, so this is essentially all

Ali Ayoub (1):
  IB/sa: Error handling thinko fix

Anton Blanchard (3):
  IB/fmr_pool: Clean up some error messages in fmr_pool.c
  IB/ehca: Make output clearer by removing some debug messages
  IB/ehca: Export module parameters in sysfs

Dotan Barak (1):
  mlx4_core: Use enum value GO_BIT_TIMEOUT_MSECS

Eli Cohen (2):
  IPoIB: Fix typo to end statement with ';' instead of ','
  IPoIB: Fix error path memory leak

Michael S. Tsirkin (2):
  mlx4_core: Enable MSI-X by default
  IB/mthca: Enable MSI-X by default

Peter Oruba (1):
  IB/mthca: Use PCI-X/PCI-Express read control interfaces

Roland Dreier (6):
  IPoIB: Make sure no receives are handled when stopping device
  IB: find_first_zero_bit() takes unsigned pointer
  mlx4_core: Don't free special QPs in QP number bitmap
  IB/mlx4: Use set_data_seg() in mlx4_ib_post_recv()
  IB/ehca: Include linux/mutex.h from ehca_classes.h
  IB/mlx4: Fix up SRQ limit_watermark endianness

Steve Wise (1):
  RDMA/cxgb3: Make the iw_cxgb3 module parameters writable
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: InfiniBand/RDMA merge plans for 2.6.24

2007-09-13 Thread Roland Dreier
  Since ehca can support 4K MTU, we would like to see a patch in 
  IPoIB to allow link MTU to be up to 4K instead of current 2K for 2.6.24 
  kernel. The idea is IPoIB link MTU will pick up a return value from SM's 
  default broadcast MTU. This patch should be a small patch, I hope you are 
  OK with this.

It's actually not small, since it turns the skb allocation into a
4100-byte buffer, which ends up being more than 1 page usually, which
means it fails if memory is fragmented.

Anyway given the backlog anything substantial that hasn't been posted
already is almost surely going to have to wait until 2.6.25.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

2007-09-13 Thread Roland Dreier
   - My user_mad P_Key index support patch.  I'll test the ioctl to
 change to the new mode and merge this I guess, since Hal and Sean
 have tested this out.
  
  I can give this patch a reviewed-by: too, and I will also try to review a 
  couple
  of the pending ipoib patches.

Thanks!

   - Sean's QoS changes.  These look fine at first glance, and I just
 plan to understand the backwards compatibility story (ie how this
 works with an old SM) and merge.  Anyone who objects let me know.
  
  The new QoS fields fall into fields that are currently reserved, which 
  should be
  ignored by an older SM.  I've only tested this against openSM however.

That seems OK -- I'm OK with breaking things if an SM is clearly buggy
(and not ignoring fields that are defined to be ignored in the spec
would certainly be a clear bug to me).

  This patch was generated in response to an Intel MPI issue.  We've seen MPI 
  take
  several minutes to respond to a connection request during the middle of large
  application runs.  When this happens, the active side times out the 
  connection.
  In OFED, we added module parameters to adjust the rdma_cm connection timeout 
  on
  the active side, but I believe that sending an MRA from the passive side is a
  better solution.

OK -- just to make sure I'm understanding what you're saying: have you
confirmed that your proposed patches actually fix the issue?

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

2007-09-13 Thread Roland Dreier
  I was about to post v2 of my patch to avoid port space collisions with
  the native stack.  Can we get that 2.6.24?  It is high priority
  IMO. I've tried to solicit review on it, but I think folks are
  reluctant... ;-)

I would like to get this in, but I'm still at least a little
reluctant, since we would be committing to a user interface that seems
a little awkward at best, so I'd like to try and find something
better.  Just to summarize my understanding:

 - your patch requires the administration to configure an ethX:iwY
   alias address to use iwarp.  (By the way is there anything other
   than don't do that that avoids assigning the same address to the
   iwarp alias and a non-iwarp interface?)

 - it would be nicer to create the alias automatically, but an alias
   without an address doesn't make sense.  Creating a whole separate
   net device causes problems because the iwarp stuff still needs to
   use the main net device to do ARP etc.

 - so I'm out of better ideas but I still want to push back a little
   before we commit to something ugly.

I've been meaning to track down the bnx2 iscsi offload patch to look
and see if this issue is addressed, since the same problem seems to
exist: it seems an iscsi connection and a main stack tcp connection
might share the same 4-tuple unless something is done to avoid that
happening.

Also, I think it behooves us to get some agreement on this approach
with NetEffect and Kanoj (NetXen?) at least, since their iwarp drivers
seem to be imminent.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

2007-09-13 Thread Roland Dreier
  Well, if it involves /sharing/ port space with the native stack,
  i.e. where port 1234 is IB but 1235 is Linux, pretty much all the
  networking devs have NAK'd that approach AFAICS.

Just to be clear, InfiniBand has no problem; the issue is port
collisions involving iWARP connections.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts with the host stack.

2007-09-05 Thread Roland Dreier
   What's wrong with my suggestion of having the iwarp driver create an
   iwX interface to go with the normal ethX interface?  It seems
   simpler to me, and there's a somewhat similar precedent with how
   mac80211 devices create both wlan0 and wmaster0 interfaces.
- R.
  
  It seemed much more painful for me to implement. :-)
  
  I'll look into this, but I think for this to be done, the changes must
  be in the cxgb3 driver, not the rdma driver, because the guts of the
  netdev struct are all private to cxgb3.  Remember that this interface
  needs to still do non TCP traffic (like ARP and UDP)...
  
  Maybe you have something in mind here that I'm not thinking about?

No, I was just spouting off.

But the whole create a magic alias seems kind of unfriendly to the
user.  Maybe as you said, the cxgb3 net driver could create the alias
for the iw_cxgb3 driver?

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts with the host stack.

2007-08-30 Thread Roland Dreier
  The sysadmin creates for iwarp use only alias interfaces of the form
  devname:iw* where devname is the native interface name (eg eth0) for the
  iwarp netdev device.  The alias label can be anything starting with iw.
  The iw immediately after the ':' is the key used by the iwarp driver.

What's wrong with my suggestion of having the iwarp driver create an
iwX interface to go with the normal ethX interface?  It seems
simpler to me, and there's a somewhat similar precedent with how
mac80211 devices create both wlan0 and wmaster0 interfaces.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-28 Thread Roland Dreier
Sorry for the long latency, I was at the beach all last week.

   And direct data placement really does give you a factor of two at
   least, because otherwise you're stuck receiving the data in one
   buffer, looking at some of the data at least, and then figuring out
   where to copy it.  And memory bandwidth is if anything becoming more
   valuable; maybe LRO + header splitting + page remapping tricks can get
   you somewhere but as NCPUS grows then it seems the TLB shootdown cost
   of page flipping is only going to get worse.

  As Herbert has said already, people can code for this just like
  they have to code for RDMA.

No argument, you need to change the interface to take advantage of RDMA.

  There is no fundamental difference from converting an application
  to sendfile or similar.

Yes, on the transmit side, there's not much difference from sendfile
or splice, although RDMA may give a slightly nicer interface that also
gives basically the equivalent of AIO.

  The only thing this needs is a
  recvmsg_I_dont_care_where_the_data_is() call.  There are no alignment
  issues unless you are trying to push this data directly into the
  page cache.

I don't understand how this gives you the same thing as direct data
placement (DDP).  There are many situations where the sender knows
where the data has to go and if there's some way to pass that to the
receiver, so that info can be used in the receive path to put the data
in the right place, the receiver can save a copy.  This is
fundamentally the same offload that an FC HBA does -- the SCSI
midlayer queues up commands like read block A and put the data at
address X and read block B and put the data at address Y and the
HBA matches tags on incoming data to put the blocks at the right
addresses, even if block B is received before block A.

RFC 4297 has some discussion of the various approaches, and while you
might not agree with their conclusions, it is interesting reading.

  Couple this with a card that makes sure that on a per-page basis, only
  data for a particular flow (or group of flows) will accumulate.

It seems that the NIC would also have to look into a TCP stream (and
handle out of order segments etc) to find message boundaries for this
to be equivalent to what an RDMA NIC does.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2.6.23 RESEND] cxgb3 - Fix dev-priv usage

2007-08-28 Thread Roland Dreier
  I take that back.  Rejected -- it breaks infiniband build.

To be more precise:

drivers/infiniband/hw/cxgb3/cxio_hal.c: In function 'cxio_rdev_open':
drivers/infiniband/hw/cxgb3/cxio_hal.c:919: error: implicit declaration of 
function 'T3CDEV'

it seems the problem is that T3CDEV() has been deleted and been
replaced with the dev2t3cdev() inline function.  However a simple
replacement s/T3CDEV/dev2t3cdev/ in drivers/infiniband/hw/cxgb3
doesn't work because the function has moved from t3cdev.h to
adapter.h; and moving the function back to t3cdev.h doesn't work
because it depends on more structure definitions now.

And at that point I gave up...

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2.6.23 0/2] cxgb3 - Fix dev-priv usage

2007-08-28 Thread Roland Dreier
Looks OK to me but I would just roll up the second patch into the
first patch and let Jeff merge it as one commit.  There's no point in
creating an intermediate tree that doesn't build -- it just breaks git
bisect for no useful purpose.

Also as a side note, when trying to test this I got the message

could not load TP SRAM: unable to load t3a_protocol_sram-1.0.44.bin

and you guys seem to only have t3b protocol sram images on your web
site.  Could you send me the t3a file (or swap out my T3A boards for
T3B boards ;)?

Thanks,
  Roland
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-20 Thread Roland Dreier
[TSO / LRO discussion snipped -- it's not the main point so no sense
spending energy arguing about it]

  Just be realistic and accept that RDMA is a point in time solution,
  and like any other such technology takes flexibility away from users.
  
  Horizontal scaling of cpus up to huge arity cores, network devices
  using large numbers of transmit and receive queues and classification
  based queue selection, are all going to work to make things like RDMA
  even more irrelevant than they already are.

To me there is a real fundamental difference between RDMA and
traditional SOCK_STREAM / SOCK_DATAGRAM networking, namely that
messages can carry the address where they're supposed to be
delivered (what the IETF calls direct data placement).  And on top
of that you can build one-sided operations aka put/get aka RDMA.

And direct data placement really does give you a factor of two at
least, because otherwise you're stuck receiving the data in one
buffer, looking at some of the data at least, and then figuring out
where to copy it.  And memory bandwidth is if anything becoming more
valuable; maybe LRO + header splitting + page remapping tricks can get
you somewhere but as NCPUS grows then it seems the TLB shootdown cost
of page flipping is only going to get worse.

Don't get too hung up on the fact that current iWARP (RDMA over IP)
implementations are using TCP offload -- to me that is just a side
effect of doing enough processing on the NIC side of the PCI bus to be
able to do direct data placement.  InfiniBand with competely different
transport, link and physical layers is one way to implement RDMA
without TCP offload and I'm sure there will be others -- eg Intel's
IOAT stuff could probably evolve to the point where you could
implement iWARP with software TCP and the data placement offloaded to
some DMA engine.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-17 Thread Roland Dreier
   Isn't RDMA _part_ of the software net stack within Linux?

  It very much is not so.

This is just nit-picking.  You can draw the boundary of the software
net stack wherever you want, but I think Sean's point was just that
RDMA drivers already are part of Linux, and we all want them to get
better.

  When using RDMA you lose the capability to do packet shaping,
  classification, and all the other wonderful networking facilities
  you've grown to love and use over the years.

Same thing with TSO and LRO and who knows what else.  I know you're
going to make a distinction between stateless and stateful
offloads, but really it's just an arbitrary distinction between things
you like and things you don't.

  Imagine if you didn't know any of this, you purchase and begin to
  deploy a huge piece of RDMA infrastructure, you then get the mandate
  from IT that you need to add firewalling on the RDMA connections at
  the host level, and oh shit you can't?

It's ironic that you bring up firewalling.  I've had vendors of iWARP
hardware tell me they would *love* to work with the community to make
firewalling work better for RDMA connections.  But instead we get the
catch-22 of your changing arguments -- first, you won't even consider
changes that might help RDMA work better in the name of
maintainability; then you have to protect poor, ignorant users from
accidentally using RDMA because of some problem or another; and then
when someone tries to fix some of the problems you mention, it's back
to step one.

Obviously some decisions have been prejudged here, so I guess this
moves to the realm of politics.  I have plenty of interesting
technical stuff, so I'll leave it to the people with a horse in the
race to find ways to twist your arm.

 - R.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   >