Re: [PATCH net-next 4/6] net: hip04: Don't free the tx skb when the buffer is not ready for xmit
On 2015/4/15 22:19, Arnd Bergmann wrote:
> On Wednesday 15 April 2015 20:30:06 Ding Tianhong wrote:
>> @@ -489,6 +487,8 @@ static int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>>
>>  	/* Ensure tx_head update visible to tx reclaim */
>>  	smp_wmb();
>> +	count++;
>> +	priv->tx_head = TX_NEXT(tx_head);
>>
>
> Something is wrong here, the comment does not match the code any more, because
> the smp_wmb() won't actually make the tx_head update visible.
>
> 	Arnd

Yes, the smp_wmb() can only make sure the tx buffer is ready to xmit.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 2/6] net: hip04: use the big endian for tx and rx desc
On 2015/4/15 22:13, Arnd Bergmann wrote:
> On Wednesday 15 April 2015 20:30:04 Ding Tianhong wrote:
>> The hip04 ethernet uses big endian for tx and rx, so set the desc to
>> big endian and remove the unused next_addr.
>
> I don't understand:
>
>> diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
>> index b0a7f03..6473462 100644
>> --- a/drivers/net/ethernet/hisilicon/hip04_eth.c
>> +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
>> @@ -132,19 +132,18 @@
>>  #define HIP04_MIN_TX_COALESCE_FRAMES	1
>>
>>  struct tx_desc {
>> -	u32 send_addr;
>> -	u32 send_size;
>> -	u32 next_addr;
>> -	u32 cfg;
>> -	u32 wb_addr;
>> +	__be32 send_addr;
>> +	__be32 send_size;
>> +	__be32 cfg;
>> +	__be32 wb_addr;
>>  } __aligned(64);
>
> I would think this is a hardware structure, does this not break
> access to the cfg and wb_addr fields if you remove next_addr?
>
> 	Arnd

Good catch, I made a mistake here, thanks.
Re: [PATCH net-next 0/6] net: hip04: fix some problem for hip04 drivers
On 2015/4/15 22:25, Arnd Bergmann wrote:
> On Wednesday 15 April 2015 20:30:02 Ding Tianhong wrote:
>> The patch series fixes the issues of the hip04 driver, and adds
>> some optimizations according to good suggestions.
>
> Thanks, that looks much better, except for patch 4 that I commented on.

I will check and fix it.

> I had at some point sent a patch that is archived at
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-January/318120.html
>
> I believe that one is also still needed, but I have not tested whether it
> is correct. Can you have a look at the patch from back then and see if it
> works, or if you find something wrong about it?
>
> I'm sending the unmodified patch from then here again for you to apply
> or comment. It will have to be rebased on top of your current changes.
>
> 	Arnd

Thanks for the suggestion. I noticed that I had missed this patch from the mailing list in my series; I need time to check the code and try to test it. Thanks for the work. :)

Ding

> 8<
> Subject: [PATCH] net/hip04: refactor interrupt masking
>
> The hip04 ethernet driver currently acknowledges all interrupts directly
> in the interrupt handler, and leaves all interrupts except the RX data
> enabled the whole time. This causes multiple problems:
>
> - When more packets come in between the original interrupt and the
>   NAPI poll function, we will get an extraneous RX interrupt as soon
>   as interrupts are enabled again.
>
> - The two error interrupts are by definition combining all errors that
>   may have happened since the last time they were handled, but just
>   acknowledging the irq without dealing with the cause of the condition
>   makes it come back immediately. In particular, when NAPI is intentionally
>   stalling the rx queue, this causes a storm of "rx dropped" messages.
>
> - The access to the 'reg_inten' field in hip_priv is used for serializing
>   access, but is in fact racy itself.
>
> To deal with these issues, the driver is changed to only acknowledge
> the IRQ at the point when it is being handled. The RX interrupts now get
> acked right before reading the number of received frames, to keep spurious
> interrupts to the absolute minimum without losing interrupts.
>
> As there is no tx-complete interrupt, only an informational tx-dropped
> one, we now ack that in the tx reclaim handler, hoping that clearing
> the descriptors has also eliminated the error condition.
>
> The only place that reads reg_inten now just relies on
> napi_schedule_prep() to return whether NAPI is active or not, and
> the poll() function is then going to ack and re-enable the IRQ if
> necessary.
>
> Finally, when we disable interrupts, they are now all disabled
> together, in order to simplify the logic and to avoid getting
> rx-dropped interrupts when NAPI is in control of the rx queue.
>
> Signed-off-by: Arnd Bergmann
>
> diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
> index 525214ef5984..83773247a368 100644
> --- a/drivers/net/ethernet/hisilicon/hip04_eth.c
> +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
> @@ -56,6 +56,8 @@
>  #define RCV_DROP		BIT(7)
>  #define TX_DROP			BIT(6)
>  #define DEF_INT_ERR		(RCV_NOBUF | RCV_DROP | TX_DROP)
> +#define DEF_INT_RX		(RCV_INT | RCV_NOBUF | RCV_DROP)
> +#define DEF_INT_TX		(TX_DROP)
>  #define DEF_INT_MASK		(RCV_INT | DEF_INT_ERR)
>
>  /* TX descriptor config */
> @@ -154,7 +156,6 @@ struct hip04_priv {
>  	unsigned int port;
>  	unsigned int speed;
>  	unsigned int duplex;
> -	unsigned int reg_inten;
>
>  	struct napi_struct napi;
>  	struct net_device *ndev;
> @@ -303,17 +304,15 @@ static void hip04_mac_enable(struct net_device *ndev)
>  	val |= GE_RX_PORT_EN | GE_TX_PORT_EN;
>  	writel_relaxed(val, priv->base + GE_PORT_EN);
>
> -	/* clear rx int */
> -	val = RCV_INT;
> -	writel_relaxed(val, priv->base + PPE_RINT);
> +	/* clear prior interrupts */
> +	writel_relaxed(DEF_INT_MASK,
priv->base + PPE_RINT);
>
>  	/* config recv int */
>  	val = GE_RX_INT_THRESHOLD | GE_RX_TIMEOUT;
>  	writel_relaxed(val, priv->base + PPE_CFG_RX_PKT_INT);
>
>  	/* enable interrupt */
> -	priv->reg_inten = DEF_INT_MASK;
> -	writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);
> +	writel_relaxed(DEF_INT_MASK, priv->base + PPE_INTEN);
>  }
>
>  static void hip04_mac_disable(struct net_device *ndev)
> @@ -322,8 +321,7 @@ static void hip04_mac_disable(struct net_device *ndev)
>  	u32 val;
>
>  	/* disable int */
> -	priv->reg_inten &= ~(DEF_INT_MASK);
> -	writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);
> +	writel_relaxed(0, priv->base + PPE_INTEN);
>
>  	/* disable tx & rx */
>
Re: [PATCH 1/7] net: refactor __netif_receive_skb_core
On 15.04, Jesper Dangaard Brouer wrote:
> > Out of curiosity, what is actually the performance impact of all
> > this? We were just arguing on a different matter about two more
> > instructions in the fast path; here it's refactoring the whole
> > function into several ones, I presume gcc won't inline it.
>
> Pablo asked me to performance test this change. Full test report below.
>
> The performance effect (of this patch) depends on the gcc compiler
> version.
>
> Two tests:
>  1. IP forwarding (unloaded netfilter modules)
>  2. Early drop in the iptables "raw" table
>
> With GCC 4.4.7, which does not inline the new functions
> (__netif_receive_skb_ingress and __netif_receive_skb_finish), the
> performance impact/regression is definitely measurable.
>
> With GCC 4.4.7:
>  1. IP forwarding: +25.18 ns (slower) (-27776 pps)
>  2. Early drop:    +7.55 ns (slower) (-66577 pps)
>
> With GCC 4.9.1, the new functions get inlined, thus the refactoring
> split-up of __netif_receive_skb_core() is basically "cancelled".
> Strangely there is a small improvement for forwarding, likely due to
> some lucky assembler reordering that gives fewer icache/fetch misses.
> The early-drop improvement is below accuracy levels and cannot be
> trusted.
>
> With GCC 4.9.1:
>  1. IP forwarding: -10.05 ns (faster) (+17532 pps)
>  2. Early drop:    -1.54 ns (faster) (+16216 pps), below accuracy levels
>
> I don't know what to conclude, as the result depends on the compiler
> version... but these kinds of changes do affect performance, and should
> be tested/measured.

Thanks Jesper. This effect without inlining was to be expected, I guess.

The interesting question would be a patch that uses nf_hook() without the okfn callback, moves the ingress qdisc to register with the netfilter ingress hook, and moves the TTL tracking of ing_filter() to the ingress qdisc, where it belongs. My expectation would be that this would result in an overall performance gain.
Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
On 16.04, Herbert Xu wrote:
> On Thu, Apr 16, 2015 at 06:24:00AM +0100, Patrick McHardy wrote:
> >
> > Netfilter may change the contents of the packet, even change its size.
> > It is *really* hard to do this while keeping the original fragments
> > intact.
>
> Perhaps we should provide better helpers to facilitate this?
>
> So instead of directly manipulating the content of the skb you
> would do so through helpers, and the helpers can then try to do
> sensible things with the fragments.

Yeah, but it's going to get pretty complicated. There can be multiple size changes and modifications for every packet, so we would need to keep track of the size modifications that have already occurred to map to the correct position in the frag_list. The modifications would then have to be performed both on the reassembled skb, since it might again be matched against later on, and on the frag_list, potentially split over multiple skbs.

When enlarging the packet, I guess we would insert a new (small) fragment to keep the geometry of the original fragments. In extreme cases like H.323 NAT this might result in a huge number of new, very small fragments. And it might still break the geometry when the size difference isn't representable as a multiple of 8.

When reducing the skb size, we might have no choice but to change the geometry of at least one fragment and would have to shift a lot of data around.

This is very complicated. If we really want to do this, it's a lot easier to just keep note of the full original geometry and refragment to those exact sizes while potentially adding or removing things at the end.

My personal opinion is: why bother. The only thing that cares about the fragment sizes is PMTUD, and that's what this patch is fixing in a much simpler fashion.
Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
On Thu, Apr 16, 2015 at 06:24:00AM +0100, Patrick McHardy wrote:
>
> Netfilter may change the contents of the packet, even change its size.
> It is *really* hard to do this while keeping the original fragments
> intact.

Perhaps we should provide better helpers to facilitate this?

So instead of directly manipulating the content of the skb you would do so through helpers, and the helpers can then try to do sensible things with the fragments.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
On 16.04, Herbert Xu wrote:
> David Miller wrote:
> >
> > Because then there is no ambiguity at all, you preserve on output
> > exactly what you had on input. The same geometry, the same
> > everything. No special checks, no max frag len, none of this crap.
> > Those are all hacks trying to work around the _fundamental_ issue
> > which is that we potentially change the thing when we refrag.
>
> Agreed. Doing anything other than preserving the original geometry
> is simply wrong.
>
> However, this doesn't mean that netfilter has to process each
> fragment. What we could do is to preserve the original fragments
> in frag_list and then process the overall skb as a unit in netfilter.
>
> On output we simply fragment according to the original frag_list.
>
> The only thing to watch out for is to eliminate anything in the
> middle that tries to linearise the skb.

Netfilter may change the contents of the packet, even change its size. It is *really* hard to do this while keeping the original fragments intact.
Re: [PATCH] neighbour.c: Avoid GC directly after state change
Hi,

Ulf Samuelsson wrote:
> The desired functionality is that if communication stops,
> you want to send out ARP probes, before the entry is deleted.
>
> The current (pseudo) code of the neigh timer is:
>
> if (state & NUD_REACHABLE) {
>     if (now <= "confirmed" + "reachable_time") {
>         ... /* We are OK */
>     } else if (now < "used" + DELAY_PROBE_TIME) { /* Never happens */
>         state = NUD_DELAY;
>     } else {
>         state = NUD_STALE;
>         notify = 1;
>     }
>
> We never see the state being changed from REACHABLE to DELAY,
> so the probes are not being sent out; instead you always go
> from REACHABLE to STALE.

That's right.

> DELAY_PROBE_TIME is set to (5 x HZ), and "used"
> seems to be only set by the periodic_work routine
> when the neigh entry is in STALE state, and then it is too late.
> It is also set by "arp_find", which is used by "broken" devices.

In STALE state, neigh->used is set by neigh_event_send(), called by neigh_resolve_output() via neigh->output().

> In practice, the second condition "(now < "used" + DELAY_PROBE_TIME)" is
> never used.
> What is the intention of this test?

That's right. It is NOT used in normal conditions unless the reachable time is too short.

> By adding a new test + parameter, we would get the desired functionality,
> and no need to listen for notifications or do ARP state updates from
> applications.
>
>  if (now <= "confirmed" + "reachable_time") {
>      ... /* We are OK */
> +} else if (now <= "confirmed" + "reprobe_time") {
> +    state = NUD_DELAY;
>  } else if (now < "used" + DELAY_PROBE_TIME) { /* Never happens */
>      state = NUD_DELAY;
>  } else {
>      state = NUD_STALE;
>      notify = 1;
>  }
>
> This way the entry would remain in REACHABLE while normal communication
> occurs, then it would enter DELAY state to probe, and if that fails, it
> goes to STALE state.

No, that is not what REACHABLE and DELAY mean.

From RFC 2461:
| REACHABLE	Roughly speaking, the neighbor is known to have been
|		reachable recently (within tens of seconds ago).
:
| STALE		The neighbor is no longer known to be reachable but
|		until traffic is sent to the neighbor, no attempt
|		should be made to verify its reachability.
| DELAY		The neighbor is no longer known to be reachable, and
|		traffic has recently been sent to the neighbor.
|		Rather than probe the neighbor immediately, however,
|		delay sending probes for a short while in order to
|		give upper layer protocols a chance to provide
|		reachability confirmation.

--
Hideaki Yoshifuji
Technical Division, MIRACLE LINUX CORPORATION
[PATCH] dsa: mv88e6xxx: Fix error handling in mv88e6xxx_set_port_state
Return the correct error code if _mv88e6xxx_reg_read returns an error.

Fixes: facd95b2e0ec0 ("net: dsa: mv88e6xxx: Add Hardware bridging support")
Signed-off-by: Guenter Roeck
---
 drivers/net/dsa/mv88e6xxx.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index fc8d3b6ffe8e..f64186a5f634 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -908,8 +908,10 @@ static int mv88e6xxx_set_port_state(struct dsa_switch *ds, int port, u8 state)
 	mutex_lock(&ps->smi_mutex);

 	reg = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_CONTROL);
-	if (reg < 0)
+	if (reg < 0) {
+		ret = reg;
 		goto abort;
+	}

 	oldstate = reg & PORT_CONTROL_STATE_MASK;
 	if (oldstate != state) {
--
2.1.0
Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
David Miller wrote:
>
> Because then there is no ambiguity at all, you preserve on output
> exactly what you had on input. The same geometry, the same
> everything. No special checks, no max frag len, none of this crap.
> Those are all hacks trying to work around the _fundamental_ issue
> which is that we potentially change the thing when we refrag.

Agreed. Doing anything other than preserving the original geometry is simply wrong.

However, this doesn't mean that netfilter has to process each fragment. What we could do is to preserve the original fragments in frag_list and then process the overall skb as a unit in netfilter.

On output we simply fragment according to the original frag_list.

The only thing to watch out for is to eliminate anything in the middle that tries to linearise the skb.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: ioremap_uc() followed by set_memory_wc() - burying MTRR
On Thu, Apr 16, 2015 at 01:58:16AM +0200, Luis R. Rodriguez wrote:
>
> An alternative... is to just ioremap_wc() the entire region, including
> MMIO registers, for these old devices. I see one ethernet driver that does
> this, myri10ge, and am curious how and why they ended up deciding this
> and if they have run into any issues. I wonder if this is a reasonable
> compromise for these 2 remaining corner cases.

For myri10ge, it's a performance thing. Descriptor rings are in NIC memory BAR0, not in host memory. Say, to send a packet, the driver writes the send descriptor to the ioremap'd NIC memory. It is a multi-word descriptor. So, to send it as one PCIe MWr transaction, the driver maps the whole BAR0 as WC and does "copy descriptor; wmb". Without WC, descriptors would end up as multiple 4B or 8B MWr packets to the NIC, which has a pretty big performance impact on this particular NIC.

Most registers that do not want WC are actually in BAR2, which is not mapped as WC. For registers that are in BAR0, we do "write to the register; wmb". If we want to wait till the NIC has seen the write, we do "write; wmb; read".

This approach has worked for this device for many years. I cannot say whether it works for other devices, though.
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Thu, 2015-04-16 at 12:20 +0800, Herbert Xu wrote:
> Eric Dumazet wrote:
> >
> > We already have netdev->gso_max_size and netdev->gso_max_segs
> > which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs
>
> It is quite dangerous to attempt tricks like this because a
> tc redirection or netfilter nat could change the destination
> device, rendering such hints incorrect.

Right, but we are talking about performance hints on a quite basic VM setup. Here the guest would use xen and this hint would apply.
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
Eric Dumazet wrote:
>
> We already have netdev->gso_max_size and netdev->gso_max_segs
> which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs

It is quite dangerous to attempt tricks like this because a tc redirection or netfilter nat could change the destination device, rendering such hints incorrect.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: tg3 NIC driver bug in 3.14.x under Xen [and 3 more messages]
On 4/15/2015 3:54 AM, Ian Jackson wrote:
> Prashant writes ("Re: tg3 NIC driver bug in 3.14.x under Xen [and 3 more messages]"):
>> I tried to reproduce the problem on a 32-bit 3.14.34 stable kernel
>> baremetal, with iommu=soft swiotlb=force, but no luck, no drops or
>> errors. I did not try with Xen 64-bit yet. Btw I need a PCIe analyzer
>> trace to confirm the problem. Is it feasible to capture at your end?
>
> In private correspondence with Prashant we have established that
> Prashant was using a different kernel configuration. Prashant provided
> me with their kernel and module binaries, which work in my environment.
>
> I have also established that I can reproduce the problem with 3.14.37
> (`iommu=soft swiotlb=force' baremetal => all rx wholly corrupted).
>
> The kernel config I was using is here:
> http://logs.test-lab.xenproject.org/osstest/logs/50387/build-i386-pvops/build/config
>
> As far as I am aware no-one at Broadcom has attempted a build and test
> using my kernel config.

Ian, using your config we are able to recreate the problem that you are seeing. The driver finds the RX data buffer to be all zero; with an analyzer trace we are seeing the chip is DMA'ing valid RX data buffer contents to the host, but once the driver tries to read this DMA area, it sees all zeros, which is the reason for the corruption. This is only for the RX data buffer; the RX descriptor and status block update DMA regions have valid contents.

This is unlikely to be a chip or driver issue, as the chip is doing the correct DMA but the corruption occurs before the driver reads it. We would request the iommu experts to take a look and suggest what can be done next. We are more than willing to try any changes in the driver; I have added a few more team members who will work with you if needed.

Thanks.
Re: ioremap_uc() followed by set_memory_wc() - burying MTRR
On Thu, 2015-04-16 at 01:58 +0200, Luis R. Rodriguez wrote:
> Hey Andy, thanks for your review; adding Hyong-Youb Kim for review of the
> full-range ioremap_wc() idea below.
>
> On Wed, Apr 15, 2015 at 06:38:51PM -0400, Andy Walls wrote:
> > Hi All,
> >
> > On Mon, 2015-04-13 at 19:49 +0200, Luis R. Rodriguez wrote:
> > > From the beginning it seems only framebuffer devices used MTRR/WC,
> > [snip]
> > > The ivtv device is a good example of the worst type of
> > > situations these days. So perhaps __arch_phys_wc_add() and an
> > > ioremap_ucminus() might be something more than transient unless
> > > hardware folks get a good memo or already know how to just Do The
> > > Right Thing (TM).
> >
> > Just to reiterate a subtle point, use of the ivtvfb is *optional*. A
> > user may or may not load it. When the user does load the ivtvfb driver,
> > the ivtv driver has already been initialized and may have functions of
> > the card already in use by userspace.
>
> I suspected this, and it's why I note that a rewrite to address a clean
> split with separate ioremaps seems rather difficult in this case.
>
> > Hopefully no one is trying to use the OSD as framebuffer and the video
> > decoder/output engine for video display at the same time.
>
> The worst-case concern I have is the implications of having overlapping
> ioremap() calls (as proposed in my last reply) for different memory types
> and having the different virtual memory addresses used by different parts
> of the driver. It's not clear to me what the hardware implications of this
> are.
>
> > But the video
> > decoder/output device nodes may already be open for performing ioctl()
> > functions, so unmapping the decoder IO space out from under them, when
> > loading the ivtvfb driver module, might not be a good thing.
>
> Using overlapping ioremap() calls with different memory types would address
> this concern provided hardware won't barf both on the device and CPU.
> Hardware folks could provide feedback, or an ivtvfb user could test the
> patch supplied on both non-PAT and PAT systems. Even so, who knows, this
> might work on some systems while not on others; only hardware folks would
> know.

The CX2341[56] firmware+hardware has a track record for being really picky about system hardware. Its primary symptoms are for the DMA engine or Mailbox protocol to get hung up. So yeah, it could barf easily on some users.

> An alternative... is to just ioremap_wc() the entire region, including
> MMIO registers, for these old devices.

That's my thought; as long as implementing PCI write-then-read can force writes to be posted, and setting that many pages as WC doesn't cause some sort of PAT resource exhaustion. (I know very little about PAT.)

> I see one ethernet driver that does
> this, myri10ge, and am curious how and why they ended up deciding this
> and if they have run into any issues. I wonder if this is a reasonable
> compromise for these 2 remaining corner cases.
>
>   Luis

Regards,
Andy
Re: [net-next 03/14] i40e: Add support to program FDir SB rules for VF from PF through ethtool
On Wed, 2015-04-15 at 11:58 +0300, Or Gerlitz wrote:
> On Wed, Apr 15, 2015 at 7:51 AM, Jeff Kirsher wrote:
> > From: Anjali Singhai Jain
> >
> > With this patch we can now add Flow Director sideband rules for a VF from
> > its PF. Here is an example of how it can be done when VF id = 5 and
> > queue = 2:
> >
> > "ethtool -N ethx flow-type udp4 src-ip x.x.x.x dst-ip y.y.y.y src-port p1 dst-port p2 action 2 user-def 5"
> >
> > User-def specifies the VF id and action specifies the queue.
>
> Hi Jeff,
>
> We're too eager to have support for PF-controlled ACL / over-ruling of VF
> traffic.
>
> I don't think the way to go here is just to use the user-defined
> "data" field of struct ethtool_flow_ext to represent a VF, since a VF
> is a well-defined, multiple-vendor-supported construct. At least on the
> command line it would be better to see "vf 5" and not "user-def 5", and
> maybe also, down in the user-kernel API, somehow enhance the ethtool
> flow-related structures such that the VF ID is specified?

Thanks for the feedback Or, I will work with Anjali and Jesse to get your suggestions implemented in follow-on patches.
Re: CodingStyle parenthesis alignment: was: Re: align to open paren
On Wed, 2015-04-15 at 21:54 +0000, Hubbe, Allen wrote:
> I ran with the --fix option, and it changed every rejected indent to
> match the column of the open paren. That's probably what you want,
> since it's the most consistent with the previous behavior. The
> difference is that it does not fix lines that are no longer rejected,
> like if they are indented by two per paren. The test file and output
> are attached.

There might be some utility in adding a --style= switch so checkpatch can appropriately warn on K&R vs linux vs bsd vs gnu vs otb, etc...
[v3] skbuff: Do not scrub skb mark within the same name space
On Wed, Apr 15, 2015 at 05:41:26PM +0200, Nicolas Dichtel wrote:
> On 15/04/2015 15:57, Herbert Xu wrote:
> > On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote:
> [snip]
> > Subject: skbuff: Do not scrub skb mark within the same name space
> >
> > The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels:
>
> Maybe add a Fixes tag?
> Fixes: ea23192e8e57 ("tunnels: harmonize cleanup done on skb on rx path")
>
> > harmonize cleanup done on skb on rx path") broke anyone trying to
> > use netfilter marking across IPv4 tunnels. While most of the
> > fields that are cleared by skb_scrub_packet don't matter, the
> > netfilter mark must be preserved.
> >
> > This patch rearranges skb_scurb_packet to preserve the mark field.
>
> nit: s/scurb/scrub
>
> Else it's fine for me.

Sure.

PS I used the wrong email for James the first time around. So let me repeat the question here. Should secmark be preserved or cleared across tunnels within the same name space? In fact, do our security models even support name spaces?

---8<---
The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels:
harmonize cleanup done on skb on rx path") broke anyone trying to
use netfilter marking across IPv4 tunnels. While most of the
fields that are cleared by skb_scrub_packet don't matter, the
netfilter mark must be preserved.

This patch rearranges skb_scrub_packet to preserve the mark field.
Fixes: ea23192e8e57 ("tunnels: harmonize cleanup done on skb on rx path")
Signed-off-by: Herbert Xu

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3b6e583..a185427 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4124,19 +4124,22 @@ EXPORT_SYMBOL(skb_try_coalesce);
  */
 void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 {
-	if (xnet)
-		skb_orphan(skb);
 	skb->tstamp.tv64 = 0;
 	skb->pkt_type = PACKET_HOST;
 	skb->skb_iif = 0;
 	skb->ignore_df = 0;
 	skb_dst_drop(skb);
-	skb->mark = 0;
 	skb_sender_cpu_clear(skb);
 	skb_init_secmark(skb);
 	secpath_reset(skb);
 	nf_reset(skb);
 	nf_reset_trace(skb);
+
+	if (!xnet)
+		return;
+
+	skb_orphan(skb);
+	skb->mark = 0;
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);

--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: ioremap_uc() followed by set_memory_wc() - burying MTRR
Hey Andy, thanks for your review, adding Hyong-Youb Kim for review of the full range ioremap_wc() idea below. On Wed, Apr 15, 2015 at 06:38:51PM -0400, Andy Walls wrote: > Hi All, > > On Mon, 2015-04-13 at 19:49 +0200, Luis R. Rodriguez wrote: > > From the beginning it seems only framebuffer devices used MTRR/WC, > [snip] > > The ivtv device is a good example of the worst type of > > situations and these days. So perhap __arch_phys_wc_add() and a > > ioremap_ucminus() might be something more than transient unless hardware > > folks > > get a good memo or already know how to just Do The Right Thing (TM). > > Just to reiterate a subtle point, use of the ivtvfb is *optional*. A > user may or may not load it. When the user does load the ivtvfb driver, > the ivtv driver has already been initialized and may have functions of > the card already in use by userspace. I suspected this and its why I note that a rewrite to address a clean split with separate ioremap seems rather difficult in this case. > Hopefully no one is trying to use the OSD as framebuffer and the video > decoder/output engine for video display at the same time. Worst case concern I have also is the implications of having overlapping ioremap() calls (as proposed in my last reply) for different memory types and having the different virtual memory addresse used by different parts of the driver. Its not clear to me what the hardware implications of this are. > But the video > decoder/output device nodes may already be open for performing ioctl() > functions so unmapping the decoder IO space out from under them, when > loading the ivtvfb driver module, might not be a good thing. Using overlapping ioremap() calls with different memory types would address this concern provided hardware won't barf both on the device and CPU. Hardware folks could provide feedback or an ivtvfb user could test the patch supplied on both non-PAT and PAT systems. 
Even so, who knows, this might work on some systems while not on others; only hardware folks would know.

An alternative... is to just ioremap_wc() the entire region, including MMIO registers, for these old devices. I see one ethernet driver that does this, myri10ge, and am curious how and why they ended up deciding this and whether they have run into any issues. I wonder if this is a reasonable compromise for these 2 remaining corner cases.

Luis
Re: make mandocs build failure with next-20150407
On Wed, Apr 8, 2015 at 7:41 AM, Jim Davis wrote:
> DOCPROC Documentation/DocBook/80211.xml
> Error(.//include/net/mac80211.h:329): Cannot parse enum!
> Error(.//include/net/mac80211.h:367): Cannot parse enum!
> Warning(.//include/net/mac80211.h:381): No description found for
> parameter 'type'
> --

This error started to appear in Linus's tree; make xmldocs stops after this error. Running it again lets make xmldocs process the rest of the files.

Adding To: netdev and the patch authors to report the issue.

Masanari

> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
[PATCH net] bpf: fix two bugs in verification logic when accessing 'ctx' pointer
1. The first bug is a silly mistake. It broke tracing examples and prevented simple bpf programs from loading. In the following code:

	if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
	} else if (...) {
		// this part should have been executed when
		// insn->code == BPF_W and insn->imm != 0
	}

Obviously it's not doing that. So simple instructions like:

	r2 = *(u64 *)(r1 + 8)

will be rejected. Note the comments in the code around these branches were and still are valid and indicate the true intent. Replace it with:

	if (BPF_SIZE(insn->code) != BPF_W)
		continue;
	if (insn->imm == 0) {
	} else if (...) {
		// now this code will be executed when
		// insn->code == BPF_W and insn->imm != 0
	}

2. The second bug is more subtle. If malicious code is using the same dest register as source register, the checks designed to prevent the same instruction from being used with different pointer types will fail to trigger, since we were assigning src_reg_type when it was already overwritten by check_mem_access(). The fix is trivial: just move the line

	src_reg_type = regs[insn->src_reg].type;

before check_mem_access(). Add a new 'access skb fields bad4' test to check this case.

Fixes: 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields")
Signed-off-by: Alexei Starovoitov
---
I would love to add a testcase for bug #1 as well, but it needs bigger refactoring of test_verifier, so will do it after net-next reopens.
 kernel/bpf/verifier.c       |  9 +++--
 samples/bpf/test_verifier.c | 22 ++
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 66bec36ec1ec..47dcd3aa6e23 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1637,6 +1637,8 @@ static int do_check(struct verifier_env *env)
 			if (err)
 				return err;
 
+			src_reg_type = regs[insn->src_reg].type;
+
 			/* check that memory (src_reg + off) is readable,
 			 * the state of dst_reg will be updated by this func
 			 */
@@ -1646,9 +1648,12 @@
 			if (err)
 				return err;
 
-			src_reg_type = regs[insn->src_reg].type;
+			if (BPF_SIZE(insn->code) != BPF_W) {
+				insn_idx++;
+				continue;
+			}
 
-			if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
+			if (insn->imm == 0) {
 				/* saw a valid insn
 				 * dst_reg = *(u32 *)(src_reg + off)
 				 * use reserved 'imm' field to mark this insn
diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
index 9ab645698ffb..12f3780af73f 100644
--- a/samples/bpf/test_verifier.c
+++ b/samples/bpf/test_verifier.c
@@ -721,6 +721,28 @@ static struct bpf_test tests[] = {
 		.errstr = "different pointers",
 		.result = REJECT,
 	},
+	{
+		"access skb fields bad4",
+		.insns = {
+			BPF_JMP_IMM(BPF_JGE, BPF_REG_1, 0, 3),
+			BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1,
+				    offsetof(struct __sk_buff, len)),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+			BPF_EXIT_INSN(),
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+			BPF_JMP_IMM(BPF_JA, 0, 0, -13),
+		},
+		.fixup = {7},
+		.errstr = "different pointers",
+		.result = REJECT,
+	},
 };
 
 static int probe_filter_length(struct bpf_insn *fp)
-- 
1.7.9.5
Re: [Question] TCP stack performance decrease since 3.14
On Wed, 2015-04-15 at 17:38 -0400, rapier wrote:
> I believe I properly disabled CPU power management in the bios (the
> lenovo bios isn't terribly clear on this). I then booted with
> processor.max_cstate=1 idle=poll (also tried with
> intel_idle.max_cstate=0 and combinations thereof). Still seeing reduced
> performance in comparison to 3.14. I'll try using /dev/cpu_dma_latency
> instead when I get in tomorrow. If you have other suggestions to verify
> c-state I'd be happy to hear them.
>
> As a note, 3.2 tests as being more than 18% faster in the above categories.

Wait a minute, are you testing tcp on a single laptop over loopback?

Multiple netperf consume a lot of ram, and any change in kernel vmlinux size can impact the performance simply because your cpu cache is too small to cope with the increase.

Between 3.2 and 4.0 kernel, TCP networking changes are maybe 2% of overall changes.

2.6.1 kernel in 32bit mode will be much faster you know. Maybe 40% faster.

# ls -l /boot/vmlinuz-*
-rwxr-xr-x 1 root root 3719696 2015-04-15 14:09 /boot/vmlinuz-3.14.0-smp
-rwxr-xr-x 1 root root 3911520 2015-04-15 14:30 /boot/vmlinuz-4.0.0-smp
RE: CodingStyle parenthesis alignment: was: Re: align to open paren
> -Original Message-
> From: Joe Perches [mailto:j...@perches.com]
> Sent: Wednesday, April 15, 2015 5:07 PM
> To: Hubbe, Allen
> Cc: a...@canonical.com; LKML; netdev
> Subject: CodingStyle parenthesis alignment: was: Re: align to open paren
>
> On Wed, 2015-04-15 at 20:34 +, Hubbe, Allen wrote:
> > Hello Andy, Joe,
>
> Hello Allen.
>
> As this is a discussion better suited for linux
> development lists I've cc'd LKML and netdev.
>
> > I would like to find the origin of the decision to align to the open
> > paren in Linux.
>
> Mostly it's a style decision shared by net/ and
> drivers/net/ and a few other directories.
>
> It's a checkpatch --strict only test so it's not on
> by default except in net/ and drivers/net/.

Thanks.

> > I found the commit where this was added a few years ago, d1fe9c0.
> > The email thread just says the style should be that way, but not why
> > or how the decision was made. The how and the why is what I'm after,
> > since I have a critique of the chosen style.
> >
> > I realize there is a lot of code written using this style, and
> > changing it would be disruptive. I certainly would not ask for that.
> >
> > Indenting to the open paren might cause ambiguous indentation between
> > the parenthesized expression and the next logical thing. In the
> > following, next_thing aligns to the same column as baz, even though
> > baz is part of the condition expression, and next_thing is the
> > continued statement.
> >
> > = if (foo(bar,
> > =         baz))
> > =         next_thing();
> >
> > It would be necessary to reindent to maintain style, even though the
> > code of the next lines is the same. This has consequences like
> > changing the blame, for instance. In the following, 4 + 5 is the bug,
> > but whoever replaced the global with an instance variable gets the
> > blame.
>
> blame is overrated.
> git blame -w ignores most of the whitespace noise.

> > = global_variable = foo(bar,
> > =                       baz(1 + 3),
> > =                       baz(4 + 5) + 6);
> >
> > with
> >
> > = obj->var = foo(bar,
> > =                baz(1 + 3),
> > =                baz(4 + 5) + 6);
> >
> > I'm used to the default in many editors, which is to indent twice
> > following each open paren, as opposed to once following each open
> > brace or continued statement. It is a simpler rule for automatic
> > formatting to implement. It also avoids mixing tabs and spaces, the
> > spacing is unambiguous, and to maintain style, there is no need to
> > re-indent lines following an edit if the position of the open paren
> > changes.
> >
> > It's interesting to me that this style is only enforced by
> > checkpatch.pl --strict. It is not in Documents/CodingStyle. That
> > being the case, would it be acceptable to relax the rule in
> > checkpatch.pl to accept either style? If not, well, I'll just accept
> > the chosen style and fix my code.
>
> I personally don't care much either way.
>
> > If the following looks acceptable to you, I'll submit the patch to the
> > list.
>
> The patch most likely wouldn't be appropriate for
> net/ and drivers/net/ where the developers seem to
> strongly prefer alignment to open parenthesis.

I don't want to annoy the net developers. Is it that the style is the same across the kernel, just more strongly enforced in net, or do different subsystem maintainers enforce different styles? Again, I'll be happy to just accept the kernel style if that's the case.

> Also I think the open parenthesis count isn't right
> as it would ask for multiple indents for code like:
>
> 	if ((foo(bar)) &&
> 	    (baz(bar))) {

As this is properly indented to the open paren, it is accepted.

> I think checkpatch would now want:
>
> 	if ((foo(bar)) &&
> 			(baz(bar))) {

CHECK: Alignment should match open parenthesis, or be twice indented for each open parenthesis
#51: FILE: foo.c:51:
+	if ((foo(bar)) &&
+	    (baz(bar))) {

I count three opens and two closes in the first line, leaving one open paren, but the second line is further indented by four instead of two. That is rejected, but it accepts this:

	if ((foo(bar)) &&
			(baz(bar))) {

> and the --fix option would be wrong too.

I ran with the --fix option, and it changed every rejected indent to match the column of the open paren. That's probably what you want, since it's the most consistent with the previous behavior. The difference is that it does not fix lines that are no longer rejected, like if they are indented by two per paren. The test file and output are attached.

> cheers, Joe
>
> > Thanks,
> > Allen
> >
> > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> > index d124359..8e49125 100755
> > --- a/scripts/checkpatch.pl
> > +++ b/scripts/checkpatch.pl
> > @@ -1834,6 +1834,15 @@ sub pos_last_openparen {
> > 	return length(expand_tabs(substr($lin
Re: [net-next 00/14][pull request] Intel Wired LAN Driver Updates 2015-04-14
From: Jeff Kirsher
Date: Tue, 14 Apr 2015 21:51:37 -0700

> This series contains updates to i40e and i40evf.

Pulled, thanks Jeff.
Re: [Question] TCP stack performance decrease since 3.14
On 4/15/15 5:01 PM, Eric Dumazet wrote:

On Wed, 2015-04-15 at 15:31 -0400, rapier wrote:

All,

First, my apologies if this came up previously but I couldn't find anything using a keyword search of the mailing list archive.

As part of the ongoing work with web10g I need to come up with baseline TCP stack performance for various kernel revisions. Using netperf and super_netperf* I've found that performance for TCP_CC, TCP_RR, and TCP_CRR has decreased since 3.14.

         3.14     3.18     4.0      decrease %
TCP_CC   183945   179222   175793   4.4%
TCP_RR   594495   585484   561365   5.6%
TCP_CRR  98677    96726    93026    5.7%

Stream tests have remained the same from 3.14 through 4.0.

All tests were conducted on the same platform from clean boot with stock kernels.

So my questions are:

Has anyone else seen this or is this a result of some weirdness on my system or artifact of my tests?

If others have seen this or is just simply to be expected (from new features and the like), is it due to the TCP stack itself or other changes in the kernel?

If so, is there any way to mitigate the effect of this via stack tuning, kernel configuration, etc?

Thanks!

Chris

* The above results are the average of 10 iterations of super_netperf for each test. I can run more iterations to verify the results but it seems consistent. The number of parallel processes for each test was tuned to produce the maximum test result. In other words, enough to push things but not enough to cause performance hits due to being cpu/memory/etc bound. If anyone wants the full results and test scripts just let me know.

--

Make sure you do not hit a c-state issue. I've seen improvements in the stack translate to longer wait times, and cpu takes longer to exit deep c-state.

I believe I properly disabled CPU power management in the bios (the lenovo bios isn't terribly clear on this). I then booted with processor.max_cstate=1 idle=poll (also tried with intel_idle.max_cstate=0 and combinations thereof). Still seeing reduced performance in comparison to 3.14.
I'll try using /dev/cpu_dma_latency instead when I get in tomorrow. If you have other suggestions to verify c-state I'd be happy to hear them.

As a note, 3.2 tests as being more than 18% faster in the above categories.

Chris
Re: [Question] TCP stack performance decrease since 3.14
On 04/15/2015 12:31 PM, rapier wrote:

All,

First, my apologies if this came up previously but I couldn't find anything using a keyword search of the mailing list archive.

As part of the ongoing work with web10g I need to come up with baseline TCP stack performance for various kernel revisions. Using netperf and super_netperf* I've found that performance for TCP_CC, TCP_RR, and TCP_CRR has decreased since 3.14.

         3.14     3.18     4.0      decrease %
TCP_CC   183945   179222   175793   4.4%
TCP_RR   594495   585484   561365   5.6%
TCP_CRR  98677    96726    93026    5.7%

Stream tests have remained the same from 3.14 through 4.0.

Have the service demands (usec of CPU consumed per KB) remained the same on the stream tests? Even then, stateless offloads can help hide a multitude of path-length sins.

All tests were conducted on the same platform from clean boot with stock kernels.

So my questions are:

Has anyone else seen this or is this a result of some weirdness on my system or artifact of my tests?

I've wondered if such a thing might be taking place but never had a chance to check. One thing you might consider is "perf" profiling to see how the CPU consumption break-down has changed.

happy benchmarking,

rick jones

If others have seen this or is just simply to be expected (from new features and the like) is it due to the TCP stack itself or other changes in the kernel? If so, is there any way to mitigate the effect of this via stack tuning, kernel configuration, etc?

Thanks!

Chris

* The above results are the average of 10 iterations of super_netperf for each test. I can run more iterations to verify the results but it seems consistent. The number of parallel processes for each test was tuned to produce the maximum test result. In other words, enough to push things but not enough to cause performance hits due to being cpu/memory/etc bound. If anyone wants the full results and test scripts just let me know.
Re: [PATCH v2 net] bnx2x: Fix busy_poll vs netpoll
From: Eric Dumazet
Date: Tue, 14 Apr 2015 18:45:00 -0700

> From: Eric Dumazet
>
> Commit 9a2620c877454 ("bnx2x: prevent WARN during driver unload")
> switched the napi/busy_lock locking mechanism from spin_lock() into
> spin_lock_bh(), breaking inter-operability with netconsole, as netpoll
> disables interrupts prior to calling our napi mechanism.
>
> This switches the driver into using atomic assignments instead of the
> spinlock mechanisms previously employed.
>
> Based on initial patch from Yuval Mintz & Ariel Elior
>
> I basically added softirq starvation avoidance, and mixture
> of atomic operations, plain writes and barriers.
>
> Note this slightly reduces the overhead for this driver when no
> busy_poll sockets are in use.
>
> Fixes: 9a2620c877454 ("bnx2x: prevent WARN during driver unload")
> Signed-off-by: Eric Dumazet

Applied and queued up for -stable, thanks Eric.
Re: [patch v2] net: hip04: Make tx coalesce timer actually work
From: Thomas Gleixner
Date: Tue, 14 Apr 2015 21:42:42 +0200 (CEST)

> The code sets the expiry value of the timer to a relative value and
> starts it with hrtimer_start_expires. That's fine, but that only works
> once. The timer is started in relative mode, so the expiry value gets
> overwritten with the absolute expiry time (now + expiry).
>
> So once the timer expired, a new call to hrtimer_start_expires results
> in an immediately expired timer, because the expiry value is
> already in the past.
>
> Use the proper mechanisms to (re)start the timer in the intended way.
>
> Signed-off-by: Thomas Gleixner

Applied, thanks Thomas.
Re: [RFC PATCH] bgmac: Fix build error seen if BCM47XX is not configured
From: Rafał Miłecki
Date: Wed, 15 Apr 2015 22:21:49 +0200

> I guess the decision depends on
> a) time needed for David to revert fc300dc & send pull request
> vs.
> b) time needed for Ralf to send pull request

Let's just wait for Ralf's stuff to hit Linus's tree.
CodingStyle parenthesis alignment: was: Re: align to open paren
On Wed, 2015-04-15 at 20:34 +, Hubbe, Allen wrote:
> Hello Andy, Joe,

Hello Allen.

As this is a discussion better suited for linux development lists I've cc'd LKML and netdev.

> I would like to find the origin of the decision to align to the open
> paren in Linux.

Mostly it's a style decision shared by net/ and drivers/net/ and a few other directories.

It's a checkpatch --strict only test so it's not on by default except in net/ and drivers/net/.

> I found the commit where this was added a few years ago, d1fe9c0.
> The email thread just says the style should be that way, but not why
> or how the decision was made. The how and the why is what I'm after,
> since I have a critique of the chosen style.
>
> I realize there is a lot of code written using this style, and
> changing it would be disruptive. I certainly would not ask for that.
>
> Indenting to the open paren might cause ambiguous indentation between
> the parenthesized expression and the next logical thing. In the
> following, next_thing aligns to the same column as baz, even though
> baz is part of the condition expression, and next_thing is the
> continued statement.
>
> = if (foo(bar,
> =         baz))
> =         next_thing();
>
> It would be necessary to reindent to maintain style, even though the
> code of the next lines is the same. This has consequences like
> changing the blame, for instance. In the following, 4 + 5 is the bug,
> but whoever replaced the global with an instance variable gets the
> blame.

blame is overrated. git blame -w ignores most of the whitespace noise.

> = global_variable = foo(bar,
> =                       baz(1 + 3),
> =                       baz(4 + 5) + 6);
>
> with
>
> = obj->var = foo(bar,
> =                baz(1 + 3),
> =                baz(4 + 5) + 6);
>
> I'm used to the default in many editors, which is to indent twice
> following each open paren, as opposed to once following each open
> brace or continued statement. It is a simpler rule for automatic
> formatting to implement. It also avoids mixing tabs and spaces, the
> spacing is unambiguous, and to maintain style, there is no need to
> re-indent lines following an edit if the position of the open paren
> changes.
>
> It's interesting to me that this style is only enforced by
> checkpatch.pl --strict. It is not in Documents/CodingStyle. That
> being the case, would it be acceptable to relax the rule in
> checkpatch.pl to accept either style? If not, well, I'll just accept
> the chosen style and fix my code.

I personally don't care much either way.

> If the following looks acceptable to you, I'll submit the patch to the
> list.

The patch most likely wouldn't be appropriate for net/ and drivers/net/ where the developers seem to strongly prefer alignment to open parenthesis.

Also I think the open parenthesis count isn't right as it would ask for multiple indents for code like:

	if ((foo(bar)) &&
	    (baz(bar))) {

I think checkpatch would now want:

	if ((foo(bar)) &&
			(baz(bar))) {

and the --fix option would be wrong too.

cheers, Joe

> Thanks,
> Allen
>
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index d124359..8e49125 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -1834,6 +1834,15 @@ sub pos_last_openparen {
>  	return length(expand_tabs(substr($line, 0, $last_openparen))) + 1;
>  }
>
> +sub count_openparen {
> +	my ($line) = @_;
> +
> +	my $opens = $line =~ tr/\(/\(/;
> +	my $closes = $line =~ tr/\)/\)/;
> +
> +	return $opens - $closes;
> +}
> +
>  sub process {
>  	my $filename = shift;
>
> @@ -2539,11 +2548,16 @@ sub process {
>  						" " x ($pos % 8);
>  					my $goodspaceindent = $oldindent . " " x $pos;
>
> +					my $goodtwotabindent = $oldindent .
> +						"\t\t" x count_openparen($rest);
> +
>  					if ($newindent ne $goodtabindent &&
> -					    $newindent ne $goodspaceindent) {
> +					    $newindent ne $goodspaceindent &&
> +					    $newindent ne $goodtwotabindent) {
>
>  						if (CHK("PARENTHESIS_ALIGNMENT",
> -							"Alignment should match open parenthesis\n" . $hereprev) &&
> +							"Alignment should match open parenthesis, " .
> +							"or be twice indented for each open parenthesis\n" . $hereprev) &&
>  							$fix && $line =~ /^\+/) {
>  							$fixed[$fixlinenr] =~ s/^\+[ \t]*/\+$goodtabindent/;
Re: [Question] TCP stack performance decrease since 3.14
On Wed, 2015-04-15 at 15:31 -0400, rapier wrote:
> All,
>
> First, my apologies if this came up previously but I couldn't find
> anything using a keyword search of the mailing list archive.
>
> As part of the ongoing work with web10g I need to come up with baseline
> TCP stack performance for various kernel revisions. Using netperf and
> super_netperf* I've found that performance for TCP_CC, TCP_RR, and
> TCP_CRR has decreased since 3.14.
>
>          3.14     3.18     4.0      decrease %
> TCP_CC   183945   179222   175793   4.4%
> TCP_RR   594495   585484   561365   5.6%
> TCP_CRR  98677    96726    93026    5.7%
>
> Stream tests have remained the same from 3.14 through 4.0.
>
> All tests were conducted on the same platform from clean boot with stock
> kernels.
>
> So my questions are:
>
> Has anyone else seen this or is this a result of some weirdness on my
> system or artifact of my tests?
>
> If others have seen this or is just simply to be expected (from new
> features and the like) is it due to the TCP stack itself or other
> changes in the kernel?
>
> If so, is there any way to mitigate the effect of this via stack tuning,
> kernel configuration, etc?
>
> Thanks!
>
> Chris
>
> * The above results are the average of 10 iterations of super_netperf
> for each test. I can run more iterations to verify the results but it
> seems consistent. The number of parallel processes for each test was
> tuned to produce the maximum test result. In other words, enough to push
> things but not enough to cause performance hits due to being
> cpu/memory/etc bound. If anyone wants the full results and test scripts
> just let me know.

Make sure you do not hit a c-state issue.

I've seen improvements in the stack translate to longer wait times, and cpu takes longer to exit deep c-state.
[PATCH net 2/5] stmmac: Add defines and documentation for enabling flow control
Add defines and documentation for enabling flow control on the stmmac. Flow control was not implemented correctly on the stmmac driver and is currently non-functional as a result. This is the first in a series of small patches to correctly implement this feature.

Signed-off-by: Vince Bridgers
---
 drivers/net/ethernet/stmicro/stmmac/dwmac1000.h | 51 +
 1 file changed, 51 insertions(+)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
index 64d8f56..b3fe057 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
@@ -172,6 +172,7 @@ enum inter_frame_gap {
 /* GMAC FLOW CTRL defines */
 #define GMAC_FLOW_CTRL_PT_MASK	0x	/* Pause Time Mask */
 #define GMAC_FLOW_CTRL_PT_SHIFT	16
+#define GMAC_FLOW_CTRL_UP	0x0008	/* Unicast pause frame enable */
 #define GMAC_FLOW_CTRL_RFE	0x0004	/* Rx Flow Control Enable */
 #define GMAC_FLOW_CTRL_TFE	0x0002	/* Tx Flow Control Enable */
 #define GMAC_FLOW_CTRL_FCB_BPA	0x0001	/* Flow Control Busy ... */
@@ -246,6 +247,56 @@ enum ttc_control {
 #define DMA_CONTROL_FEF	0x0080
 #define DMA_CONTROL_FUF	0x0040
 
+/* Receive flow control activation field
+ * RFA field in DMA control register, bits 23,10:9
+ */
+#define DMA_CONTROL_RFA_MASK	0x00800600
+
+/* Receive flow control deactivation field
+ * RFD field in DMA control register, bits 22,12:11
+ */
+#define DMA_CONTROL_RFD_MASK	0x00401800
+
+/* RFD and RFA fields are encoded as follows
+ *
+ *   Bit Field
+ *   0,00 - Full minus 1KB (only valid when rxfifo >= 4KB and EFC enabled)
+ *   0,01 - Full minus 2KB (only valid when rxfifo >= 4KB and EFC enabled)
+ *   0,10 - Full minus 3KB (only valid when rxfifo >= 4KB and EFC enabled)
+ *   0,11 - Full minus 4KB (only valid when rxfifo > 4KB and EFC enabled)
+ *   1,00 - Full minus 5KB (only valid when rxfifo > 8KB and EFC enabled)
+ *   1,01 - Full minus 6KB (only valid when rxfifo > 8KB and EFC enabled)
+ *   1,10 - Full minus 7KB (only valid when rxfifo > 8KB and EFC enabled)
+ *   1,11 - Reserved
+ *
+ * RFD should always be > RFA for a given FIFO size. RFD == RFA may work,
+ * but packet throughput performance may not be as expected.
+ *
+ * Be sure that bit 3 in GMAC Register 6 is set for Unicast Pause frame
+ * detection (IEEE Specification Requirement, Annex 31B, 31B.1, Pause
+ * Description).
+ *
+ * Be sure that DZPA (bit 7 in Flow Control Register, GMAC Register 6),
+ * is set to 0. This allows pause frames with a quanta of 0 to be sent
+ * as an XOFF message to the link peer.
+ */
+
+#define RFA_FULL_MINUS_1K	0x
+#define RFA_FULL_MINUS_2K	0x0200
+#define RFA_FULL_MINUS_3K	0x0400
+#define RFA_FULL_MINUS_4K	0x0600
+#define RFA_FULL_MINUS_5K	0x0080
+#define RFA_FULL_MINUS_6K	0x00800200
+#define RFA_FULL_MINUS_7K	0x00800400
+
+#define RFD_FULL_MINUS_1K	0x
+#define RFD_FULL_MINUS_2K	0x0800
+#define RFD_FULL_MINUS_3K	0x1000
+#define RFD_FULL_MINUS_4K	0x1800
+#define RFD_FULL_MINUS_5K	0x0040
+#define RFD_FULL_MINUS_6K	0x00400800
+#define RFD_FULL_MINUS_7K	0x00401000
+
 enum rtc_control {
 	DMA_CONTROL_RTC_64	= 0x,
 	DMA_CONTROL_RTC_32	= 0x0008,
-- 
1.9.1
[PATCH net 1/5] stmmac: Add properties for transmit and receive fifo sizes
The Synopsys stmmac fifo sizes are configurable, and need to be known in order to configure certain controller features. This patch adds tx-fifo-depth and rx-fifo-depth properties to the stmmac document file.

Signed-off-by: Vince Bridgers
---
 Documentation/devicetree/bindings/net/ethernet.txt | 6 ++
 Documentation/devicetree/bindings/net/stmmac.txt   | 4
 2 files changed, 10 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/ethernet.txt b/Documentation/devicetree/bindings/net/ethernet.txt
index 3fc3605..41b3f3f 100644
--- a/Documentation/devicetree/bindings/net/ethernet.txt
+++ b/Documentation/devicetree/bindings/net/ethernet.txt
@@ -19,6 +19,12 @@ The following properties are common to the Ethernet controllers:
 - phy: the same as "phy-handle" property, not recommended for new bindings.
 - phy-device: the same as "phy-handle" property, not recommended for new
   bindings.
+- rx-fifo-depth: the size of the controller's receive fifo in bytes. This
+  is used for components that can have configurable receive fifo sizes,
+  and is useful for determining certain configuration settings such as
+  flow control thresholds.
+- tx-fifo-depth: the size of the controller's transmit fifo in bytes. This
+  is used for components that can have configurable fifo sizes.
 
 Child nodes of the Ethernet controller are typically the individual PHY devices
 connected via the MDIO bus (sometimes the MDIO bus controller is separate).
diff --git a/Documentation/devicetree/bindings/net/stmmac.txt b/Documentation/devicetree/bindings/net/stmmac.txt
index 8ca65ce..524c185 100644
--- a/Documentation/devicetree/bindings/net/stmmac.txt
+++ b/Documentation/devicetree/bindings/net/stmmac.txt
@@ -44,6 +44,8 @@ Optional properties:
   If not passed then the system clock will be used and this is fine on some
   platforms.
 - snps,burst_len: The AXI burst lenth value of the AXI BUS MODE register.
+- tx-fifo-depth: See ethernet.txt file in the same directory
+- rx-fifo-depth: See ethernet.txt file in the same directory
 
 Examples:
 
@@ -58,6 +60,8 @@ Examples:
 		phy-mode = "gmii";
 		snps,multicast-filter-bins = <256>;
 		snps,perfect-filter-entries = <128>;
+		rx-fifo-depth = <16384>;
+		tx-fifo-depth = <16384>;
 		clocks = <&clock>;
 		clock-names = "stmmaceth";
 	};
-- 
1.9.1
[PATCH net 4/5] stmmac: Enable unicast pause frame detect in GMAC Register 6
Unicast pause frame detect was not being enabled for the Synopsys stmmac. This patch sets Unicast pause frame detect in MAC register 6 so that pause frame detection by the stmmac conforms to IEEE 802.3, Annex 31B.3.3 Receive Operation - specifically, a MAC shall respond to pause frames containing either the reserved multicast address or the unique physical address associated with this station.

Signed-off-by: Vince Bridgers
---
 drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
index 0adcf73..371a669 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
@@ -201,7 +201,10 @@ static void dwmac1000_flow_ctrl(struct mac_device_info *hw, unsigned int duplex,
 				unsigned int fc, unsigned int pause_time)
 {
 	void __iomem *ioaddr = hw->pcsr;
-	unsigned int flow = 0;
+	/* Set flow such that DZPQ in Mac Register 6 is 0,
+	 * and unicast pause detect is enabled.
+	 */
+	unsigned int flow = GMAC_FLOW_CTRL_UP;
 
 	pr_debug("GMAC Flow-Control:\n");
 	if (fc & FLOW_RX) {
-- 
1.9.1
Re: [RFC PATCH] bgmac: Fix build error seen if BCM47XX is not configured
On Wed, Apr 15, 2015 at 10:21:49PM +0200, Rafał Miłecki wrote: > On 15 April 2015 at 22:05, Guenter Roeck wrote: > > arm:allmodconfig fails to build as follows since ARCH_BCM_5301X > > is configured but not BCM47XX. > > > > drivers/net/ethernet/broadcom/bgmac.c: In function 'bgmac_probe': > > drivers/net/ethernet/broadcom/bgmac.c:1643:2: error: > > implicit declaration of function 'bcm47xx_nvram_getenv' > > > > Fixes: fc300dc3733f ("bgmac: allow enabling on ARCH_BCM_5301X") > > Cc: Rafał Miłecki > > Signed-off-by: Guenter Roeck > > --- > > Seen in today's upstream kernel. > > > > I don't like this fix too much (I think it is quite kludgy), > > so I marked it RFC (and please don't beat the messenger ;-). > > Ooh great, I totally forgot about this :| > > The problem is that fc300dc (bgmac: allow enabling on ARCH_BCM_5301X) > [0] shouldn't really be sent (by me) in the first place. This is > because it depends on 138173d (MIPS: BCM47xx: Move NVRAM header to the > include/linux/.) [1] which isn't in Linus's tree yet. > > So there are two solutions: > 1) Revert fc300dc, wait for 138173d and re-apply fc300dc > 2) Wait for 138173d with this build breakage > > I guess the decisions depends on > a) time needed for David to revert fc300dc & send pull request > vs. > b) time needed for Ralf to send pull request > > David, Ralf, what do you think about this? > Since the fix is in the queue, maybe we can live with the breakage for a few days ? Your solution is definitely _much_ better than my hack. Guenter -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 3/5] stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree
Read the tx-fifo-depth and rx-fifo-depth from the devicetree. The Synopsys stmmac controller fifos are configurable per product instance, and the fifo sizes are needed to configure certain features correctly such as flow control. Signed-off-by: Vince Bridgers --- drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 4 include/linux/stmmac.h| 2 ++ 2 files changed, 6 insertions(+) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c index f9b42f1..705bbdf 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c @@ -181,6 +181,10 @@ static int stmmac_probe_config_dt(struct platform_device *pdev, sizeof(struct stmmac_mdio_bus_data), GFP_KERNEL); + of_property_read_u32(np, "tx-fifo-depth", &plat->tx_fifo_size); + + of_property_read_u32(np, "rx-fifo-depth", &plat->rx_fifo_size); + plat->force_sf_dma_mode = of_property_read_bool(np, "snps,force_sf_dma_mode"); diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h index cd63851..7f484a2 100644 --- a/include/linux/stmmac.h +++ b/include/linux/stmmac.h @@ -114,6 +114,8 @@ struct plat_stmmacenet_data { int maxmtu; int multicast_filter_bins; int unicast_filter_entries; + int tx_fifo_size; + int rx_fifo_size; void (*fix_mac_speed)(void *priv, unsigned int speed); void (*bus_setup)(void __iomem *ioaddr); void *(*setup)(struct platform_device *pdev); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 5/5] stmmac: Configure Flow Control to work correctly based on rxfifo size
Configure flow control correctly, and based on the receive fifo size read as a property from the devicetree since the Synopsys stmmac fifo sizes are configurable based on a particular chip's implementation. This patch maintains the previous incorrect behavior unless the receive fifo size is found in the devicetree. Signed-off-by: Vince Bridgers --- drivers/net/ethernet/stmicro/stmmac/common.h | 5 +++-- .../net/ethernet/stmicro/stmmac/dwmac1000_dma.c| 26 +- drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c | 2 +- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 16 - 4 files changed, 40 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h index cd77289..623c6ed 100644 --- a/drivers/net/ethernet/stmicro/stmmac/common.h +++ b/drivers/net/ethernet/stmicro/stmmac/common.h @@ -150,7 +150,7 @@ struct stmmac_extra_stats { #defineMAC_CSR_H_FRQ_MASK 0x20 #define HASH_TABLE_SIZE 64 -#define PAUSE_TIME 0x200 +#define PAUSE_TIME 0x /* Flow Control defines */ #define FLOW_OFF 0 @@ -357,7 +357,8 @@ struct stmmac_dma_ops { void (*dump_regs) (void __iomem *ioaddr); /* Set tx/rx threshold in the csr6 register * An invalid value enables the store-and-forward mode */ - void (*dma_mode) (void __iomem *ioaddr, int txmode, int rxmode); + void (*dma_mode)(void __iomem *ioaddr, int txmode, int rxmode, +int rxfifosz); /* To track extra statistic (if supported) */ void (*dma_diagnostic_fr) (void *data, struct stmmac_extra_stats *x, void __iomem *ioaddr); diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c index 59d92e8..0e8937c 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c @@ -106,8 +106,29 @@ static int dwmac1000_dma_init(void __iomem *ioaddr, int pbl, int fb, int mb, return 0; } +static u32 dwmac1000_configure_fc(u32 csr6, int rxfifosz) +{ + csr6 &= 
~DMA_CONTROL_RFA_MASK; + csr6 &= ~DMA_CONTROL_RFD_MASK; + + /* Leave flow control disabled if receive fifo size is less than +* 4K or 0. Otherwise, send XOFF when fifo is 1K less than full, +* and send XON when 2K less than full. +*/ + if (rxfifosz < 4096) { + csr6 &= ~DMA_CONTROL_EFC; + pr_debug("GMAC: disabling flow control, rxfifo too small(%d)\n", +rxfifosz); + } else { + csr6 |= DMA_CONTROL_EFC; + csr6 |= RFA_FULL_MINUS_1K; + csr6 |= RFD_FULL_MINUS_2K; + } + return csr6; +} + static void dwmac1000_dma_operation_mode(void __iomem *ioaddr, int txmode, -int rxmode) +int rxmode, int rxfifosz) { u32 csr6 = readl(ioaddr + DMA_CONTROL); @@ -153,6 +174,9 @@ static void dwmac1000_dma_operation_mode(void __iomem *ioaddr, int txmode, csr6 |= DMA_CONTROL_RTC_128; } + /* Configure flow control based on rx fifo size */ + csr6 = dwmac1000_configure_fc(csr6, rxfifosz); + writel(csr6, ioaddr + DMA_CONTROL); } diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c index 7d1dce9..9d0971c 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c @@ -72,7 +72,7 @@ static int dwmac100_dma_init(void __iomem *ioaddr, int pbl, int fb, int mb, * control register. 
*/ static void dwmac100_dma_operation_mode(void __iomem *ioaddr, int txmode, - int rxmode) + int rxmode, int rxfifosz) { u32 csr6 = readl(ioaddr + DMA_CONTROL); diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index a0ea84f..73d104f 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -1277,8 +1277,10 @@ static void free_dma_desc_resources(struct stmmac_priv *priv) */ static void stmmac_dma_operation_mode(struct stmmac_priv *priv) { + int rxfifosz = priv->plat->rx_fifo_size; + if (priv->plat->force_thresh_dma_mode) - priv->hw->dma->dma_mode(priv->ioaddr, tc, tc); + priv->hw->dma->dma_mode(priv->ioaddr, tc, tc, rxfifosz); else if (priv->plat->force_sf_dma_mode || priv->plat->tx_coe) { /* * In case of GMAC, SF mode can be enabled @@ -1287,10 +1289,12 @@ static void stmmac_dma_operation_mode(struct stmmac_priv *priv) * 2) There is no bugged Jumbo frame support
[PATCH net 0/5] stmmac: Correct flow control configuration
This series of patches corrects flow control configuration for the Synopsys GMAC driver. Flow control is configured based on a configurable receive fifo size. If less than 4Kbytes flow control is left disabled and a warning is presented. If a receive fifo size is not specified, flow control is left disabled to maintain current behavior. Unicast pause detection was disabled, but is now enabled. The pause time was modified to be maximum time per a XON/XOFF flow control mode of operation. This patch was tested on an Altera Cyclone 5 and an Altera Arria 10 devkit, and verified that flow control operates as expected when enabled. Please consider this series for inclusion so that flow control will function as expected for the Synopsys GMAC controller. Vince Bridgers (5): stmmac: Add properties for transmit and receive fifo sizes stmmac: Add defines and documentation for enabling flow control stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree stmmac: Enable unicast pause frame detect in GMAC Register 6 stmmac: Configure Flow Control to work correctly based on rxfifo size Documentation/devicetree/bindings/net/ethernet.txt | 6 +++ Documentation/devicetree/bindings/net/stmmac.txt | 4 ++ drivers/net/ethernet/stmicro/stmmac/common.h | 5 ++- drivers/net/ethernet/stmicro/stmmac/dwmac1000.h| 51 ++ .../net/ethernet/stmicro/stmmac/dwmac1000_core.c | 5 ++- .../net/ethernet/stmicro/stmmac/dwmac1000_dma.c| 26 ++- drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c | 2 +- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 16 --- .../net/ethernet/stmicro/stmmac/stmmac_platform.c | 4 ++ include/linux/stmmac.h | 2 + 10 files changed, 111 insertions(+), 10 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Question] TCP stack performance decrease since 3.14
All,

First, my apologies if this came up previously, but I couldn't find anything using a keyword search of the mailing list archive.

As part of the ongoing work with web10g I need to come up with baseline TCP stack performance for various kernel revisions. Using netperf and super_netperf* I've found that performance for TCP_CC, TCP_RR, and TCP_CRR has decreased since 3.14.

          3.14     3.18     4.0      decrease %
TCP_CC    183945   179222   175793   4.4%
TCP_RR    594495   585484   561365   5.6%
TCP_CRR   98677    96726    93026    5.7%

Stream tests have remained the same from 3.14 through 4.0. All tests were conducted on the same platform from clean boot with stock kernels.

So my questions are: Has anyone else seen this, or is this a result of some weirdness on my system or an artifact of my tests? If others have seen this, or it is simply to be expected (from new features and the like), is it due to the TCP stack itself or other changes in the kernel? If so, is there any way to mitigate the effect of this via stack tuning, kernel configuration, etc?

Thanks!

Chris

* The above results are the average of 10 iterations of super_netperf for each test. I can run more iterations to verify the results but it seems consistent. The number of parallel processes for each test was tuned to produce the maximum test result. In other words, enough to push things but not enough to cause performance hits due to being cpu/memory/etc bound. If anyone wants the full results and test scripts just let me know.
Re: [RFC PATCH] bgmac: Fix build error seen if BCM47XX is not configured
On 15 April 2015 at 22:05, Guenter Roeck wrote: > arm:allmodconfig fails to build as follows since ARCH_BCM_5301X > is configured but not BCM47XX. > > drivers/net/ethernet/broadcom/bgmac.c: In function 'bgmac_probe': > drivers/net/ethernet/broadcom/bgmac.c:1643:2: error: > implicit declaration of function 'bcm47xx_nvram_getenv' > > Fixes: fc300dc3733f ("bgmac: allow enabling on ARCH_BCM_5301X") > Cc: Rafał Miłecki > Signed-off-by: Guenter Roeck > --- > Seen in today's upstream kernel. > > I don't like this fix too much (I think it is quite kludgy), > so I marked it RFC (and please don't beat the messenger ;-). Ooh great, I totally forgot about this :| The problem is that fc300dc (bgmac: allow enabling on ARCH_BCM_5301X) [0] shouldn't really be sent (by me) in the first place. This is because it depends on 138173d (MIPS: BCM47xx: Move NVRAM header to the include/linux/.) [1] which isn't in Linus's tree yet. So there are two solutions: 1) Revert fc300dc, wait for 138173d and re-apply fc300dc 2) Wait for 138173d with this build breakage I guess the decisions depends on a) time needed for David to revert fc300dc & send pull request vs. b) time needed for Ralf to send pull request David, Ralf, what do you think about this? Sorry for causing this problem :| [0] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=fc300dc3733fdc328e6e10c7b8379b60c26cd648 [1] http://git.linux-mips.org/cgit/ralf/upstream-sfr.git/commit/?id=138173d4e826587da66c7d321da1a91283222536 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 11:32 AM, Eric Dumazet wrote: On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote: Well, I'm not sure that it is George and Jonathan themselves who don't want to change a sysctl, but the customers who would have to tweak that in their VMs? Keep in mind some VM users install custom qdisc, or even custom TCP sysctls. That could very well be, though I confess I've not seen that happening in my little corner of the cloud. They tend to want to launch the VM and go. Some of the more advanced/sophisticated ones might tweak a few things but my (admittedly limited) experience has been they are few in number. They just expect it to work "out of the box" (to the extent one can use that phrase still). It's kind of ironic - go back to the (early) 1990s when NICs generated a completion interrupt for every individual tx completion (and incoming packet) and all everyone wanted to do was coalesce/avoid interrupts. I guess that has gone rather far. And today to fight bufferbloat TCP gets tweaked to favor quick tx completions. Call it cycles, or pendulums or whatever I guess. I wonder just how consistent tx completion timings are for a VM so a virtio_net or whatnot in the VM can pick a per-device setting to advertise to TCP? Hopefully, full NIC emulation is no longer a thing and VMs "universally" use a virtual NIC interface. At least in my little corner of the cloud, emulated NICs are gone, and good riddance. rick -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH] bgmac: Fix build error seen if BCM47XX is not configured
arm:allmodconfig fails to build as follows since ARCH_BCM_5301X is configured but not BCM47XX. drivers/net/ethernet/broadcom/bgmac.c: In function 'bgmac_probe': drivers/net/ethernet/broadcom/bgmac.c:1643:2: error: implicit declaration of function 'bcm47xx_nvram_getenv' Fixes: fc300dc3733f ("bgmac: allow enabling on ARCH_BCM_5301X") Cc: Rafał Miłecki Signed-off-by: Guenter Roeck --- Seen in today's upstream kernel. I don't like this fix too much (I think it is quite kludgy), so I marked it RFC (and please don't beat the messenger ;-). drivers/net/ethernet/broadcom/bgmac.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c index 5cb93d1f50a4..dcf27a7c1836 100644 --- a/drivers/net/ethernet/broadcom/bgmac.c +++ b/drivers/net/ethernet/broadcom/bgmac.c @@ -17,8 +17,21 @@ #include #include #include + +#ifdef CONFIG_BCM47XX #include +static int __bcm47xx_nvram_getenv(const char *name, char *val, size_t len) +{ + return bcm47xx_nvram_getenv(name, val, len); +} +#else +static int __bcm47xx_nvram_getenv(const char *name, char *val, size_t len) +{ + return -ENOENT; +} +#endif + static const struct bcma_device_id bgmac_bcma_tbl[] = { BCMA_CORE(BCMA_MANUF_BCM, BCMA_CORE_4706_MAC_GBIT, BCMA_ANY_REV, BCMA_ANY_CLASS), BCMA_CORE(BCMA_MANUF_BCM, BCMA_CORE_MAC_GBIT, BCMA_ANY_REV, BCMA_ANY_CLASS), @@ -1070,7 +1083,7 @@ static void bgmac_chip_reset(struct bgmac *bgmac) BGMAC_CHIPCTL_1_IF_TYPE_MII; char buf[4]; - if (bcm47xx_nvram_getenv("et_swtype", buf, sizeof(buf)) > 0) { + if (__bcm47xx_nvram_getenv("et_swtype", buf, sizeof(buf)) > 0) { if (kstrtou8(buf, 0, &et_swtype)) bgmac_err(bgmac, "Failed to parse et_swtype (%s)\n", buf); @@ -1635,7 +1648,7 @@ static int bgmac_probe(struct bcma_device *core) } bgmac->int_mask = BGMAC_IS_ERRMASK | BGMAC_IS_RX | BGMAC_IS_TX_MASK; - if (bcm47xx_nvram_getenv("et0_no_txint", NULL, 0) == 0) + if (__bcm47xx_nvram_getenv("et0_no_txint", NULL, 0) == 0) 
bgmac->int_mask &= ~BGMAC_IS_TX_MASK; /* TODO: reset the external phy. Specs are needed */ -- 2.1.0
[PATCH net] bpf: fix bpf helpers to use skb->mac_header relative offsets
For the short-term solution, lets fix bpf helper functions to use skb->mac_header relative offsets instead of skb->data in order to get the same eBPF programs with cls_bpf and act_bpf work on ingress and egress qdisc path. We need to ensure that mac_header is set before calling into programs. This is effectively the first option from below referenced discussion. More long term solution for LD_ABS|LD_IND instructions will be more intrusive but also more beneficial than this, and implemented later as it's too risky at this point in time. I.e., we plan to look into the option of moving skb_pull() out of eth_type_trans() and into netif_receive_skb() as has been suggested as second option. Meanwhile, this solution ensures ingress can be used with eBPF, too, and that we won't run into ABI troubles later. For dealing with negative offsets inside eBPF helper functions, we've implemented bpf_skb_clone_unwritable() to test for unwriteable headers. Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694 Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action") Fixes: 91bc4822c3d6 ("tc: bpf: add checksum helpers") Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann --- include/uapi/linux/bpf.h|2 +- include/uapi/linux/filter.h |7 +-- net/core/filter.c | 41 - net/sched/act_bpf.c |3 +++ net/sched/cls_bpf.c |3 +++ samples/bpf/tcbpf1_kern.c | 16 ++-- 6 files changed, 50 insertions(+), 22 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 5c1cee11f777..a9ebdf5701e8 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -177,7 +177,7 @@ enum bpf_func_id { /** * skb_store_bytes(skb, offset, from, len, flags) - store bytes into packet * @skb: pointer to skb -* @offset: offset within packet from skb->data +* @offset: offset within packet from skb->mac_header * @from: pointer where to copy bytes from * @len: number of bytes to store into packet * @flags: bit 0 - if true, recompute skb->csum diff --git 
a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h index 34c7936ca114..c97340e43dd6 100644 --- a/include/uapi/linux/filter.h +++ b/include/uapi/linux/filter.h @@ -79,8 +79,11 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ #define SKF_AD_RANDOM 56 #define SKF_AD_VLAN_TPID 60 #define SKF_AD_MAX 64 -#define SKF_NET_OFF (-0x10) -#define SKF_LL_OFF(-0x20) +#define SKF_NET_OFF(-0x10) +#define SKF_LL_OFF (-0x20) + +#define BPF_NET_OFFSKF_NET_OFF +#define BPF_LL_OFF SKF_LL_OFF #endif /* _UAPI__LINUX_FILTER_H__ */ diff --git a/net/core/filter.c b/net/core/filter.c index b669e75d2b36..bf831a85c315 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1175,12 +1175,27 @@ int sk_attach_bpf(u32 ufd, struct sock *sk) return 0; } +/** + * bpf_skb_clone_not_writable - is the header of a clone not writable + * @skb: buffer to check + * @len: length up to which to write, can be negative + * + * Returns true if modifying the header part of the cloned buffer + * does require the data to be copied. I.e. this version works with + * negative lengths needed for eBPF case! 
+ */ +static bool bpf_skb_clone_unwritable(const struct sk_buff *skb, int len) +{ + return skb_header_cloned(skb) || + (int) skb_headroom(skb) + len > skb->hdr_len; +} + #define BPF_RECOMPUTE_CSUM(flags) ((flags) & 1) static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, u64 r4, u64 flags) { struct sk_buff *skb = (struct sk_buff *) (long) r1; - unsigned int offset = (unsigned int) r2; + int offset = (int) r2; void *from = (void *) (long) r3; unsigned int len = (unsigned int) r4; char buf[16]; @@ -1194,10 +1209,12 @@ static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, u64 r4, u64 flags) * * so check for invalid 'offset' and too large 'len' */ - if (unlikely(offset > 0x || len > sizeof(buf))) + if (unlikely((u32) offset > 0x || len > sizeof(buf))) return -EFAULT; - if (skb_cloned(skb) && !skb_clone_writable(skb, offset + len)) + offset -= skb->data - skb_mac_header(skb); + if (unlikely(skb_cloned(skb) && +bpf_skb_clone_unwritable(skb, offset + len))) return -EFAULT; ptr = skb_header_pointer(skb, offset, len, buf); @@ -1232,15 +1249,18 @@ const struct bpf_func_proto bpf_skb_store_bytes_proto = { #define BPF_HEADER_FIELD_SIZE(flags) ((flags) & 0x0f) #define BPF_IS_PSEUDO_HEADER(flags)((flags) & 0x10) -static u64 bpf_l3_csum_replace(u64 r1, u64 offset, u64 from, u64 to, u64 flags) +static u64 bpf_l3_csum_replace(u64 r1, u64 r2, u64 from, u64 to, u64 flags) { struct sk_buff *skb = (stru
Re: [PATCH iproute2 -next] tc: built-in eBPF exec proxy
On Wed, Apr 15, 2015, at 16:52, Daniel Borkmann wrote: > This work follows upon commit 6256f8c9e45f ("tc, bpf: finalize eBPF > support for cls and act front-end") and takes up the idea proposed by > Hannes Frederic Sowa to spawn a shell (or any other command) that holds > generated eBPF map file descriptors. > > File descriptors, based on their id, are being fetched from the same > unix domain socket as demonstrated in the bpf_agent, the shell spawned > via execvpe(2) and the map fds passed over the environment, and thus > are made available to applications in the fashion of std{in,out,err} > for read/write access, for example in case of iproute2's examples/bpf/: > > # env | grep BPF > BPF_NUM_MAPS=3 > BPF_MAP1=6<- BPF_MAP_ID_QUEUE (id 1) > BPF_MAP0=5<- BPF_MAP_ID_PROTO (id 0) > BPF_MAP2=7<- BPF_MAP_ID_DROPS (id 2) > > # ls -la /proc/self/fd > [...] > lrwx--. 1 root root 64 Apr 14 16:46 0 -> /dev/pts/4 > lrwx--. 1 root root 64 Apr 14 16:46 1 -> /dev/pts/4 > lrwx--. 1 root root 64 Apr 14 16:46 2 -> /dev/pts/4 > [...] > lrwx--. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map > lrwx--. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map > lrwx--. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map > > The advantage (as opposed to the direct/native usage) is that now the > shell is map fd owner and applications can terminate and easily reattach > to descriptors w/o any kernel changes. Moreover, multiple applications > can easily read/write eBPF maps simultaneously. > > To further allow users for experimenting with that, next step is to add > a small helper that can get along with simple data types, so that also > shell scripts can make use of bpf syscall, f.e to read/write into maps. > > Generally, this allows for prepopulating maps, or any runtime altering > which could influence eBPF program behaviour (f.e. different run-time > classifications, skb modifications, ...), dumping of statistics, etc. 
>
> Reference:
> http://thread.gmane.org/gmane.linux.network/357471/focus=357860
> Suggested-by: Hannes Frederic Sowa
> Signed-off-by: Daniel Borkmann

Great work!

Reviewed-by: Hannes Frederic Sowa
[Patch net-next v2] fou: avoid missing unlock in failure path
Fixes: 7a6c8c34e5b7 ("fou: implement FOU_CMD_GET")
Reported-by: Dan Carpenter
Cc: Dan Carpenter
Signed-off-by: Cong Wang
---
 net/ipv4/fou.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index af150b4..34968cd 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -711,11 +711,10 @@ static int fou_nl_dump(struct sk_buff *skb, struct netlink_callback *cb)
 					    cb->nlh->nlmsg_seq, NLM_F_MULTI,
 					    skb, FOU_CMD_GET);
 		if (ret)
-			goto done;
+			break;
 	}
 	mutex_unlock(&fn->fou_lock);
-done:
 	cb->args[0] = idx;
 	return skb->len;
 }
--
1.8.3.1
Re: [patch net-next v2 0/2] switchdev: unify naming prefix
Wed, Apr 15, 2015 at 08:42:12PM CEST, da...@davemloft.net wrote: >From: Jiri Pirko >Date: Wed, 15 Apr 2015 20:38:15 +0200 > >> Wed, Apr 15, 2015 at 07:54:56PM CEST, go...@cumulusnetworks.com wrote: >>>On Mon, Apr 13, 2015 at 11:25:02PM +0200, Jiri Pirko wrote: Turned out that "switchdev" sticks. So just unify all releated terms to use this prefix. Jiri Pirko (2): switchdev: s/netdev_switch_/switchdev_/ and s/NETDEV_SWITCH_/SWITCHDEV_/ switchdev: s/swdev_/switchdev_/ drivers/net/bonding/bond_main.c | 4 +- drivers/net/ethernet/rocker/rocker.c | 47 +- drivers/net/team/team.c | 4 +- include/linux/netdevice.h| 2 +- include/net/switchdev.h | 140 ++-- net/bridge/br.c | 22 ++--- net/bridge/br_netlink.c | 6 +- net/bridge/br_stp.c | 2 +- net/core/net-sysfs.c | 2 +- net/core/rtnetlink.c | 2 +- net/dsa/slave.c | 8 +- net/ipv4/fib_trie.c | 40 net/switchdev/switchdev.c| 174 +-- 13 files changed, 224 insertions(+), 229 deletions(-) >>> >>>Acked-by: Andy Gospodarek >>> >> >> Unfortunatelly, this patchset did not make it into this net-next. Will >> resubmit with the acks once the merge-window closes. > >Scott said he would integrate the name change into his series, since it's >now deferred as well. Oh, ok, I missed this info. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch net-next v2 0/2] switchdev: unify naming prefix
From: Jiri Pirko Date: Wed, 15 Apr 2015 20:38:15 +0200 > Wed, Apr 15, 2015 at 07:54:56PM CEST, go...@cumulusnetworks.com wrote: >>On Mon, Apr 13, 2015 at 11:25:02PM +0200, Jiri Pirko wrote: >>> Turned out that "switchdev" sticks. So just unify all releated terms to >>> use this prefix. >>> >>> Jiri Pirko (2): >>> switchdev: s/netdev_switch_/switchdev_/ and >>> s/NETDEV_SWITCH_/SWITCHDEV_/ >>> switchdev: s/swdev_/switchdev_/ >>> >>> drivers/net/bonding/bond_main.c | 4 +- >>> drivers/net/ethernet/rocker/rocker.c | 47 +- >>> drivers/net/team/team.c | 4 +- >>> include/linux/netdevice.h| 2 +- >>> include/net/switchdev.h | 140 ++-- >>> net/bridge/br.c | 22 ++--- >>> net/bridge/br_netlink.c | 6 +- >>> net/bridge/br_stp.c | 2 +- >>> net/core/net-sysfs.c | 2 +- >>> net/core/rtnetlink.c | 2 +- >>> net/dsa/slave.c | 8 +- >>> net/ipv4/fib_trie.c | 40 >>> net/switchdev/switchdev.c| 174 >>> +-- >>> 13 files changed, 224 insertions(+), 229 deletions(-) >> >>Acked-by: Andy Gospodarek >> > > Unfortunatelly, this patchset did not make it into this net-next. Will > resubmit with the acks once the merge-window closes. Scott said he would integrate the name change into his series, since it's now deferred as well. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch net-next v2 0/2] switchdev: unify naming prefix
Wed, Apr 15, 2015 at 07:54:56PM CEST, go...@cumulusnetworks.com wrote:
>On Mon, Apr 13, 2015 at 11:25:02PM +0200, Jiri Pirko wrote:
>> Turned out that "switchdev" sticks. So just unify all related terms to
>> use this prefix.
>>
>> Jiri Pirko (2):
>>   switchdev: s/netdev_switch_/switchdev_/ and
>>     s/NETDEV_SWITCH_/SWITCHDEV_/
>>   switchdev: s/swdev_/switchdev_/
>>
>>  drivers/net/bonding/bond_main.c      |   4 +-
>>  drivers/net/ethernet/rocker/rocker.c |  47 +-
>>  drivers/net/team/team.c              |   4 +-
>>  include/linux/netdevice.h            |   2 +-
>>  include/net/switchdev.h              | 140 ++--
>>  net/bridge/br.c                      |  22 ++---
>>  net/bridge/br_netlink.c              |   6 +-
>>  net/bridge/br_stp.c                  |   2 +-
>>  net/core/net-sysfs.c                 |   2 +-
>>  net/core/rtnetlink.c                 |   2 +-
>>  net/dsa/slave.c                      |   8 +-
>>  net/ipv4/fib_trie.c                  |  40
>>  net/switchdev/switchdev.c            | 174 +--
>>  13 files changed, 224 insertions(+), 229 deletions(-)
>
>Acked-by: Andy Gospodarek
>

Unfortunately, this patchset did not make it into this net-next. Will
resubmit with the acks once the merge-window closes.

Thanks.
Re: [PATCH stable 3.10-3.16] tcp: Fix crash in TCP Fast Open
From: Eric Dumazet
Date: Wed, 15 Apr 2015 11:22:44 -0700

> On Wed, 2015-04-15 at 19:00 +0100, Ben Hutchings wrote:
>> Commit 355a901e6cf1 ("tcp: make connect() mem charging friendly")
>> changed tcp_send_syn_data() to perform an open-coded copy of the 'syn'
>> skb rather than using skb_copy_expand().
>>
>> The open-coded copy does not cover the skb_shared_info::gso_segs
>> field, so in the new skb it is left set to 0. When this commit was
>> backported into stable branches between 3.10.y and 3.16.7-ckty
>> inclusive, it triggered the BUG() in tcp_transmit_skb().
>>
>> Since Linux 3.18 the GSO segment count is kept in the
>> tcp_skb_cb::tcp_gso_segs field and tcp_send_syn_data() does copy the
>> tcp_skb_cb structure to the new skb, so mainline and newer stable
>> branches are not affected.
>>
>> Set skb_shared_info::gso_segs to the correct value of 1.
>>
>> Signed-off-by: Ben Hutchings
 ...
> Looks good to me, thanks Ben!
>
> Acked-by: Eric Dumazet

Ben, thanks for taking care of this.
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:
> Well, I'm not sure that it is George and Jonathan themselves who don't
> want to change a sysctl, but the customers who would have to tweak that
> in their VMs?

Keep in mind some VM users install custom qdisc, or even custom TCP sysctls.
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
> Maybe you should stop wasting all of our time and just tell us what
> you're thinking.

I think you are making me waste my time. I already gave all the hints in
prior discussions.

Rome was not built in one day.
Re: [PATCH stable 3.10-3.16] tcp: Fix crash in TCP Fast Open
On Wed, 2015-04-15 at 19:00 +0100, Ben Hutchings wrote:
> Commit 355a901e6cf1 ("tcp: make connect() mem charging friendly")
> changed tcp_send_syn_data() to perform an open-coded copy of the 'syn'
> skb rather than using skb_copy_expand().
>
> The open-coded copy does not cover the skb_shared_info::gso_segs
> field, so in the new skb it is left set to 0. When this commit was
> backported into stable branches between 3.10.y and 3.16.7-ckty
> inclusive, it triggered the BUG() in tcp_transmit_skb().
>
> Since Linux 3.18 the GSO segment count is kept in the
> tcp_skb_cb::tcp_gso_segs field and tcp_send_syn_data() does copy the
> tcp_skb_cb structure to the new skb, so mainline and newer stable
> branches are not affected.
>
> Set skb_shared_info::gso_segs to the correct value of 1.
>
> Signed-off-by: Ben Hutchings
> ---
>  net/ipv4/tcp_output.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index d5457e4..1ea0a07 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2992,6 +2992,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
> 		goto fallback;
> 	syn_data->ip_summed = CHECKSUM_PARTIAL;
> 	memcpy(syn_data->cb, syn->cb, sizeof(syn->cb));
> +	skb_shinfo(syn_data)->gso_segs = 1;
> 	if (unlikely(memcpy_fromiovecend(skb_put(syn_data, space),
> 					 fo->data->msg_iov, 0, space))) {
> 		kfree_skb(syn_data);
>

Looks good to me, thanks Ben!

Acked-by: Eric Dumazet
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 11:08 AM, Eric Dumazet wrote: > On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote: > > > Have you tested this patch on a NIC without GSO/TSO ? > > > This would allow more than 500 packets for a single flow. > > > Hello bufferbloat. > > Wouldn't the fq_codel qdisc on that interface address that problem? > Last time I checked, default qdisc was pfifo_fast. Bummer. > These guys do not want to change a sysctl, how pfifo_fast will magically become fq_codel ? Well, I'm not sure that it is George and Jonathan themselves who don't want to change a sysctl, but the customers who would have to tweak that in their VMs? rick -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 18:58 +0100, Stefano Stabellini wrote: > On Wed, 15 Apr 2015, Eric Dumazet wrote: > > On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote: > > > > > Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is > > > *already* larger for Xen; that calculation mentioned in the comment is > > > *already* doing the right thing. > > > > Sigh. > > > > 1ms of traffic at 40Gbit is 5 MBytes > > > > The reason for the cap to /proc/sys/net/ipv4/tcp_limit_output_bytes is > > to provide the limitation of ~2 TSO packets, which _also_ is documented. > > > > Without this limitation, 5 MBytes could translate to : Fill the queue, > > do not limit. > > > > If a particular driver needs to extend the limit, fine, document it and > > take actions. > > What actions do you have in mind exactly? It would be great if you > could suggest how to move forward from here, beside documentation. > > I don't think we can really expect every user that spawns a new VM in > the cloud to manually echo blah > > /proc/sys/net/ipv4/tcp_limit_output_bytes to an init script. I cannot > imagine that would work well. I already pointed a discussion on the same topic for wireless adapters. Some adapters have a ~3 ms TX completion delay, so the 1ms assumption in TCP stack is limiting the max throughput. All I hear here are unreasonable requests, marketing driven. If a global sysctl is not good enough, make it a per device value. We already have netdev->gso_max_size and netdev->gso_max_segs which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs What about you guys provide a new netdev->I_need_to_have_big_buffers_to_cope_with_my_latencies. Do not expect me to fight bufferbloat alone. Be part of the challenge, instead of trying to get back to proven bad solutions. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote: > > > > Have you tested this patch on a NIC without GSO/TSO ? > > > > This would allow more than 500 packets for a single flow. > > > > Hello bufferbloat. > > Wouldn't the fq_codel qdisc on that interface address that problem? Last time I checked, default qdisc was pfifo_fast. These guys do not want to change a sysctl, how pfifo_fast will magically become fq_codel ? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 06:52 PM, Eric Dumazet wrote: > On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote: > >> So you'd be OK with a patch like this? (With perhaps a better changelog?) >> >> -George >> >> --- >> TSQ: Raise default static TSQ limit >> >> A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the >> size of actual packets and the amount of data being transmitted. >> Raise the default static limit to allow that new limit to actually >> come into effect. >> >> This fixes a regression where NICs with large transmit completion >> times (such as xennet) had a 30% hit unless the user manually tweaked >> the value in /proc. >> >> Signed-off-by: George Dunlap >> >> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c >> index 1db253e..8ad7cdf 100644 >> --- a/net/ipv4/tcp_output.c >> +++ b/net/ipv4/tcp_output.c >> @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1; >> */ >> int sysctl_tcp_workaround_signed_windows __read_mostly = 0; >> >> -/* Default TSQ limit of two TSO segments */ >> -int sysctl_tcp_limit_output_bytes __read_mostly = 131072; >> +/* Static TSQ limit. A more dynamic limit is calculated in >> tcp_write_xmit. */ >> +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576; >> >> /* This limits the percentage of the congestion window which we >> * will allow a single TSO frame to consume. Building TSO frames >> > > Have you tested this patch on a NIC without GSO/TSO ? > > This would allow more than 500 packets for a single flow. > > Hello bufferbloat. > > So my answer to this patch is a no. You said: "I asked you guys to make a test by increasing sysctl_tcp_limit_output_bytes You have no need to explain me the code I wrote, thank you." Which implies to me that you think you've already pointed us to the answer you want and we're just not getting it. Maybe you should stop wasting all of our time and just tell us what you're thinking. 
-George -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 15 Apr 2015, Eric Dumazet wrote: > On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote: > > > Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is > > *already* larger for Xen; that calculation mentioned in the comment is > > *already* doing the right thing. > > Sigh. > > 1ms of traffic at 40Gbit is 5 MBytes > > The reason for the cap to /proc/sys/net/ipv4/tcp_limit_output_bytes is > to provide the limitation of ~2 TSO packets, which _also_ is documented. > > Without this limitation, 5 MBytes could translate to : Fill the queue, > do not limit. > > If a particular driver needs to extend the limit, fine, document it and > take actions. What actions do you have in mind exactly? It would be great if you could suggest how to move forward from here, beside documentation. I don't think we can really expect every user that spawns a new VM in the cloud to manually echo blah > /proc/sys/net/ipv4/tcp_limit_output_bytes to an init script. I cannot imagine that would work well. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH stable 3.10-3.16] tcp: Fix crash in TCP Fast Open
Commit 355a901e6cf1 ("tcp: make connect() mem charging friendly") changed tcp_send_syn_data() to perform an open-coded copy of the 'syn' skb rather than using skb_copy_expand(). The open-coded copy does not cover the skb_shared_info::gso_segs field, so in the new skb it is left set to 0. When this commit was backported into stable branches between 3.10.y and 3.16.7-ckty inclusive, it triggered the BUG() in tcp_transmit_skb(). Since Linux 3.18 the GSO segment count is kept in the tcp_skb_cb::tcp_gso_segs field and tcp_send_syn_data() does copy the tcp_skb_cb structure to the new skb, so mainline and newer stable branches are not affected. Set skb_shared_info::gso_segs to the correct value of 1. Signed-off-by: Ben Hutchings --- net/ipv4/tcp_output.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index d5457e4..1ea0a07 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2992,6 +2992,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn) goto fallback; syn_data->ip_summed = CHECKSUM_PARTIAL; memcpy(syn_data->cb, syn->cb, sizeof(syn->cb)); + skb_shinfo(syn_data)->gso_segs = 1; if (unlikely(memcpy_fromiovecend(skb_put(syn_data, space), fo->data->msg_iov, 0, space))) { kfree_skb(syn_data); -- Ben Hutchings Editing code like this is akin to sticking plasters on the bleeding stump of a severed limb. - me, 29 June 1999
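The failure mode described above is easy to reproduce in a tiny user-space model: an open-coded copy that duplicates the control-block bytes but forgets a shared-info field leaves that field at its freshly-allocated value of zero. The struct and function names below are hypothetical stand-ins for sk_buff/skb_shared_info, not kernel code:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for sk_buff / skb_shared_info. */
struct fake_shinfo {
	unsigned short gso_segs;
};

struct fake_skb {
	unsigned char cb[48];		/* models tcp_skb_cb storage */
	struct fake_shinfo shinfo;
};

/* The open-coded copy from the buggy backport: cb[] is copied, but the
 * shared-info area of the new skb keeps its initial value of 0. */
static void open_coded_copy(struct fake_skb *dst, const struct fake_skb *src)
{
	memcpy(dst->cb, src->cb, sizeof(src->cb));
	/* bug: dst->shinfo.gso_segs is never set */
}

/* With the one-line fix from the patch applied. */
static void open_coded_copy_fixed(struct fake_skb *dst,
				  const struct fake_skb *src)
{
	open_coded_copy(dst, src);
	dst->shinfo.gso_segs = 1;	/* the SYN is a single segment */
}
```

A gso_segs of 0 is exactly the inconsistent state that tripped the BUG() in tcp_transmit_skb() on the affected stable branches.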
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
> Have you tested this patch on a NIC without GSO/TSO ? > This would allow more than 500 packets for a single flow. > Hello bufferbloat. Wouldn't the fq_codel qdisc on that interface address that problem? rick -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch net-next v2 0/2] switchdev: unify naming prefix
On Mon, Apr 13, 2015 at 11:25:02PM +0200, Jiri Pirko wrote: > Turned out that "switchdev" sticks. So just unify all releated terms to > use this prefix. > > Jiri Pirko (2): > switchdev: s/netdev_switch_/switchdev_/ and > s/NETDEV_SWITCH_/SWITCHDEV_/ > switchdev: s/swdev_/switchdev_/ > > drivers/net/bonding/bond_main.c | 4 +- > drivers/net/ethernet/rocker/rocker.c | 47 +- > drivers/net/team/team.c | 4 +- > include/linux/netdevice.h| 2 +- > include/net/switchdev.h | 140 ++-- > net/bridge/br.c | 22 ++--- > net/bridge/br_netlink.c | 6 +- > net/bridge/br_stp.c | 2 +- > net/core/net-sysfs.c | 2 +- > net/core/rtnetlink.c | 2 +- > net/dsa/slave.c | 8 +- > net/ipv4/fib_trie.c | 40 > net/switchdev/switchdev.c| 174 > +-- > 13 files changed, 224 insertions(+), 229 deletions(-) Acked-by: Andy Gospodarek -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote: > So you'd be OK with a patch like this? (With perhaps a better changelog?) > > -George > > --- > TSQ: Raise default static TSQ limit > > A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the > size of actual packets and the amount of data being transmitted. > Raise the default static limit to allow that new limit to actually > come into effect. > > This fixes a regression where NICs with large transmit completion > times (such as xennet) had a 30% hit unless the user manually tweaked > the value in /proc. > > Signed-off-by: George Dunlap > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 1db253e..8ad7cdf 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1; > */ > int sysctl_tcp_workaround_signed_windows __read_mostly = 0; > > -/* Default TSQ limit of two TSO segments */ > -int sysctl_tcp_limit_output_bytes __read_mostly = 131072; > +/* Static TSQ limit. A more dynamic limit is calculated in > tcp_write_xmit. */ > +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576; > > /* This limits the percentage of the congestion window which we > * will allow a single TSO frame to consume. Building TSO frames > Have you tested this patch on a NIC without GSO/TSO ? This would allow more than 500 packets for a single flow. Hello bufferbloat. So my answer to this patch is a no. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 06:29 PM, Eric Dumazet wrote: > On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote: >> On 04/15/2015 05:38 PM, Eric Dumazet wrote: >>> My thoughts that instead of these long talks you should guys read the >>> code : >>> >>> /* TCP Small Queues : >>> * Control number of packets in qdisc/devices to two >>> packets / or ~1 ms. >>> * This allows for : >>> * - better RTT estimation and ACK scheduling >>> * - faster recovery >>> * - high rates >>> * Alas, some drivers / subsystems require a fair amount >>> * of queued bytes to ensure line rate. >>> * One example is wifi aggregation (802.11 AMPDU) >>> */ >>> limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10); >>> limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes); >>> >>> >>> Then you'll see that most of your questions are already answered. >>> >>> Feel free to try to improve the behavior, if it does not hurt critical >>> workloads >>> like TCP_RR, where we we send very small messages, millions times per >>> second. >> >> First of all, with regard to critical workloads, once this patch gets >> into distros, *normal TCP streams* on every VM running on Amazon, >> Rackspace, Linode, &c will get a 30% hit in performance *by default*. >> Normal TCP streams on xennet *are* a critical workload, and deserve the >> same kind of accommodation as TCP_RR (if not more). The same goes for >> virtio_net. >> >> Secondly, according to Stefano's and Jonathan's tests, >> tcp_limit_output_bytes completely fixes the problem for Xen. >> >> Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is >> *already* larger for Xen; that calculation mentioned in the comment is >> *already* doing the right thing. >> >> As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an >> automatic TSQ calculation which is actually choosing an effective value >> for xennet. >> >> It certainly makes sense for sysctl_tcp_limit_output_bytes to be an >> actual maximum limit. 
I went back and looked at the original patch >> which introduced it (46d3ceabd), and it looks to me like it was designed >> to be a rough, quick estimate of "two packets outstanding" (by choosing >> the maximum size of the packet, 64k, and multiplying it by two). >> >> Now that you have a better algorithm -- the size of 2 actual packets or >> the amount transmitted in 1ms -- it seems like the default >> sysctl_tcp_limit_output_bytes should be higher, and let the automatic >> TSQ you have on the first line throttle things down when necessary. > > > I asked you guys to make a test by increasing > sysctl_tcp_limit_output_bytes So you'd be OK with a patch like this? (With perhaps a better changelog?) -George --- TSQ: Raise default static TSQ limit A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the size of actual packets and the amount of data being transmitted. Raise the default static limit to allow that new limit to actually come into effect. This fixes a regression where NICs with large transmit completion times (such as xennet) had a 30% hit unless the user manually tweaked the value in /proc. Signed-off-by: George Dunlap diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 1db253e..8ad7cdf 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1; */ int sysctl_tcp_workaround_signed_windows __read_mostly = 0; -/* Default TSQ limit of two TSO segments */ -int sysctl_tcp_limit_output_bytes __read_mostly = 131072; +/* Static TSQ limit. A more dynamic limit is calculated in tcp_write_xmit. */ +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576; /* This limits the percentage of the congestion window which we * will allow a single TSO frame to consume. Building TSO frames
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote: > Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is > *already* larger for Xen; that calculation mentioned in the comment is > *already* doing the right thing. Sigh. 1ms of traffic at 40Gbit is 5 MBytes The reason for the cap to /proc/sys/net/ipv4/tcp_limit_output_bytes is to provide the limitation of ~2 TSO packets, which _also_ is documented. Without this limitation, 5 MBytes could translate to : Fill the queue, do not limit. If a particular driver needs to extend the limit, fine, document it and take actions. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
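The figures Eric cites check out arithmetically: at 40 Gbit/s, 1 ms of traffic is 40e9 / 8 * 1e-3 = 5,000,000 bytes, while the sysctl default of 131072 is exactly two maximum-size (64 KB) TSO packets. A small helper (hypothetical, not kernel code) makes the calculation explicit:

```c
#include <assert.h>

/* Bytes transferred in interval_us microseconds at rate_bps bits/s. */
static unsigned long long bytes_per_interval(unsigned long long rate_bps,
					     unsigned int interval_us)
{
	/* divide by 8 first to keep intermediate values small */
	return rate_bps / 8 * interval_us / 1000000ULL;
}
```

This is why Eric objects to letting the 1 ms pacing term stand alone: without the static cap, a single fast flow could legitimately queue megabytes.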
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote: > On 04/15/2015 05:38 PM, Eric Dumazet wrote: > > My thoughts that instead of these long talks you should guys read the > > code : > > > > /* TCP Small Queues : > > * Control number of packets in qdisc/devices to two > > packets / or ~1 ms. > > * This allows for : > > * - better RTT estimation and ACK scheduling > > * - faster recovery > > * - high rates > > * Alas, some drivers / subsystems require a fair amount > > * of queued bytes to ensure line rate. > > * One example is wifi aggregation (802.11 AMPDU) > > */ > > limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10); > > limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes); > > > > > > Then you'll see that most of your questions are already answered. > > > > Feel free to try to improve the behavior, if it does not hurt critical > > workloads > > like TCP_RR, where we we send very small messages, millions times per > > second. > > First of all, with regard to critical workloads, once this patch gets > into distros, *normal TCP streams* on every VM running on Amazon, > Rackspace, Linode, &c will get a 30% hit in performance *by default*. > Normal TCP streams on xennet *are* a critical workload, and deserve the > same kind of accommodation as TCP_RR (if not more). The same goes for > virtio_net. > > Secondly, according to Stefano's and Jonathan's tests, > tcp_limit_output_bytes completely fixes the problem for Xen. > > Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is > *already* larger for Xen; that calculation mentioned in the comment is > *already* doing the right thing. > > As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an > automatic TSQ calculation which is actually choosing an effective value > for xennet. > > It certainly makes sense for sysctl_tcp_limit_output_bytes to be an > actual maximum limit. 
I went back and looked at the original patch > which introduced it (46d3ceabd), and it looks to me like it was designed > to be a rough, quick estimate of "two packets outstanding" (by choosing > the maximum size of the packet, 64k, and multiplying it by two). > > Now that you have a better algorithm -- the size of 2 actual packets or > the amount transmitted in 1ms -- it seems like the default > sysctl_tcp_limit_output_bytes should be higher, and let the automatic > TSQ you have on the first line throttle things down when necessary. I asked you guys to make a test by increasing sysctl_tcp_limit_output_bytes You have no need to explain me the code I wrote, thank you. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 05:38 PM, Eric Dumazet wrote: > My thoughts that instead of these long talks you should guys read the > code : > > /* TCP Small Queues : > * Control number of packets in qdisc/devices to two packets > / or ~1 ms. > * This allows for : > * - better RTT estimation and ACK scheduling > * - faster recovery > * - high rates > * Alas, some drivers / subsystems require a fair amount > * of queued bytes to ensure line rate. > * One example is wifi aggregation (802.11 AMPDU) > */ > limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10); > limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes); > > > Then you'll see that most of your questions are already answered. > > Feel free to try to improve the behavior, if it does not hurt critical > workloads > like TCP_RR, where we we send very small messages, millions times per second. First of all, with regard to critical workloads, once this patch gets into distros, *normal TCP streams* on every VM running on Amazon, Rackspace, Linode, &c will get a 30% hit in performance *by default*. Normal TCP streams on xennet *are* a critical workload, and deserve the same kind of accommodation as TCP_RR (if not more). The same goes for virtio_net. Secondly, according to Stefano's and Jonathan's tests, tcp_limit_output_bytes completely fixes the problem for Xen. Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is *already* larger for Xen; that calculation mentioned in the comment is *already* doing the right thing. As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an automatic TSQ calculation which is actually choosing an effective value for xennet. It certainly makes sense for sysctl_tcp_limit_output_bytes to be an actual maximum limit. I went back and looked at the original patch which introduced it (46d3ceabd), and it looks to me like it was designed to be a rough, quick estimate of "two packets outstanding" (by choosing the maximum size of the packet, 64k, and multiplying it by two). 
Now that you have a better algorithm -- the size of 2 actual packets or the amount transmitted in 1ms -- it seems like the default sysctl_tcp_limit_output_bytes should be higher, and let the automatic TSQ you have on the first line throttle things down when necessary. -George -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
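The interaction George describes can be seen by transcribing the two quoted kernel lines into a standalone helper (a sketch with made-up example values; the real code lives in tcp_write_xmit()):

```c
#include <assert.h>

/* Mirror of:
 *   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
 *   limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 */
static unsigned int tsq_limit(unsigned int skb_truesize,
			      unsigned long long pacing_rate, /* bytes/s */
			      unsigned int sysctl_limit)
{
	unsigned long long limit = 2ULL * skb_truesize;

	if ((pacing_rate >> 10) > limit)	/* ~1 ms worth of data */
		limit = pacing_rate >> 10;
	if (limit > sysctl_limit)		/* static cap under debate */
		limit = sysctl_limit;
	return (unsigned int)limit;
}
```

With a pacing rate of 5 GB/s (40 Gbit/s) the dynamic term is ~4.8 MB, so the static cap is what actually limits the flow; raising the sysctl from 131072 to 1048576 therefore changes the effective limit for fast flows while leaving slow flows (limited by 2 * truesize) untouched, which is the crux of the disagreement.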
Re: [PATCH v2 net] bnx2x: Fix busy_poll vs netpoll
On Wed, 2015-04-15 at 14:03 +, Yuval Mintz wrote: > > > > > > BTW, this looks quite generic - isn't it possible to take it out of > > > the driver and push it into the networking core, uniforming it in the > > > process? > > > > Yes, this is the plan I have in mind, once net-next is opened again ;) > > > > Thanks ! > > > > I'm not familiar with any real strong test-suite for the busy poll, but I did > try > running all kinds of netperf connections with various busy_{poll,read} values > and performance seemed reasonable, and verified that netconsole manages > to work. > > So for all it's worth, > Tested-by: Yuval Mintz Thanks a lot ! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] bpf: fix verifier memory corruption
On 04/15/2015 12:57 AM, Alexei Starovoitov wrote: Due to missing bounds check the DAG pass of the BPF verifier can corrupt the memory which can cause random crashes during program loading: [8.449451] BUG: unable to handle kernel paging request at [8.451293] IP: [] kmem_cache_alloc_trace+0x8d/0x2f0 [8.452329] Oops: [#1] SMP [8.452329] Call Trace: [8.452329] [] bpf_check+0x852/0x2000 [8.452329] [] bpf_prog_load+0x1e4/0x310 [8.452329] [] ? might_fault+0x5f/0xb0 [8.452329] [] SyS_bpf+0x806/0xa30 Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier") Signed-off-by: Alexei Starovoitov As far as I can tell, looks good to me. Any other access to a next instruction elsewhere would be blocked from push_insn() with an error. Acked-by: Daniel Borkmann -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
On Wed, 2015-04-15 at 15:36 +0100, Ian Campbell wrote: > On Wed, 2015-04-15 at 15:19 +0100, George Dunlap wrote: > > On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley > [...] > > > From a networking point of view, the backend is a switch. Is it OK to > > > consider the packet to have been transmitted from the guest point of > > > view once the backend is aware of the packet? > > > > > > This would help justify the skb_orphan() in the frontend. > > > > This sounds sensible to me, particularly if virtio_net is already doing it. > > I also find Malcolm's argument above pretty compelling. Yes, and then you'll have to help the virtio ongoing effort trying to get rid of this skb_orphan() Basically you're adding head of line blocking, as a single flow is able to fill your queue. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 14:43 +0100, George Dunlap wrote: > On Mon, Apr 13, 2015 at 2:49 PM, Eric Dumazet wrote: > > On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote: > > > >> Is the problem perhaps that netback/netfront delays TX completion? >> Would it be better to see if that can be addressed properly, so that > >> the original purpose of the patch (fighting bufferbloat) can be > >> achieved while not degrading performance for Xen? Or at least, so > >> that people get decent performance out of the box without having to > >> tweak TCP parameters? > > > > Sure, please provide a patch, that does not break back pressure. > > > > But just in case, if Xen performance relied on bufferbloat, it might be > > very difficult to reach a stable equilibrium : Any small change in stack > > or scheduling might introduce a significant difference in 'raw > > performance'. > > So help me understand this a little bit here. tcp_limit_output_bytes > limits the amount of data allowed to be "in-transit" between a send() > and the wire, is that right? > > And so the "bufferbloat" problem you're talking about here are TCP > buffers inside the kernel, and/or buffers in the NIC, is that right? > > So ideally, you want this to be large enough to fill the "pipeline" > all the way from send() down to actually getting out on the wire; > otherwise, you'll have gaps in the pipeline, and the machinery won't > be working at full throttle. > > And the reason it's a problem is that many NICs now come with large > send buffers; and effectively what happens then is that this makes the > "pipeline" longer -- as the buffer fills up, the time between send() > and the wire is increased. This increased latency causes delays in > round-trip-times and interferes with the mechanisms TCP uses to try to > determine what the actual sustainable rate of data transmission is. 
> > By limiting the number of "in-transit" bytes, you make sure that > neither the kernel nor the NIC are going to have packets queues up for > long lengths of time in buffers, and you keep this "pipeline" as close > to the actual minimal length of the pipeline as possible. And it > sounds like for your 40G NIC, 128k is big enough to fill the pipeline > without unduly making it longer by introducing buffering. > > Is that an accurate picture of what you're trying to achieve? > > But the problem for xennet (and a number of other drivers), as I > understand it, is that at the moment the "pipeline" itself is just > longer -- it just takes a longer time from the time you send a packet > to the time it actually gets out on the wire. > > So it's not actually accurate to say that "Xen performance relies on > bufferbloat". There's no buffering involved -- the pipeline is just > longer, and so to fill up the pipeline you need more data. > > Basically, to maximize throughput while minimizing buffering, for > *any* connection, tcp_limit_output_bytes should ideally be around > (min_tx_latency * max_bandwidth). For physical NICs, the minimum > latency is really small, but for xennet -- and I'm guessing for a lot > of virtualized cards -- the min_tx_latency will be a lot higher, > requiring a much higher ideal tcp_limit_output value. > > Rather than trying to pick a single value which will be good for all > NICs, it seems like it would make more sense to have this vary > depending on the parameters of the NIC. After all, for NICs that have > low throughput -- say, old 100MiB NICs -- even 128k may still > introduce a significant amount of buffering. > > Obviously one solution would be to allow the drivers themselves to set > the tcp_limit_output_bytes, but that seems like a maintenance > nightmare. > > Another simple solution would be to allow drivers to indicate whether > they have a high transmit latency, and have the kernel use a higher > value by default when that's the case. 
> > Probably the most sustainable solution would be to have the networking > layer keep track of the average and minimum transmit latencies, and > automatically adjust tcp_limit_output_bytes based on that. (Keeping > the minimum as well as the average because the whole problem with > bufferbloat is that the more data you give it, the longer the apparent > "pipeline" becomes.) > > Thoughts? My thought is that instead of these long talks you guys should read the code : /* TCP Small Queues : * Control number of packets in qdisc/devices to two packets / or ~1 ms. * This allows for : * - better RTT estimation and ACK scheduling * - faster recovery * - high rates * Alas, some drivers / subsystems require a fair amount * of queued bytes to ensure line rate. * One example is wifi aggregation (802.11 AMPDU) */ limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10); limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes); Then you'll see that most of your questions are already answered. Feel free to try to improve the behavior, if it does not hurt critical workloads like TCP_RR, where we send very small messages, millions of times per second. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] mac80211-hwsim: Pass rate-ctrl flags and tx-power to user-space
On 04/14/2015 01:15 AM, Johannes Berg wrote: > On Fri, 2015-04-03 at 14:12 -0700, gree...@candelatech.com wrote: > >> +/* Auxilary info to allow user-space to better understand the rate */ >> +struct hwsim_tx_rate2 { >> +u16 rc_flags; /* rate-ctrl flags (see mac80211_rate_control_flags) */ > > I really don't like this - it ties internal API to userspace API. You > may argue that this is userspace that is for debug purposes only, but > I'm sure you'll also scratch your head very confused when I change the > rate control flags for any reason and your code breaks :) Is this a documentation only issue (at this point, while code matches)? Then later, we can add translation to keep user-space API the same as needed? And in that case, maybe I'll make it u32 to give room down the road? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
On Wed, Apr 15, 2015 at 12:35:16AM -0700, John Fastabend wrote: > > I'll dig up my scripts and post them to github this weekend. They > are a bit disorganized and all over the place at the moment. > > Maybe we can build a master repository. I know there are a lot of different > scripts running around, for example I already collected a few from > Jamal and I think Cong must have some as well. great. let's start building a proper testsuite. > diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c > index 4cdbfb8..a2542ac 100644 > --- a/net/sched/sch_ingress.c > +++ b/net/sched/sch_ingress.c > @@ -69,7 +69,7 @@ static int ingress_enqueue(struct sk_buff *skb, struct > Qdisc *sch) > switch (result) { > case TC_ACT_SHOT: > result = TC_ACT_SHOT; > - qdisc_qstats_drop(sch); > + qdisc_qstats_drop_cpu(sch); Your quick patch missed updating: - qdisc_bstats_update(sch, skb); + qdisc_bstats_update_cpu(sch, skb); I've tried the same :) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] bpf: fix verifier memory corruption
On Wed, Apr 15, 2015, at 18:07, Alexei Starovoitov wrote: > On 4/15/15 8:59 AM, Hannes Frederic Sowa wrote: > > On Di, 2015-04-14 at 15:57 -0700, Alexei Starovoitov wrote: > >> Due to missing bounds check the DAG pass of the BPF verifier can corrupt > >> the memory which can cause random crashes during program loading: > >> > >> [8.449451] BUG: unable to handle kernel paging request at > >> [8.451293] IP: [] kmem_cache_alloc_trace+0x8d/0x2f0 > >> [8.452329] Oops: [#1] SMP > >> [8.452329] Call Trace: > >> [8.452329] [] bpf_check+0x852/0x2000 > >> [8.452329] [] bpf_prog_load+0x1e4/0x310 > >> [8.452329] [] ? might_fault+0x5f/0xb0 > >> [8.452329] [] SyS_bpf+0x806/0xa30 > >> > >> Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier") > >> Signed-off-by: Alexei Starovoitov > >> --- > >> Many things need to align for this crash to be seen, yet I managed to hit > >> it. > >> In my case JA was last insn, 't' was 255 and explored_states array > >> had 256 elements. I've double checked other similar paths and all seems > >> clean. 
> >> > >> kernel/bpf/verifier.c |3 ++- > >> 1 file changed, 2 insertions(+), 1 deletion(-) > >> > >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > >> index a28e09c7825d..36508e69e92a 100644 > >> --- a/kernel/bpf/verifier.c > >> +++ b/kernel/bpf/verifier.c > >> @@ -1380,7 +1380,8 @@ peek_stack: > >>/* tell verifier to check for equivalent states > >> * after every call and jump > >> */ > >> - env->explored_states[t + 1] = STATE_LIST_MARK; > >> + if (t + 1 < insn_cnt) > >> + env->explored_states[t + 1] = STATE_LIST_MARK; > >>} else { > >>/* conditional jump with two edges */ > >>ret = push_insn(t, t + 1, FALLTHROUGH, env); > > > > Quick review: > > > > We have env->explored_states[t+1] access in the > > > > } else { > > /* conditional jump with two edges */ > > ret = push_insn(t, t + 1, FALLTHROUGH, env); > > if (ret == 1) > > goto peek_stack; > > else if (ret < 0) > > goto err_free; > > > ret = push_insn(t, t + insns[t].off + 1, BRANCH, > env); > > if (ret == 1) > > goto peek_stack; > > else if (ret < 0) > > goto err_free; > > } > > } else { > > > > > > push_insn call. At this point insn[t].off could be 0, no? > > insn[t].off can be anything, but the first thing that push_insn() > checks is: > if (w < 0 || w >= env->prog->len) > only then it does: > env->explored_states[w] = STATE_LIST_MARK; > so we're good there. > Though thanks for triple checking :) Sorry, yes. That check was too obvious to me. ;) Acked-by: Hannes Frederic Sowa Bye, Hannes
Re: [PATCH net-next v2 1/4] cxgb4/iw_cxgb4/cxgb4i: remove duplicate definitions
On Wed, Apr 15, 2015 at 06:34:27PM +0530, Varun Prakash wrote: > On Mon, Apr 13, 2015 at 10:38:08AM -0600, Jason Gunthorpe wrote: > > On Mon, Apr 13, 2015 at 07:34:23PM +0530, Varun Prakash wrote: > > > define struct ulptx_idata in common header file t4_msg.h > > > to remove duplicate definitions. > > > > The Infiniband side of this patch looks OK. > > > > Reviewed-By: Jason Gunthorpe > > > > Just some random thoughts on the other patches: > > - Try and use 'if (IS_ENABLED(CONFIG_XX))' over #ifdef > >to improve compile test coverage. This would drop a fair number > >of ifdefs. > > FCoE specific structures and functions are defined only if > CONFIG_CHELSIO_T4_FCOE is enabled. I saw a number of places that could be switched, you should look through. > #ifdef CONFIG_CHELSIO_T4_FCOE > void cxgb_fcoe_init_ddp(struct adapter *adap) > { > ... > } > #endif > > If CONFIG_CHELSIO_T4_FCOE is disabled then following code will > result in build error "implicit declaration of function cxgb_fcoe_init_ddp" > > if (IS_ENABLED(CONFIG_CHELSIO_T4_FCOE)) > cxgb_fcoe_init_ddp(adap); Commonly the ifdef would be moved to the function header and a static inline cxgb_fcoe_init_ddp(..) {} declared for the ifndef case. For the case of static function calls, just drop the ifdefing entirely. The compiler will silently drop unused static functions. Think carefully about ifdefing out structure members - I don't think saving a few bytes of heap in a driver like cxgb4 is worthwhile. I also just noticed that drivers/net/ethernet/chelsio/cxgb4/cxgb4_fcoe.c Is wrapped in a whole file #ifdef CONFIG_CHELSIO_T4_FCOE - please use the makefile instead for this. Jason
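The stub pattern Jason describes can be sketched in plain userspace C. All names below are illustrative stand-ins, not the real cxgb4 symbols: when the config switch is off, an empty static inline replaces the real function, so call sites stay completely free of #ifdef.

```c
#define CONFIG_FCOE 0   /* pretend the Kconfig option is off */

#if CONFIG_FCOE
int fcoe_init_ddp(int adap)
{
	return adap + 1;        /* "real" initialization */
}
#else
/* Empty static inline stub: compiles to nothing, so the call
 * site below needs no #ifdef of its own. */
static inline int fcoe_init_ddp(int adap)
{
	(void)adap;
	return 0;
}
#endif

int probe(int adap)
{
	return fcoe_init_ddp(adap);     /* ifdef-free call site */
}
```

With CONFIG_FCOE flipped to 1 the same call site picks up the real implementation, which is exactly why this is preferred over IS_ENABLED() when the function body itself cannot compile in the disabled configuration.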
Re: [PATCH 1/7] net: refactor __netif_receive_skb_core
On Fri, 10 Apr 2015 15:47:34 +0200 Daniel Borkmann wrote: > On 04/10/2015 02:15 PM, Pablo Neira Ayuso wrote: > > This patch splits __netif_receive_skb_core() in smaller functions to improve > > maintainability. > > > > The function __netif_receive_skb_core() has been split in two: > > > > * __netif_receive_skb_ingress(), to perform all actions up to > >ingress filtering. > > > > * __netif_receive_skb_finish(), if the ingress filter accepts this > >packet, pass it to the corresponding packet_type function handler for > > further > > processing. > > > > This patch also adds __NET_RX_ANOTHER_ROUND that is used when the packet is > > stripped off from the vlan header or in case the rx_handler needs it. > > > > This also prepares the introduction of the netfilter ingress hook. > > Out of curiosity, what is actually the performance impact on all > of this? We were just arguing on a different matter on two more > instructions in the fast-path, here it's refactoring the whole > function into several ones, I presume gcc won't inline it. Pablo asked me to performance test this change. Full test report below. The performance effect (of this patch) depends on the GCC compiler version. Two tests: 1. IP-forwarding (unloaded netfilter modules) 2. Early drop in iptables "raw" table With GCC 4.4.7, which does not inline the new functions (__netif_receive_skb_ingress and __netif_receive_skb_finish) the performance impact/regression is definitely measurable. With GCC 4.4.7: 1. IP-forwarding: +25.18 ns (slower) (-27776 pps) 2. Early-drop : +7.55 ns (slower) (-66577 pps) With GCC 4.9.1, the new functions get inlined, thus the refactor split-up of __netif_receive_skb_core() is basically "cancelled". Strangely there is a small improvement for forwarding, likely due to some lucky assembler reordering that gives fewer icache/fetch-misses. The early-drop improvement is below accuracy levels and cannot be trusted. With GCC 4.9.1: 1. IP-forwarding: -10.05ns (faster) (+17532 pps) 2.
Early-drop : -1.54ns (faster) (+16216 pps) below accuracy levels I don't know what to conclude, as the results depend on the compiler version... but these kinds of changes do affect performance, and should be tested/measured. - -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer Quick eval of Pablo's refactor of __netif_receive_skb_core == :Version: 0.1 :Author: Jesper Dangaard Brouer Summary === Pablo is refactoring __netif_receive_skb_core() into several functions, to allow for some other upcoming changes. Daniel Borkmann questioned if this will affect performance. Jesper tests this. The performance effect (of this patch) depends on the GCC compiler version. Two tests: 1. IP-forwarding (unloaded netfilter modules) 2. Early drop in iptables "raw" table With GCC 4.4.7, which does not inline the new functions (__netif_receive_skb_ingress and __netif_receive_skb_finish) the performance impact/regression is definitely measurable. With GCC 4.4.7: 1. IP-forwarding: +25.18 ns (slower) (-27776 pps) 2. Early-drop : +7.55 ns (slower) (-66577 pps) With GCC 4.9.1, the new functions get inlined, thus the refactor split-up of __netif_receive_skb_core() is basically "cancelled". Strangely there is a small improvement for forwarding, likely due to some lucky assembler reordering that gives fewer icache/fetch-misses. The early-drop improvement is below accuracy levels and cannot be trusted. With GCC 4.9.1: 1. IP-forwarding: -10.05ns (faster) (+17532 pps) 2. Early-drop : -1.54ns (faster) (+16216 pps) below accuracy levels Setup = On host: ivy Host ivy is the "sink" or DUT (Device Under Test). 
* CPU E5-2695ES @ 2.80GHz netfilter_unload_modules.sh netfilter_unload_modules.sh sudo rmmod nf_reject_ipv4 nf_reject_ipv6 base_device_setup.sh eth4 # 10G sink/receiving interface (ixgbe) base_device_setup.sh eth5 sudo ethtool --coalesce eth4 rx-usecs 30 Make a fake route to 198.18.0.0/15 out via eth5 sudo ip neigh add 192.168.21.66 dev eth5 lladdr 00:00:ba:d0:ba:d0 sudo ip route add 198.18.0.0/15 via 192.168.21.66 dev eth5 Disable power saving to get more accurate measurements (see blogpost):: $ sudo tuned-adm active Current active profile: latency-performance Service tuned: enabled, running Service ktune: enabled, running Early drop in raw - alias iptables='sudo iptables' iptables -t raw -N simple || iptables -t raw -F simple iptables -t raw -I simple -d 198.18.0.0/15 -j DROP iptables -t raw -D PREROUTING -j simple iptables -t raw -I PREROUTING -j simple On host: dragon --- Host dragon is the packet generator. * 2x CPU E5-2630 0 @ 2.30GHz Generator NIC: eth8 - ixgbe 10G netfilter_unload_modules.sh netfilter_unload_modules.sh sudo rmmod nf_reject_ipv4 nf_reject_ipv6 base_device_setup.sh eth8 #
Re: [PATCH linux-next 1/4] infiniband/ipoib: fix possible NULL pointer dereference in ipoib_get_iflink
On Wed, Apr 15, 2015 at 09:17:14AM +0300, Erez Shitrit wrote: > >>+ /* parent interface */ > >>+ if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) > >>+ return dev->ifindex; > >>+ > >>+ /* child/vlan interface */ > >>+ if (!priv->parent) > >>+ return -1; > >Like was said for other drivers, I can't see how parent can be null > >while IPOIB_FLAG_SUBINTERFACE is set. Drop the last if. > It can, at least for ipoib child interface (AKA "vlan"), you can't > control the call for that ndo and it can be called before the parent > was set. If the ndo can be called before the netdev private structures are fully prepared then we have another bug, and returning -1 or 0 is not the right answer anyhow. For safety, fold this into your patch. diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c index 9fad7b5ac8b9..e62b007adf5d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c @@ -58,6 +58,7 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv, /* MTU will be reset when mcast join happens */ priv->dev->mtu = IPOIB_UD_MTU(priv->max_ib_mtu); priv->mcast_mtu = priv->admin_mtu = priv->dev->mtu; + priv->parent = ppriv->dev; set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); result = ipoib_set_dev_features(priv, ppriv->ca); @@ -84,8 +85,6 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv, goto register_failed; } - priv->parent = ppriv->dev; - ipoib_create_debug_files(priv->dev); /* RTNL childs don't need proprietary sysfs entries */
Re: [PATCH net] bpf: fix verifier memory corruption
On 4/15/15 8:59 AM, Hannes Frederic Sowa wrote: On Di, 2015-04-14 at 15:57 -0700, Alexei Starovoitov wrote: Due to missing bounds check the DAG pass of the BPF verifier can corrupt the memory which can cause random crashes during program loading: [8.449451] BUG: unable to handle kernel paging request at [8.451293] IP: [] kmem_cache_alloc_trace+0x8d/0x2f0 [8.452329] Oops: [#1] SMP [8.452329] Call Trace: [8.452329] [] bpf_check+0x852/0x2000 [8.452329] [] bpf_prog_load+0x1e4/0x310 [8.452329] [] ? might_fault+0x5f/0xb0 [8.452329] [] SyS_bpf+0x806/0xa30 Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier") Signed-off-by: Alexei Starovoitov --- Many things need to align for this crash to be seen, yet I managed to hit it. In my case JA was last insn, 't' was 255 and explored_states array had 256 elements. I've double checked other similar paths and all seems clean. kernel/bpf/verifier.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index a28e09c7825d..36508e69e92a 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1380,7 +1380,8 @@ peek_stack: /* tell verifier to check for equivalent states * after every call and jump */ - env->explored_states[t + 1] = STATE_LIST_MARK; + if (t + 1 < insn_cnt) + env->explored_states[t + 1] = STATE_LIST_MARK; } else { /* conditional jump with two edges */ ret = push_insn(t, t + 1, FALLTHROUGH, env); Quick review: We have env->explored_states[t+1] access in the } else { /* conditional jump with two edges */ ret = push_insn(t, t + 1, FALLTHROUGH, env); if (ret == 1) goto peek_stack; else if (ret < 0) goto err_free; ret = push_insn(t, t + insns[t].off + 1, BRANCH, env); if (ret == 1) goto peek_stack; else if (ret < 0) goto err_free; } } else { push_insn call. At this point insn[t].off could be 0, no? 
insn[t].off can be anything, but the first thing that push_insn() checks is: if (w < 0 || w >= env->prog->len) only then it does: env->explored_states[w] = STATE_LIST_MARK; so we're good there. Though thanks for triple checking :)
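The guard Alexei points to is the classic bounds-check-before-write pattern. A minimal userspace mock (illustrative names, not the actual verifier code) shows why the range test must come first: an out-of-range target must produce an error before anything touches the array.

```c
#include <stddef.h>

/* Mock of the guard push_insn() applies: validate the target index
 * before writing STATE_LIST_MARK (represented here as 1) into the
 * explored_states array. */
static int mark_explored(int *explored_states, size_t len, long w)
{
	if (w < 0 || (size_t)w >= len)
		return -1;              /* out of range: nothing written */
	explored_states[w] = 1;         /* stands in for STATE_LIST_MARK */
	return 0;
}
```

The bug being fixed was exactly the other code path writing `explored_states[t + 1]` without this check, which for `t == insn_cnt - 1` lands one element past the end of the allocation.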
Re: [PATCH 16/17] get rid of the size argument of sock_sendmsg()
From: David Laight Date: Wed, 15 Apr 2015 08:37:01 + > From: David Miller >> Sent: 14 April 2015 18:56 >> > On Tue, Apr 14, 2015 at 04:25:24PM +, David Laight wrote: >> >> From: Al Viro >> >> > Sent: 11 April 2015 22:18 >> >> > it's equal to iov_iter_count(&msg->msg_iter) in all cases >> >> >> >> I don't know whether this is guaranteed for iov[] that come from >> >> userspace. >> >> >> >> In any case iov_iter_count() is non-trivial and you don't >> >> really want to call it when unnecessary. >> > >> > Really? >> > >> > static inline size_t iov_iter_count(struct iov_iter *i) >> > { >> > return i->count; >> > } >> >> This just made my day. > > I hate accessor functions I hate people who jump to conclusions and don't actually read the code in question before commenting. :-)
Re: [PATCH net] bpf: fix verifier memory corruption
On Di, 2015-04-14 at 15:57 -0700, Alexei Starovoitov wrote: > Due to missing bounds check the DAG pass of the BPF verifier can corrupt > the memory which can cause random crashes during program loading: > > [8.449451] BUG: unable to handle kernel paging request at > [8.451293] IP: [] kmem_cache_alloc_trace+0x8d/0x2f0 > [8.452329] Oops: [#1] SMP > [8.452329] Call Trace: > [8.452329] [] bpf_check+0x852/0x2000 > [8.452329] [] bpf_prog_load+0x1e4/0x310 > [8.452329] [] ? might_fault+0x5f/0xb0 > [8.452329] [] SyS_bpf+0x806/0xa30 > > Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier") > Signed-off-by: Alexei Starovoitov > --- > Many things need to align for this crash to be seen, yet I managed to hit it. > In my case JA was last insn, 't' was 255 and explored_states array > had 256 elements. I've double checked other similar paths and all seems clean. > > kernel/bpf/verifier.c |3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index a28e09c7825d..36508e69e92a 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -1380,7 +1380,8 @@ peek_stack: > /* tell verifier to check for equivalent states >* after every call and jump >*/ > - env->explored_states[t + 1] = STATE_LIST_MARK; > + if (t + 1 < insn_cnt) > + env->explored_states[t + 1] = STATE_LIST_MARK; > } else { > /* conditional jump with two edges */ > ret = push_insn(t, t + 1, FALLTHROUGH, env); Quick review: We have env->explored_states[t+1] access in the } else { /* conditional jump with two edges */ ret = push_insn(t, t + 1, FALLTHROUGH, env); if (ret == 1) goto peek_stack; else if (ret < 0) goto err_free; >>> ret = push_insn(t, t + insns[t].off + 1, BRANCH, env); if (ret == 1) goto peek_stack; else if (ret < 0) goto err_free; } } else { push_insn call. At this point insn[t].off could be 0, no? 
Thanks, Hannes
Re: ip_tunnel: Remove gratuitous skb scrubbing
On 15/04/2015 15:57, Herbert Xu wrote: On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote: [snip] Subject: skbuff: Do not scrub skb mark within the same name space The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels: Maybe add a Fixes tag? Fixes: ea23192e8e57 ("tunnels: harmonize cleanup done on skb on rx path") harmonize cleanup done on skb on rx path") broke anyone trying to use netfilter marking across IPv4 tunnels. While most of the fields that are cleared by skb_scrub_packet don't matter, the netfilter mark must be preserved. This patch rearranges skb_scurb_packet to preserve the mark field. nit: s/scurb/scrub Else it's fine for me.
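The save/restore idea behind the rearranged skb_scrub_packet() can be shown with a small userspace analog. The struct and the blanket memset below are illustrative only (the real function clears individual sk_buff fields, not the whole structure); the point is that the mark is saved before scrubbing and put back when the packet stays in the same namespace.

```c
#include <string.h>

/* Illustrative packet state, not the real sk_buff. */
struct pkt {
	unsigned int mark;      /* netfilter mark */
	unsigned int tc_index;
	unsigned int priority;
};

static void pkt_scrub(struct pkt *p, int xnet)
{
	unsigned int mark = p->mark;    /* save before clearing */

	memset(p, 0, sizeof(*p));
	if (!xnet)                      /* same namespace: mark survives */
		p->mark = mark;
}
```

Crossing a namespace boundary (xnet true) still clears the mark, which preserves the isolation property the original scrub was added for.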
[PATCH iproute2 -next] tc: built-in eBPF exec proxy
This work follows upon commit 6256f8c9e45f ("tc, bpf: finalize eBPF support for cls and act front-end") and takes up the idea proposed by Hannes Frederic Sowa to spawn a shell (or any other command) that holds generated eBPF map file descriptors. File descriptors, based on their id, are being fetched from the same unix domain socket as demonstrated in the bpf_agent, the shell spawned via execvpe(2) and the map fds passed over the environment, and thus are made available to applications in the fashion of std{in,out,err} for read/write access, for example in case of iproute2's examples/bpf/: # env | grep BPF BPF_NUM_MAPS=3 BPF_MAP1=6<- BPF_MAP_ID_QUEUE (id 1) BPF_MAP0=5<- BPF_MAP_ID_PROTO (id 0) BPF_MAP2=7<- BPF_MAP_ID_DROPS (id 2) # ls -la /proc/self/fd [...] lrwx--. 1 root root 64 Apr 14 16:46 0 -> /dev/pts/4 lrwx--. 1 root root 64 Apr 14 16:46 1 -> /dev/pts/4 lrwx--. 1 root root 64 Apr 14 16:46 2 -> /dev/pts/4 [...] lrwx--. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map lrwx--. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map lrwx--. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map The advantage (as opposed to the direct/native usage) is that now the shell is map fd owner and applications can terminate and easily reattach to descriptors w/o any kernel changes. Moreover, multiple applications can easily read/write eBPF maps simultaneously. To further allow users for experimenting with that, next step is to add a small helper that can get along with simple data types, so that also shell scripts can make use of bpf syscall, f.e to read/write into maps. Generally, this allows for prepopulating maps, or any runtime altering which could influence eBPF program behaviour (f.e. different run-time classifications, skb modifications, ...), dumping of statistics, etc. 
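The fd hand-off described above can be exercised from the consumer side roughly like this. The sketch mirrors the patch's bpf_map_set_env() but uses plain getenv() and adds minimal error handling; the BPF_NUM_MAPS bound check is an assumption of this sketch, not a guarantee of the proxy.

```c
#include <stdio.h>
#include <stdlib.h>

/* Recover the map fds the exec proxy exported via the environment.
 * Returns the number of fds stored into tfd[], or -1 on error. */
static int bpf_fds_from_env(int *tfd, int max)
{
	char key[64];
	char *val;
	int i, num;

	val = getenv("BPF_NUM_MAPS");
	if (!val)
		return -1;
	num = atoi(val);
	if (num < 0 || num > max)
		return -1;
	for (i = 0; i < num; i++) {
		snprintf(key, sizeof(key), "BPF_MAP%d", i);
		val = getenv(key);
		if (!val)
			return -1;
		tfd[i] = atoi(val);     /* fd number, e.g. 5, 6, 7 */
	}
	return num;
}
```

An application spawned under the proxy shell can call this at startup and then use the recovered fds with the bpf(2) lookup/update commands, which is exactly the reattach-without-kernel-changes property the commit message advertises.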
Reference: http://thread.gmane.org/gmane.linux.network/357471/focus=357860 Suggested-by: Hannes Frederic Sowa Signed-off-by: Daniel Borkmann --- ( Stephen, this applies to current net-next branch of iproute2. ) examples/bpf/bpf_agent.c | 55 ++ examples/bpf/bpf_prog.c | 27 + tc/Makefile | 7 ++- tc/e_bpf.c | 147 +++ tc/f_bpf.c | 2 +- tc/m_action.c| 2 +- tc/m_bpf.c | 2 +- tc/tc.c | 9 +-- tc/tc_bpf.c | 98 --- tc/tc_bpf.h | 15 - tc/tc_common.h | 3 + tc/tc_exec.c | 109 +++ tc/tc_filter.c | 2 +- tc/tc_util.h | 16 -- 14 files changed, 454 insertions(+), 40 deletions(-) create mode 100644 tc/e_bpf.c create mode 100644 tc/tc_exec.c diff --git a/examples/bpf/bpf_agent.c b/examples/bpf/bpf_agent.c index 0f481b1..df09e0f 100644 --- a/examples/bpf/bpf_agent.c +++ b/examples/bpf/bpf_agent.c @@ -24,6 +24,8 @@ * -- Happy eBPF hacking! ;) */ +#define _GNU_SOURCE + #include #include #include @@ -31,6 +33,7 @@ #include #include #include + #include #include #include @@ -102,6 +105,23 @@ static void bpf_dump_proto(int fd) printf("\n"); } +static void bpf_dump_map_data(int *tfd) +{ + int i; + + for (i = 0; i < 30; i++) { + const int period = 5; + + printf("data, period: %dsec\n", period); + + bpf_dump_drops(tfd[BPF_MAP_ID_DROPS]); + bpf_dump_queue(tfd[BPF_MAP_ID_QUEUE]); + bpf_dump_proto(tfd[BPF_MAP_ID_PROTO]); + + sleep(period); + } +} + static void bpf_info_loop(int *fds, struct bpf_map_aux *aux) { int i, tfd[BPF_MAP_ID_MAX]; @@ -122,16 +142,22 @@ static void bpf_info_loop(int *fds, struct bpf_map_aux *aux) tfd[aux->ent[i].id] = fds[i]; } - for (i = 0; i < 30; i++) { - int period = 5; + bpf_dump_map_data(tfd); +} - printf("data, period: %dsec\n", period); +static void bpf_map_set_env(int *tfd) +{ + char key[64], *val; + int i; - bpf_dump_drops(tfd[BPF_MAP_ID_DROPS]); - bpf_dump_queue(tfd[BPF_MAP_ID_QUEUE]); - bpf_dump_proto(tfd[BPF_MAP_ID_PROTO]); + for (i = 0; i < BPF_MAP_ID_MAX; i++) { + memset(key, 0, sizeof(key)); + snprintf(key, sizeof(key), "BPF_MAP%d", i); - sleep(period); + val 
= secure_getenv(key); + assert(val != NULL); + + tfd[i] = atoi(val); } } @@ -186,9 +212,17 @@ int main(int argc, char **argv) struct sockaddr_un addr; int fd, ret, i; - if (argc < 2) { - fprintf(stderr, "Usage: %s \n", argv[0]); - exit(1); + /* When arguments are being passed, we take it as a path +* to a Unix domain socket, otherwise we gr
Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
On Wed, 2015-04-15 at 15:19 +0100, George Dunlap wrote: > On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley [...] > > From a networking point of view, the backend is a switch. Is it OK to > > consider the packet to have been transmitted from the guest point of > > view once the backend is aware of the packet? > > > > This would help justify the skb_orphan() in the frontend. > > This sounds sensible to me, particularly if virtio_net is already doing it. I also find Malcolm's argument above pretty compelling. Ian.
Re: [PATCH net-next 0/6] net: hip04: fix some problem for hip04 drivers
On Wednesday 15 April 2015 20:30:02 Ding Tianhong wrote: > The patches series was used to fix the issues of the hip04 driver, and added > some optimizations according to some good suggestion. > > Thanks, that looks much better, except for patch 4 that I commented on. I had at some point sent a patch that is archived at http://lists.infradead.org/pipermail/linux-arm-kernel/2015-January/318120.html I believe that one is also still needed, but I have not tested whether it is correct. Can you have a look at the patch from back then and see if it works, of if you find something wrong about it? I'm sending the unmodified patch from then here again for you to apply or comment. It will have to be rebased on top of your current changes. Arnd 8< Subject: [PATCH] net/hip04: refactor interrupt masking The hip04 ethernet driver currently acknowledges all interrupts directly in the interrupt handler, and leaves all interrupts except the RX data enabled the whole time. This causes multiple problems: - When more packets come in between the original interrupt and the NAPI poll function, we will get an extraneous RX interrupt as soon as interrupts are enabled again. - The two error interrupts are by definition combining all errors that may have happened since the last time they were handled, but just acknowledging the irq without dealing with the cause of the condition makes it come back immediately. In particular, when NAPI is intentionally stalling the rx queue, this causes a storm of "rx dropped" messages. - The access to the 'reg_inten' field in hip_priv is used for serializing access, but is in fact racy itself. To deal with these issues, the driver is changed to only acknowledge the IRQ at the point when it is being handled. The RX interrupts now get acked right before reading the number of received frames to keep spurious interrupts to the absolute minimum without losing interrupts. 
As there is no tx-complete interrupt, only an informational tx-dropped one, we now ack that in the tx reclaim handler, hoping that clearing the descriptors has also eliminated the error condition. The only place that reads the reg_inten now just relies on napi_schedule_prep() to return whether NAPI is active or not, and the poll() function is then going to ack and reenabled the IRQ if necessary. Finally, when we disable interrupts, they are now all disabled together, in order to simplify the logic and to avoid getting rx-dropped interrupts when NAPI is in control of the rx queue. Signed-off-by: Arnd Bergmann diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c index 525214ef5984..83773247a368 100644 --- a/drivers/net/ethernet/hisilicon/hip04_eth.c +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c @@ -56,6 +56,8 @@ #define RCV_DROP BIT(7) #define TX_DROPBIT(6) #define DEF_INT_ERR(RCV_NOBUF | RCV_DROP | TX_DROP) +#define DEF_INT_RX (RCV_INT | RCV_NOBUF | RCV_DROP) +#define DEF_INT_TX (TX_DROP) #define DEF_INT_MASK (RCV_INT | DEF_INT_ERR) /* TX descriptor config */ @@ -154,7 +156,6 @@ struct hip04_priv { unsigned int port; unsigned int speed; unsigned int duplex; - unsigned int reg_inten; struct napi_struct napi; struct net_device *ndev; @@ -303,17 +304,15 @@ static void hip04_mac_enable(struct net_device *ndev) val |= GE_RX_PORT_EN | GE_TX_PORT_EN; writel_relaxed(val, priv->base + GE_PORT_EN); - /* clear rx int */ - val = RCV_INT; - writel_relaxed(val, priv->base + PPE_RINT); + /* clear prior interrupts */ + writel_relaxed(DEF_INT_MASK, priv->base + PPE_RINT); /* config recv int */ val = GE_RX_INT_THRESHOLD | GE_RX_TIMEOUT; writel_relaxed(val, priv->base + PPE_CFG_RX_PKT_INT); /* enable interrupt */ - priv->reg_inten = DEF_INT_MASK; - writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN); + writel_relaxed(DEF_INT_MASK, priv->base + PPE_INTEN); } static void hip04_mac_disable(struct net_device *ndev) @@ -322,8 +321,7 @@ 
static void hip04_mac_disable(struct net_device *ndev) u32 val; /* disable int */ - priv->reg_inten &= ~(DEF_INT_MASK); - writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN); + writel_relaxed(0, priv->base + PPE_INTEN); /* disable tx & rx */ val = readl_relaxed(priv->base + GE_PORT_EN); @@ -403,6 +401,8 @@ static int hip04_tx_reclaim(struct net_device *ndev, bool force) priv->tx_tail = tx_tail; smp_wmb(); /* Ensure tx_tail visible to xmit */ + writel_relaxed(DEF_INT_TX, priv->base + PPE_RINT); + out: if (pkts_compl || bytes_compl) netdev_completed_queue(ndev, pkts_compl, bytes_compl); @@ -458,9 +458,7 @@ static int hip04_mac_start_xmit(
Re: [PATCH net-next 6/6] net: hip04: add ratelimit for rx/tx drops skb
On Wednesday 15 April 2015 20:30:08 Ding Tianhong wrote: > There can be quite a lot of rx/tx drops message and affect > useful message, so need to ratelimit them to not overwhelm > logging. > > Signed-off-by: Ding Tianhong > Cc: "David S. Miller" > Cc: Eric Dumazet > Cc: Arnd Bergmann > Cc: Zhangfei Gao > Cc: Dan Carpenter > Cc: Joe Perches > Acked-by: Arnd Bergmann
Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley wrote: >> >> But the main concern here is it basically breaks back pressure. >> >> And you do not want this, unless there is no other choice. >> > > virtio_net already use's skb_orphan() in it's transmit path. It seems > only fair that other virtual network drivers behave in the same way. > > There are no easy solutions to decrease the transmit latency for > netback/netfront. We map the guest memory through to the backend to > avoid memory copies. The frontend memory can only be freed once the > network driver has completed transmitting the packet in the backend. > > Modern network drivers can be quite slow at freeing the skb's once > transmitted (the packet is already on the wire as far as they are > concerned) and this delay is compounded by needing the signal the > completion of the transmit back to the frontend (by IPI in worst case). > > From a networking point of view, the backend is a switch. Is it OK to > consider the packet to have been transmitted from the guest point of > view once the backend is aware of the packet? > > This would help justify the skb_orphan() in the frontend. This sounds sensible to me, particularly if virtio_net is already doing it. -George
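Why skb_orphan() affects back pressure can be seen in a toy model (illustrative types, not the kernel's): orphaning uncharges the packet from the sending socket's write-memory budget and drops the ownership pointer, so the socket is free to queue more data even though this packet is still in flight through the backend.

```c
#include <stddef.h>

struct toy_sock { int wmem_alloc; };                    /* bytes charged */
struct toy_skb  { struct toy_sock *sk; int truesize; };

static void toy_skb_orphan(struct toy_skb *skb)
{
	if (!skb->sk)
		return;                         /* already orphaned */
	skb->sk->wmem_alloc -= skb->truesize;   /* destructor: uncharge */
	skb->sk = NULL;                         /* no owner any more */
}
```

In the real kernel the uncharge happens via skb->destructor (typically sock_wfree); the model just makes explicit that after orphaning, nothing ties the in-flight packet back to the sender's sk_wmem_alloc, which is the "breaks back pressure" concern quoted above.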
Re: [PATCH net-next 4/6] net: hip04: Don't free the tx skb when the buffer is not ready for xmit
On Wednesday 15 April 2015 20:30:06 Ding Tianhong wrote: > @@ -489,6 +487,8 @@ static int hip04_mac_start_xmit(struct sk_buff *skb, > struct net_device *ndev) > > /* Ensure tx_head update visible to tx reclaim */ > smp_wmb(); > + count++; > + priv->tx_head = TX_NEXT(tx_head); > > Something is wrong here, the comment does not match the code any more, because the smp_wmb() won't actually make the tx_head update visible. Arnd
Re: [PATCH net-next 3/6] net: hip04: Solve the problem of the skb memory allocation failure
On Wednesday 15 April 2015 20:30:05 Ding Tianhong wrote: > The driver will alloc some skb buffer for hareware queue, but without > considering the case of memory allocation failure, when memory is low, > the skb may be null and panic the system, so break the loop when skb is > null and try to alloc the memory again to fix this problem. > > Signed-off-by: Ding Tianhong > Cc: "David S. Miller" > Cc: Eric Dumazet > Cc: Arnd Bergmann > Cc: Zhangfei Gao > Cc: Dan Carpenter > Cc: Joe Perches > Looks reasonable to me, though I haven't done a deep analysis to see if there are remaining problems in this area. Arnd
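The shape of the fix described in the quoted commit message can be sketched in userspace: break out of the refill loop when an allocation fails instead of installing a NULL buffer, and report how many slots were actually filled so the caller can retry later. alloc_buf() is a stand-in for the skb allocator and fail_after simulates memory pressure; names are illustrative, not the hip04 code.

```c
#include <stdlib.h>

static void *alloc_buf(int fail_after, int i)
{
	return i < fail_after ? malloc(64) : NULL;
}

static int refill_ring(void **ring, int n, int fail_after)
{
	int i;

	for (i = 0; i < n; i++) {
		void *buf = alloc_buf(fail_after, i);

		if (!buf)
			break;          /* retry later; never queue NULL */
		ring[i] = buf;
	}
	return i;                       /* slots actually filled */
}
```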
Re: [PATCH net-next 5/6] net: hip04: remove the unnecessary check
On Wednesday 15 April 2015 20:30:07 Ding Tianhong wrote: > diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c > b/drivers/net/ethernet/hisilicon/hip04_eth.c > index ff9d19c..a7ab1d9 100644 > --- a/drivers/net/ethernet/hisilicon/hip04_eth.c > +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c > @@ -440,7 +440,7 @@ static int hip04_tx_reclaim(struct net_device *ndev, bool > force) > smp_wmb(); /* Ensure tx_tail visible to xmit */ > > out: > - if (pkts_compl || bytes_compl) > + if (bytes_compl) > netdev_completed_queue(ndev, pkts_compl, bytes_compl); > > if (unlikely(netif_queue_stopped(ndev)) && (count < (TX_DESC_NUM - > 1))) > -- > Acked-by: Arnd Bergmann
Re: [PATCH net-next 2/6] net: hip04: use the big endian for tx and rx desc
On Wednesday 15 April 2015 20:30:04 Ding Tianhong wrote: > The hip04 ethernet uses big endian for tx and rx, so set desc to > big endian and remove the unused next_addr. I don't understand: > diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c > b/drivers/net/ethernet/hisilicon/hip04_eth.c > index b0a7f03..6473462 100644 > --- a/drivers/net/ethernet/hisilicon/hip04_eth.c > +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c > @@ -132,19 +132,18 @@ > #define HIP04_MIN_TX_COALESCE_FRAMES 1 > > struct tx_desc { > - u32 send_addr; > - u32 send_size; > - u32 next_addr; > - u32 cfg; > - u32 wb_addr; > + __be32 send_addr; > + __be32 send_size; > + __be32 cfg; > + __be32 wb_addr; > } __aligned(64); I would think this is a hardware structure, does this not break access to the cfg and wb_addr fields if you remove next_addr? Arnd
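Arnd's point is that the descriptor describes device-visible memory, so deleting a field shifts every later field's offset. A quick `offsetof` check makes this concrete (the two structs below are userspace stand-ins using `uint32_t` instead of the kernel's `u32`/`__be32`, and the 64-byte alignment is omitted):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original hardware layout: the device expects cfg at byte 12. */
struct tx_desc_v1 {
    uint32_t send_addr;
    uint32_t send_size;
    uint32_t next_addr;
    uint32_t cfg;
    uint32_t wb_addr;
};

/* Layout with next_addr removed: cfg and wb_addr silently move up,
 * so the device would read the wrong words. */
struct tx_desc_v2 {
    uint32_t send_addr;
    uint32_t send_size;
    uint32_t cfg;
    uint32_t wb_addr;
};
```

If the field is unused by software but still part of the hardware layout, the usual fix is to keep it as a named reserved/padding member rather than delete it.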
Re: [PATCH net-next 1/6] net: hip04: Set more appropriate value for tx coalesce num
On Wednesday 15 April 2015 20:30:03 Ding Tianhong wrote: > According to Arnd's suggestion, the user might set the tx coalesce value > in a large range for different workloads, so we should define them to > appropriate values. > > Signed-off-by: Ding Tianhong > Cc: "David S. Miller" > Cc: Eric Dumazet > Cc: Arnd Bergmann > Cc: Zhangfei Gao > Cc: Dan Carpenter > Cc: Joe Perches > Acked-by: Arnd Bergmann
RE: [PATCH v2 net] bnx2x: Fix busy_poll vs netpoll
> > > > BTW, this looks quite generic - isn't it possible to take it out of > > the driver and push it into the networking core, uniforming it in the > > process? > > Yes, this is the plan I have in mind, once net-next is opened again ;) > > Thanks ! > I'm not familiar with any real strong test-suite for the busy poll, but I did try running all kinds of netperf connections with various busy_{poll,read} values and performance seemed reasonable, and verified that netconsole manages to work. So for all it's worth, Tested-by: Yuval Mintz
Re: ip_tunnel: Remove gratuitous skb scrubbing
On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote: > > Yes this is better. I'm currently auditing all the other bits > that are cleared to see if there is anything else that we should > preserve for tunneling. OK the only other thing that we may wish to preserve is secmark. James, can you confirm whether secmark should be preserved or cleared for tunnels within the same name space? Up until December 2014 it was preserved but since then it has been cleared. For the mark here is my final tested patch. ---8<--- Subject: skbuff: Do not scrub skb mark within the same name space The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels: harmonize cleanup done on skb on rx path") broke anyone trying to use netfilter marking across IPv4 tunnels. While most of the fields that are cleared by skb_scrub_packet don't matter, the netfilter mark must be preserved. This patch rearranges skb_scrub_packet to preserve the mark field. Signed-off-by: Herbert Xu diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3b6e583..a185427 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4124,19 +4124,22 @@ EXPORT_SYMBOL(skb_try_coalesce); */ void skb_scrub_packet(struct sk_buff *skb, bool xnet) { - if (xnet) - skb_orphan(skb); skb->tstamp.tv64 = 0; skb->pkt_type = PACKET_HOST; skb->skb_iif = 0; skb->ignore_df = 0; skb_dst_drop(skb); - skb->mark = 0; skb_sender_cpu_clear(skb); skb_init_secmark(skb); secpath_reset(skb); nf_reset(skb); nf_reset_trace(skb); + + if (!xnet) + return; + + skb_orphan(skb); + skb->mark = 0; } EXPORT_SYMBOL_GPL(skb_scrub_packet); -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Mon, Apr 13, 2015 at 2:49 PM, Eric Dumazet wrote: > On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote: > >> Is the problem perhaps that netback/netfront delays TX completion? >> Would it be better to see if that can be addressed properly, so that >> the original purpose of the patch (fighting bufferbloat) can be >> achieved while not degrading performance for Xen? Or at least, so >> that people get decent performance out of the box without having to >> tweak TCP parameters? > > Sure, please provide a patch, that does not break back pressure. > > But just in case, if Xen performance relied on bufferbloat, it might be > very difficult to reach a stable equilibrium : Any small change in stack > or scheduling might introduce a significant difference in 'raw > performance'. So help me understand this a little bit here. tcp_limit_output_bytes limits the amount of data allowed to be "in-transit" between a send() and the wire, is that right? And so the "bufferbloat" problem you're talking about here are TCP buffers inside the kernel, and/or buffers in the NIC, is that right? So ideally, you want this to be large enough to fill the "pipeline" all the way from send() down to actually getting out on the wire; otherwise, you'll have gaps in the pipeline, and the machinery won't be working at full throttle. And the reason it's a problem is that many NICs now come with large send buffers; and effectively what happens then is that this makes the "pipeline" longer -- as the buffer fills up, the time between send() and the wire is increased. This increased latency causes delays in round-trip-times and interferes with the mechanisms TCP uses to try to determine what the actual sustainable rate of data transmission is.
By limiting the number of "in-transit" bytes, you make sure that neither the kernel nor the NIC are going to have packets queued up for long lengths of time in buffers, and you keep this "pipeline" as close to the actual minimal length of the pipeline as possible. And it sounds like for your 40G NIC, 128k is big enough to fill the pipeline without unduly making it longer by introducing buffering. Is that an accurate picture of what you're trying to achieve? But the problem for xennet (and a number of other drivers), as I understand it, is that at the moment the "pipeline" itself is just longer -- it just takes a longer time from the time you send a packet to the time it actually gets out on the wire. So it's not actually accurate to say that "Xen performance relies on bufferbloat". There's no buffering involved -- the pipeline is just longer, and so to fill up the pipeline you need more data. Basically, to maximize throughput while minimizing buffering, for *any* connection, tcp_limit_output_bytes should ideally be around (min_tx_latency * max_bandwidth). For physical NICs, the minimum latency is really small, but for xennet -- and I'm guessing for a lot of virtualized cards -- the min_tx_latency will be a lot higher, requiring a much higher ideal tcp_limit_output value. Rather than trying to pick a single value which will be good for all NICs, it seems like it would make more sense to have this vary depending on the parameters of the NIC. After all, for NICs that have low throughput -- say, old 100MiB NICs -- even 128k may still introduce a significant amount of buffering. Obviously one solution would be to allow the drivers themselves to set the tcp_limit_output_bytes, but that seems like a maintenance nightmare. Another simple solution would be to allow drivers to indicate whether they have a high transmit latency, and have the kernel use a higher value by default when that's the case.
Probably the most sustainable solution would be to have the networking layer keep track of the average and minimum transmit latencies, and automatically adjust tcp_limit_output_bytes based on that. (Keeping the minimum as well as the average because the whole problem with bufferbloat is that the more data you give it, the longer the apparent "pipeline" becomes.) Thoughts? -George
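George's rule of thumb, ideal limit ≈ min_tx_latency * max_bandwidth, is easy to check numerically. The helper below is purely illustrative (not a kernel API); latency is in microseconds, bandwidth in bits per second, and the result in bytes:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper for the bandwidth-delay estimate discussed
 * above: bytes in flight needed to keep the pipeline full. */
static uint64_t ideal_limit_bytes(uint64_t min_tx_latency_us,
                                  uint64_t max_bandwidth_bps)
{
    /* bytes/sec * seconds of latency */
    return max_bandwidth_bps / 8 * min_tx_latency_us / 1000000;
}
```

With an assumed 25 us minimum latency on a 40 Gbit/s NIC this gives 125000 bytes, in the same ballpark as the 128k value mentioned above; a virtual NIC with ten times the latency would want roughly ten times the limit at the same rate.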
Re: [PATCH] neighbour.c: Avoid GC directly after state change
No reply on this. In net/core/neighbour.c:neigh_timer_handler I see: if (state & NUD_REACHABLE) { if (time_before_eq(now, neigh->confirmed + neigh->parms->reachable_time)) { neigh_dbg(2, "neigh %p is still alive\n", neigh); next = neigh->confirmed + neigh->parms->reachable_time; } else if (time_before_eq(now, neigh->used + NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME))) { neigh_dbg(2, "neigh %p is delayed\n", neigh); neigh->nud_state = NUD_DELAY; neigh->updated = jiffies; neigh_suspect(neigh); next = now + NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME); } else { neigh_dbg(2, "neigh %p is suspected\n", neigh); neigh->nud_state = NUD_STALE; neigh->updated = jiffies; neigh_suspect(neigh); notify = 1; } } else ... Why is the second test there in the first place? In the normal case, "neigh->used" does not get updated until the entry is found STALE in the periodic work. Why not use "neigh->confirmed" and another sysctl variable? if (state & NUD_REACHABLE) { if (time_before_eq(now, neigh->confirmed + neigh->parms->reachable_time)) { neigh_dbg(2, "neigh %p is still alive\n", neigh); next = neigh->confirmed + neigh->parms->reachable_time; } else if (time_before_eq(now, - neigh->used + - NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME))) { + neigh->confirmed + + NEIGH_VAR(neigh->parms, DELAY_REPROBE_TIME))) { neigh_dbg(2, "neigh %p is delayed\n", neigh); neigh->nud_state = NUD_DELAY; neigh->updated = jiffies; neigh_suspect(neigh); next = now + NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME); } else { neigh_dbg(2, "neigh %p is suspected\n", neigh); neigh->nud_state = NUD_STALE; neigh->updated = jiffies; neigh_suspect(neigh); notify = 1; } } else ... if DELAY_REPROBE_TIME is larger than "reachable_time", then the kernel will send out ARP probes when it is about to lose communication with a remote machine. This is what we need. If it is smaller, then it will go from REACHABLE to STALE. The initial value of DELAY_REPROBE_TIME needs to be settable in Kconfig to allow the selection of functionality.
I am told that setting stuff using sysctl has a performance penalty, when interfaces are dynamically created and deleted in hundreds. Best Regards, Ulf Samuelsson On 04/10/2015 10:26 AM, Ulf Samuelsson wrote: On 03/12/2015 07:26 PM, David Miller wrote: I hate changes like this. By making this a Kconfig option it cannot be dynamic, and every distribution is going to have to scratch their head and decide what to set this to. That's seriously suboptimal. If you want to change behavior based upon whether userspace is managing the damn table, make it so the user doing so has to ask for the new behavior at _RUN TIME_ via the netlink interface or similar. Picking the guard time itself at compile time is also undesirable. And you don't even want a damn timer, what you want is for the state of the entry to be frozen and for the user to "release" it by either adjusting the state to something else or marking in some other way to allow it to be unfrozen and released again. Why put it to chance with some timeout? Make things explicit. The desired functionality is that if communication stops, you want to send out ARP probes, before the entry is deleted. The current (pseudo) code of the neigh timer is: if (state & NUD_REACHABLE) { if (now <= "confirmed + "reachable_time")) { ... /* We are OK */ } else if (now < "used" + DELAY_PROBE_TIME) {/* Never happens */ state = NUD_DELAY; } else { state = NUD_STALE; notify = 1; } We never see the state being changed from REACHABLE to DELAY, so the probes are not being sent out, instead you always go from REACHABLE to STALE. DELAY_PROBE_TIME is set to (5 x HZ) and "used" seems to be only set by the periodic_work routine when the neigh entry is in STALE state, and then it is too late. It is also set by "arp_find" which is used by "broken" devices. In practice, the second condition: "(now < "used" + DELAY_PROBE_TIME)" is never used. What is the intention of this test?
By adding a new test + parameter, we would get the desired functionality, and no need to listen for notifications or doing ARP state updates from applications. if (now <= "confirmed + "reachable_time")) { ... /* We are OK */ + else if (now <= "confirmed + "reprobe_time")) { + state = NUD_DELAY; } else if (now < "used
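Ulf's proposed REACHABLE-state decision can be sketched as a self-contained function (times are plain tick counts standing in for jiffies, and DELAY_REPROBE_TIME is the proposed, hypothetical knob keyed off "confirmed" rather than "used"):

```c
#include <assert.h>

enum nud_state { NUD_REACHABLE, NUD_DELAY, NUD_STALE };

/* Userspace sketch of the proposed timer logic: stay REACHABLE within
 * reachable_time of the last confirmation, then go to DELAY (which
 * triggers unicast ARP probes) within the reprobe window, and only
 * after that fall back to STALE. */
static enum nud_state next_state(unsigned long now,
                                 unsigned long confirmed,
                                 unsigned long reachable_time,
                                 unsigned long delay_reprobe_time)
{
    if (now <= confirmed + reachable_time)
        return NUD_REACHABLE;            /* still alive */
    if (now <= confirmed + delay_reprobe_time)
        return NUD_DELAY;                /* probe before giving up */
    return NUD_STALE;
}
```

With delay_reprobe_time larger than reachable_time the entry goes REACHABLE to DELAY and probes are sent; with it smaller, the entry goes straight to STALE, which matches the behavioural choice described above.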
Re: [PATCH 1/7] net: refactor __netif_receive_skb_core
On 04/15/2015 05:44 AM, David Laight wrote: From: Alexander Duyck Sent: 10 April 2015 20:56 On 04/10/2015 05:15 AM, Pablo Neira Ayuso wrote: +another_round: + ret = __netif_receive_skb_ingress(skb, pfmemalloc, orig_dev); + switch (ret) { + case NET_RX_SUCCESS: + case NET_RX_DROP: + break; + case __NET_RX_ANOTHER_ROUND: + goto another_round; + } rcu_read_unlock(); + return ret; } Couldn't this just be done as a do while? It would probably be easier to read and there wouldn't be any need for the another_round label anymore. Or an infinite loop with a break at the bottom, as in: for (;;) { switch (...) { case again: continue; default: break; } break; } David That is even more complicated. What I was thinking was do { ret = __netif_receive_skb_ingress(skb, pfmemalloc, orig_dev); } while (ret == __NET_RX_ANOTHER_ROUND); Either that or the switch could just be replaced with an if statement since the only case that really goes anywhere is __NET_RX_ANOTHER_ROUND and everything else just exits anyway. I had just suggested a do/while since that lets the goto be dropped, but an if would allow for avoiding any unnecessary indentation on the call to __netif_receive_skb_ingress. - Alex
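Alexander's do/while shape can be shown end to end with a stub in place of __netif_receive_skb_ingress() (the stub, its return values, and the constant value of __NET_RX_ANOTHER_ROUND are all illustrative, not the kernel's):

```c
#include <assert.h>

#define NET_RX_SUCCESS          0
#define NET_RX_DROP             1
#define __NET_RX_ANOTHER_ROUND  2   /* value is illustrative */

/* Stub standing in for __netif_receive_skb_ingress(): asks for two
 * extra rounds, then succeeds. */
static int rounds_left = 2;
static int fake_ingress(void)
{
    return rounds_left-- > 0 ? __NET_RX_ANOTHER_ROUND : NET_RX_SUCCESS;
}

/* The suggested refactor: loop directly on the return value, so no
 * another_round label, no goto, no switch. */
static int receive_loop(void)
{
    int ret;

    do {
        ret = fake_ingress();
    } while (ret == __NET_RX_ANOTHER_ROUND);

    return ret;
}
```

Any terminal value (NET_RX_SUCCESS or NET_RX_DROP) falls out of the loop and is returned unchanged, which is exactly what the original switch did with more machinery.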
Re: [PATCH net-next v2 1/4] cxgb4/iw_cxgb4/cxgb4i: remove duplicate definitions
On Mon, Apr 13, 2015 at 10:38:08AM -0600, Jason Gunthorpe wrote: > On Mon, Apr 13, 2015 at 07:34:23PM +0530, Varun Prakash wrote: > > define struct ulptx_idata in common header file t4_msg.h > > to remove duplicate definitions. > > The Infiniband side of this patch looks OK. > > Reviewed-By: Jason Gunthorpe > > Just some random thoughts on the other patches: > - Try and use 'if (IS_ENABLED(CONFIG_XX))' over #ifdef > to improve compile test coverage. This would drop a fair number > of ifdefs. FCoE specific structures and functions are defined only if CONFIG_CHELSIO_T4_FCOE is enabled. #ifdef CONFIG_CHELSIO_T4_FCOE void cxgb_fcoe_init_ddp(struct adapter *adap) { ... } #endif If CONFIG_CHELSIO_T4_FCOE is disabled then the following code will result in the build error "implicit declaration of function cxgb_fcoe_init_ddp": if (IS_ENABLED(CONFIG_CHELSIO_T4_FCOE)) cxgb_fcoe_init_ddp(adap); > - Some of the commit messages are short, or non-existent (ie #4) > - Generally, no need for 'static inline' in a .c file, the compiler knows > what to do. > > Regards, > Jason > Thanks
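The build-error concern is usually answered with a stub header: when the option is off, declare an empty static inline so the `if (IS_ENABLED(...))` call still compiles and the dead call is optimized away. The sketch below is a userspace imitation of that pattern, not the real Kconfig machinery (FCOE_ENABLED stands in for IS_ENABLED(CONFIG_CHELSIO_T4_FCOE)):

```c
#include <assert.h>

struct adapter { int ddp_ready; };   /* illustrative stand-in */

#ifdef CONFIG_CHELSIO_T4_FCOE
#define FCOE_ENABLED 1
/* Real implementation, built only when the option is on. */
static void cxgb_fcoe_init_ddp(struct adapter *adap) { adap->ddp_ready = 1; }
#else
#define FCOE_ENABLED 0
/* No-op stub: keeps the caller compiling when the option is off. */
static inline void cxgb_fcoe_init_ddp(struct adapter *adap) { (void)adap; }
#endif

static void adap_init(struct adapter *adap)
{
    adap->ddp_ready = 0;
    if (FCOE_ENABLED)        /* stands in for IS_ENABLED(CONFIG_...) */
        cxgb_fcoe_init_ddp(adap);
}
```

Both branches of adap_init() get compile-tested regardless of the configuration, which is the coverage benefit Jason is pointing at.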
RE: [PATCH 1/7] net: refactor __netif_receive_skb_core
From: Alexander Duyck > Sent: 10 April 2015 20:56 > On 04/10/2015 05:15 AM, Pablo Neira Ayuso wrote: > > +another_round: > > + ret = __netif_receive_skb_ingress(skb, pfmemalloc, orig_dev); > > + switch (ret) { > > + case NET_RX_SUCCESS: > > + case NET_RX_DROP: > > + break; > > + case __NET_RX_ANOTHER_ROUND: > > + goto another_round; > > + } > > rcu_read_unlock(); > > + > > return ret; > > } > > > > > > Couldn't this just be done as a do while? It would probably be easier > to read and there wouldn't be any need for the another_round label anymore. Or an infinite loop with a break at the bottom, as in: for (;;) { switch (...) { case again: continue; default: break; } break; } David
[PATCH net-next 1/6] net: hip04: Set more appropriate value for tx coalesce num
According to Arnd's suggestion, the user might set the tx coalesce value in a large range for different workloads, so we should define them to appropriate values. Signed-off-by: Ding Tianhong Cc: "David S. Miller" Cc: Eric Dumazet Cc: Arnd Bergmann Cc: Zhangfei Gao Cc: Dan Carpenter Cc: Joe Perches --- drivers/net/ethernet/hisilicon/hip04_eth.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c index b72d238..b0a7f03 100644 --- a/drivers/net/ethernet/hisilicon/hip04_eth.c +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c @@ -126,10 +126,10 @@ #define DRV_NAME "hip04-ether" #define DRV_VERSION "v1.0" -#define HIP04_MAX_TX_COALESCE_USECS 200 -#define HIP04_MIN_TX_COALESCE_USECS 100 -#define HIP04_MAX_TX_COALESCE_FRAMES 200 -#define HIP04_MIN_TX_COALESCE_FRAMES 100 +#define HIP04_MAX_TX_COALESCE_USECS 10 +#define HIP04_MIN_TX_COALESCE_USECS 1 +#define HIP04_MAX_TX_COALESCE_FRAMES (TX_DESC_NUM - 1) +#define HIP04_MIN_TX_COALESCE_FRAMES 1 struct tx_desc { u32 send_addr; -- 1.8.0
[PATCH net-next 2/6] net: hip04: use the big endian for tx and rx desc
The hip04 ethernet uses big endian for tx and rx, so set desc to big endian and remove the unused next_addr. Signed-off-by: Ding Tianhong Cc: "David S. Miller" Cc: Eric Dumazet Cc: Arnd Bergmann Cc: Zhangfei Gao Cc: Dan Carpenter Cc: Joe Perches --- drivers/net/ethernet/hisilicon/hip04_eth.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c index b0a7f03..6473462 100644 --- a/drivers/net/ethernet/hisilicon/hip04_eth.c +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c @@ -132,19 +132,18 @@ #define HIP04_MIN_TX_COALESCE_FRAMES 1 struct tx_desc { - u32 send_addr; - u32 send_size; - u32 next_addr; - u32 cfg; - u32 wb_addr; + __be32 send_addr; + __be32 send_size; + __be32 cfg; + __be32 wb_addr; } __aligned(64); struct rx_desc { - u16 reserved_16; - u16 pkt_len; - u32 reserve1[3]; - u32 pkt_err; - u32 reserve2[4]; + __be16 reserved_16; + __be16 pkt_len; + __be32 reserve1[3]; + __be32 pkt_err; + __be32 reserve2[4]; }; struct hip04_priv { -- 1.8.0
[PATCH net-next 3/6] net: hip04: Solve the problem of the skb memory allocation failure
The driver will alloc some skb buffer for hardware queue, but without considering the case of memory allocation failure, when memory is low, the skb may be null and panic the system, so break the loop when skb is null and try to alloc the memory again to fix this problem. Signed-off-by: Ding Tianhong Cc: "David S. Miller" Cc: Eric Dumazet Cc: Arnd Bergmann Cc: Zhangfei Gao Cc: Dan Carpenter Cc: Joe Perches --- drivers/net/ethernet/hisilicon/hip04_eth.c | 68 -- 1 file changed, 54 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c index 6473462..7533858 100644 --- a/drivers/net/ethernet/hisilicon/hip04_eth.c +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c @@ -131,6 +131,8 @@ #define HIP04_MAX_TX_COALESCE_FRAMES (TX_DESC_NUM - 1) #define HIP04_MIN_TX_COALESCE_FRAMES 1 +#define HIP04_RX_BUFFER_WRITE 16 + struct tx_desc { __be32 send_addr; __be32 send_size; @@ -180,6 +182,7 @@ struct hip04_priv { /* written only by tx cleanup */ unsigned int tx_tail ____cacheline_aligned_in_smp; + unsigned int rx_tail ____cacheline_aligned_in_smp; }; static inline unsigned int tx_count(unsigned int head, unsigned int tail) @@ -187,6 +190,11 @@ static inline unsigned int tx_count(unsigned int head, unsigned int tail) return (head - tail) % (TX_DESC_NUM - 1); } +static inline unsigned int rx_count(unsigned int head, unsigned int tail) +{ + return (head - tail) % (RX_DESC_NUM - 1); +} + static void hip04_config_port(struct net_device *ndev, u32 speed, u32 duplex) { struct hip04_priv *priv = netdev_priv(ndev); @@ -363,6 +371,35 @@ static int hip04_set_mac_address(struct net_device *ndev, void *addr) return 0; } +static int hip04_alloc_rx_buffers(struct net_device *ndev, int cleaned_count) +{ + struct hip04_priv *priv = netdev_priv(ndev); + unsigned char *buf; + dma_addr_t phys; + int i = priv->rx_tail; + + while (cleaned_count) { + buf = netdev_alloc_frag(priv->rx_buf_size); + if (!buf) + break; + + phys =
dma_map_single(&ndev->dev, buf, + RX_BUF_SIZE, DMA_FROM_DEVICE); + if (dma_mapping_error(&ndev->dev, phys)) + break; + + priv->rx_buf[i] = buf; + priv->rx_phys[i] = phys; + hip04_set_recv_desc(priv, phys); + i = RX_NEXT(i); + cleaned_count--; + } + + priv->rx_tail = i; + + return 0; +} + static int hip04_tx_reclaim(struct net_device *ndev, bool force) { struct hip04_priv *priv = netdev_priv(ndev); @@ -482,8 +519,7 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget) struct sk_buff *skb; unsigned char *buf; bool last = false; - dma_addr_t phys; - int rx = 0; + int rx = 0, cleaned_count = 0; int tx_remaining; u16 len; u32 err; @@ -491,8 +527,10 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget) while (cnt && !last) { buf = priv->rx_buf[priv->rx_head]; skb = build_skb(buf, priv->rx_buf_size); - if (unlikely(!skb)) + if (unlikely(!skb)) { net_dbg_ratelimited("build_skb failed\n"); + goto done; + } dma_unmap_single(&ndev->dev, priv->rx_phys[priv->rx_head], RX_BUF_SIZE, DMA_FROM_DEVICE); @@ -519,18 +557,15 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget) rx++; } - buf = netdev_alloc_frag(priv->rx_buf_size); - if (!buf) - goto done; - phys = dma_map_single(&ndev->dev, buf, - RX_BUF_SIZE, DMA_FROM_DEVICE); - if (dma_mapping_error(&ndev->dev, phys)) - goto done; - priv->rx_buf[priv->rx_head] = buf; - priv->rx_phys[priv->rx_head] = phys; - hip04_set_recv_desc(priv, phys); - priv->rx_head = RX_NEXT(priv->rx_head); + + cleaned_count = rx_count(priv->rx_head, priv->rx_tail); + /* return some buffers to hardware , one at a time is too slow */ + if (++cleaned_count >= HIP04_RX_BUFFER_WRITE) { + hip04_alloc_rx_buffers(ndev, cleaned_count); + cleaned_count = 0; + } + if (rx >= budget) goto done; @@ -545,6 +580,10 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget) } napi_complete(napi); done: + cleaned_count = rx_count(priv->rx_head, priv->rx_tail); + if (cleaned_co
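The batching idea behind the patch, only return buffers to hardware once enough have been consumed, reduces to a small sketch (the ring size below is illustrative, the threshold is the patch's HIP04_RX_BUFFER_WRITE, and the actual refill is stubbed out):

```c
#include <assert.h>

#define RX_DESC_NUM 64                   /* illustrative ring size */
#define HIP04_RX_BUFFER_WRITE 16         /* refill threshold from the patch */

/* Consumed-but-not-refilled descriptors, shaped like the patch's
 * rx_count() helper. */
static unsigned int rx_count(unsigned int head, unsigned int tail)
{
    return (head - tail) % (RX_DESC_NUM - 1);
}

/* Refill only once a batch has accumulated; as the patch comment says,
 * returning buffers to hardware one at a time is too slow. */
static int should_refill(unsigned int head, unsigned int tail)
{
    return rx_count(head, tail) >= HIP04_RX_BUFFER_WRITE;
}
```

In the driver, hip04_alloc_rx_buffers() would then be called with the accumulated count whenever should_refill() becomes true, and once more at poll completion to drain any remainder.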
[PATCH net-next 5/6] net: hip04: remove the unnecessary check
Testing bytes_compl should be enough, because there is no way that pkts_compl could be 0 if bytes_compl is not 0. Signed-off-by: Ding Tianhong Cc: "David S. Miller" Cc: Eric Dumazet Cc: Arnd Bergmann Cc: Zhangfei Gao Cc: Dan Carpenter Cc: Joe Perches --- drivers/net/ethernet/hisilicon/hip04_eth.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c index ff9d19c..a7ab1d9 100644 --- a/drivers/net/ethernet/hisilicon/hip04_eth.c +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c @@ -440,7 +440,7 @@ static int hip04_tx_reclaim(struct net_device *ndev, bool force) smp_wmb(); /* Ensure tx_tail visible to xmit */ out: - if (pkts_compl || bytes_compl) + if (bytes_compl) netdev_completed_queue(ndev, pkts_compl, bytes_compl); if (unlikely(netif_queue_stopped(ndev)) && (count < (TX_DESC_NUM - 1))) -- 1.8.0
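The claim can be illustrated with a trivial model of the reclaim counters (nothing here is driver API; the point is just that any real skb has a length of at least one byte):

```c
#include <assert.h>

struct reclaim_stats {
    unsigned int pkts_compl;
    unsigned int bytes_compl;
};

/* Every completed packet contributes at least one byte, so
 * bytes_compl == 0 exactly when pkts_compl == 0 -- which is why
 * testing bytes_compl alone suffices before calling
 * netdev_completed_queue(). */
static void complete_pkt(struct reclaim_stats *s, unsigned int len)
{
    s->pkts_compl += 1;
    s->bytes_compl += len;      /* len >= 1 for any real skb */
}
```

The invariant (bytes_compl != 0) == (pkts_compl != 0) holds before and after any sequence of completions, so the dropped half of the condition was indeed redundant.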