Re: [PATCH net-next 4/6] net: hip04: Don't free the tx skb when the buffer is not ready for xmit
On 2015/4/15 22:19, Arnd Bergmann wrote:
> On Wednesday 15 April 2015 20:30:06 Ding Tianhong wrote:
>> @@ -489,6 +487,8 @@ static int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
>>
>>  	/* Ensure tx_head update visible to tx reclaim */
>>  	smp_wmb();
>> +	count++;
>> +	priv->tx_head = TX_NEXT(tx_head);
>>
>
> Something is wrong here, the comment does not match the code any more, because
> the smp_wmb() won't actually make the tx_head update visible.
>
> 	Arnd

Yes, the smp_wmb() can only make sure the tx buffer is ready to xmit.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 2/6] net: hip04: use the big endian for tx and rx desc
On 2015/4/15 22:13, Arnd Bergmann wrote:
> On Wednesday 15 April 2015 20:30:04 Ding Tianhong wrote:
>> The hip04 ethernet uses big endian for tx and rx, so set the desc to
>> big endian and remove the unused next_addr.
>
> I don't understand:
>
>> diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
>> index b0a7f03..6473462 100644
>> --- a/drivers/net/ethernet/hisilicon/hip04_eth.c
>> +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
>> @@ -132,19 +132,18 @@
>>  #define HIP04_MIN_TX_COALESCE_FRAMES	1
>>
>>  struct tx_desc {
>> -	u32 send_addr;
>> -	u32 send_size;
>> -	u32 next_addr;
>> -	u32 cfg;
>> -	u32 wb_addr;
>> +	__be32 send_addr;
>> +	__be32 send_size;
>> +	__be32 cfg;
>> +	__be32 wb_addr;
>>  } __aligned(64);
>
> I would think this is a hardware structure, does this not break
> access to the cfg and wb_addr fields if you remove next_addr?
>
> 	Arnd

Good catch, I made a mistake here, thanks.
Re: [PATCH net-next 0/6] net: hip04: fix some problem for hip04 drivers
On 2015/4/15 22:25, Arnd Bergmann wrote:
> On Wednesday 15 April 2015 20:30:02 Ding Tianhong wrote:
>> The patch series fixes the issues of the hip04 driver, and adds
>> some optimizations according to good suggestions.
>
> Thanks, that looks much better, except for patch 4 that I commented on.

I will check and fix it.

> I had at some point sent a patch that is archived at
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-January/318120.html
>
> I believe that one is also still needed, but I have not tested whether it
> is correct. Can you have a look at the patch from back then and see if it
> works, or if you find something wrong about it?
>
> I'm sending the unmodified patch from then here again for you to apply
> or comment. It will have to be rebased on top of your current changes.
>
> 	Arnd

Thanks for the suggestion. I noticed that I had missed this patch from the mailing list in my series; I need time to check the code and try to test it. Thanks for the work. :)

Ding

> 8<
> Subject: [PATCH] net/hip04: refactor interrupt masking
>
> The hip04 ethernet driver currently acknowledges all interrupts directly
> in the interrupt handler, and leaves all interrupts except the RX data
> enabled the whole time. This causes multiple problems:
>
> - When more packets come in between the original interrupt and the
>   NAPI poll function, we will get an extraneous RX interrupt as soon
>   as interrupts are enabled again.
>
> - The two error interrupts are by definition combining all errors that
>   may have happened since the last time they were handled, but just
>   acknowledging the irq without dealing with the cause of the condition
>   makes it come back immediately. In particular, when NAPI is intentionally
>   stalling the rx queue, this causes a storm of "rx dropped" messages.
>
> - The access to the 'reg_inten' field in hip_priv is used for serializing
>   access, but is in fact racy itself.
>
> To deal with these issues, the driver is changed to only acknowledge
> the IRQ at the point when it is being handled. The RX interrupts now get
> acked right before reading the number of received frames, to keep spurious
> interrupts to the absolute minimum without losing interrupts.
>
> As there is no tx-complete interrupt, only an informational tx-dropped
> one, we now ack that in the tx reclaim handler, hoping that clearing
> the descriptors has also eliminated the error condition.
>
> The only place that reads reg_inten now just relies on
> napi_schedule_prep() to return whether NAPI is active or not, and
> the poll() function is then going to ack and re-enable the IRQ if
> necessary.
>
> Finally, when we disable interrupts, they are now all disabled
> together, in order to simplify the logic and to avoid getting
> rx-dropped interrupts when NAPI is in control of the rx queue.
>
> Signed-off-by: Arnd Bergmann
>
> diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
> index 525214ef5984..83773247a368 100644
> --- a/drivers/net/ethernet/hisilicon/hip04_eth.c
> +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
> @@ -56,6 +56,8 @@
>  #define RCV_DROP		BIT(7)
>  #define TX_DROP			BIT(6)
>  #define DEF_INT_ERR		(RCV_NOBUF | RCV_DROP | TX_DROP)
> +#define DEF_INT_RX		(RCV_INT | RCV_NOBUF | RCV_DROP)
> +#define DEF_INT_TX		(TX_DROP)
>  #define DEF_INT_MASK		(RCV_INT | DEF_INT_ERR)
>
>  /* TX descriptor config */
> @@ -154,7 +156,6 @@ struct hip04_priv {
>  	unsigned int port;
>  	unsigned int speed;
>  	unsigned int duplex;
> -	unsigned int reg_inten;
>
>  	struct napi_struct napi;
>  	struct net_device *ndev;
> @@ -303,17 +304,15 @@ static void hip04_mac_enable(struct net_device *ndev)
>  	val |= GE_RX_PORT_EN | GE_TX_PORT_EN;
>  	writel_relaxed(val, priv->base + GE_PORT_EN);
>
> -	/* clear rx int */
> -	val = RCV_INT;
> -	writel_relaxed(val, priv->base + PPE_RINT);
> +	/* clear prior interrupts */
> +	writel_relaxed(DEF_INT_MASK,
priv->base + PPE_RINT);
>
>  	/* config recv int */
>  	val = GE_RX_INT_THRESHOLD | GE_RX_TIMEOUT;
>  	writel_relaxed(val, priv->base + PPE_CFG_RX_PKT_INT);
>
>  	/* enable interrupt */
> -	priv->reg_inten = DEF_INT_MASK;
> -	writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);
> +	writel_relaxed(DEF_INT_MASK, priv->base + PPE_INTEN);
>  }
>
>  static void hip04_mac_disable(struct net_device *ndev)
> @@ -322,8 +321,7 @@ static void hip04_mac_disable(struct net_device *ndev)
>  	u32 val;
>
>  	/* disable int */
> -	priv->reg_inten &= ~(DEF_INT_MASK);
> -	writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);
> +	writel_relaxed(0, priv->base + PPE_INTEN);
>
>  	/* disable tx & rx */
>
Re: [PATCH 1/7] net: refactor __netif_receive_skb_core
On 15.04, Jesper Dangaard Brouer wrote:
> > Out of curiosity, what is actually the performance impact of all
> > this? We were just arguing on a different matter about two more
> > instructions in the fast path; here it's refactoring the whole
> > function into several ones, I presume gcc won't inline it.
>
> Pablo asked me to performance test this change. Full test report below.
>
> The performance effect (of this patch) depends on the gcc compiler
> version.
>
> Two tests:
>  1. IP forwarding (unloaded netfilter modules)
>  2. Early drop in the iptables "raw" table
>
> With GCC 4.4.7, which does not inline the new functions
> (__netif_receive_skb_ingress and __netif_receive_skb_finish), the
> performance impact/regression is definitely measurable.
>
> With GCC 4.4.7:
>  1. IP forwarding: +25.18 ns (slower) (-27776 pps)
>  2. Early drop:    +7.55 ns (slower) (-66577 pps)
>
> With GCC 4.9.1, the new functions get inlined, thus the refactoring
> split-up of __netif_receive_skb_core() is basically "cancelled".
> Strangely there is a small improvement for forwarding, likely due to
> some lucky assembler reordering that gives fewer icache/fetch misses.
> The early-drop improvement is below accuracy levels and cannot be
> trusted.
>
> With GCC 4.9.1:
>  1. IP forwarding: -10.05 ns (faster) (+17532 pps)
>  2. Early drop:    -1.54 ns (faster) (+16216 pps), below accuracy levels
>
> I don't know what to conclude, as the result depends on the compiler
> version... but these kinds of changes do affect performance, and should
> be tested/measured.

Thanks Jesper. This effect without inlining was to be expected, I guess.

The interesting question would be a patch that uses nf_hook() without the okfn callback, moves the ingress qdisc to register with the netfilter ingress hook, and moves the TTL tracking of ing_filter() to the ingress qdisc, where it belongs. My expectation would be that this would result in an overall performance gain.
Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
On 16.04, Herbert Xu wrote:
> On Thu, Apr 16, 2015 at 06:24:00AM +0100, Patrick McHardy wrote:
> >
> > Netfilter may change the contents of the packet, even change its size.
> > It is *really* hard to do this while keeping the original fragments
> > intact.
>
> Perhaps we should provide better helpers to facilitate this?
>
> So instead of directly manipulating the content of the skb you
> would do so through helpers, and the helpers can then try to do
> sensible things with the fragments.

Yeah, but it's going to get pretty complicated. There can be multiple size changes and modifications for every packet, so we would need to keep track of the size modifications that have already occurred to map to the correct position in the frag_list. The modifications would then have to be performed both on the reassembled skb, since it might again be matched against later on, and on the frag_list, potentially split over multiple skbs.

When enlarging the packet, I guess we would insert a new (small) fragment to keep the geometry of the original fragments. In extreme cases like H.323 NAT this might result in a huge number of new, very small fragments. And it might still break the geometry when the size difference isn't representable as a multiple of 8.

When reducing the skb size, we might have no choice but to change the geometry of at least one fragment and would have to shift a lot of data around.

This is very complicated. If we really want to do this, it's a lot easier to just keep note of the full original geometry and refragment to those exact sizes while potentially adding or removing things at the end.

My personal opinion is: why bother. The only thing that cares about the fragment sizes is PMTUD, and that's what this patch is fixing in a much simpler fashion.
Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
On Thu, Apr 16, 2015 at 06:24:00AM +0100, Patrick McHardy wrote:
>
> Netfilter may change the contents of the packet, even change its size.
> It is *really* hard to do this while keeping the original fragments
> intact.

Perhaps we should provide better helpers to facilitate this?

So instead of directly manipulating the content of the skb you would do so through helpers, and the helpers can then try to do sensible things with the fragments.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
On 16.04, Herbert Xu wrote:
> David Miller wrote:
> >
> > Because then there is no ambiguity at all, you preserve on output
> > exactly what you had on input. The same geometry, the same
> > everything. No special checks, no max frag len, none of this crap.
> > Those are all hacks trying to work around the _fundamental_ issue
> > which is that we potentially change the thing when we refrag.
>
> Agreed. Doing anything other than preserving the original geometry
> is simply wrong.
>
> However, this doesn't mean that netfilter has to process each
> fragment. What we could do is to preserve the original fragments
> in frag_list and then process the overall skb as a unit in netfilter.
>
> On output we simply fragment according to the original frag_list.
>
> The only thing to watch out for is to eliminate anything in the
> middle that tries to linearise the skb.

Netfilter may change the contents of the packet, even change its size. It is *really* hard to do this while keeping the original fragments intact.
Re: [PATCH] neighbour.c: Avoid GC directly after state change
Hi,

Ulf Samuelsson wrote:
> The desired functionality is that if communication stops,
> you want to send out ARP probes, before the entry is deleted.
>
> The current (pseudo) code of the neigh timer is:
>
> if (state & NUD_REACHABLE) {
>     if (now <= "confirmed" + "reachable_time") {
>         ... /* We are OK */
>     } else if (now < "used" + DELAY_PROBE_TIME) { /* Never happens */
>         state = NUD_DELAY;
>     } else {
>         state = NUD_STALE;
>         notify = 1;
>     }
>
> We never see the state being changed from REACHABLE to DELAY,
> so the probes are not being sent out; instead you always go
> from REACHABLE to STALE.

That's right.

> DELAY_PROBE_TIME is set to (5 x HZ), and "used"
> seems to be only set by the periodic_work routine
> when the neigh entry is in STALE state, and then it is too late.
> It is also set by "arp_find", which is used by "broken" devices.

In STALE state, neigh->used is set by neigh_event_send(), called by neigh_resolve_output() via neigh->output().

> In practice, the second condition "(now < "used" + DELAY_PROBE_TIME)" is
> never used.
> What is the intention of this test?

That's right. It is NOT used in normal conditions unless the reachable time is too short.

> By adding a new test + parameter, we would get the desired functionality,
> and no need to listen for notifications or do ARP state updates from
> applications.
>
>  if (now <= "confirmed" + "reachable_time") {
>      ... /* We are OK */
> +} else if (now <= "confirmed" + "reprobe_time") {
> +    state = NUD_DELAY;
>  } else if (now < "used" + DELAY_PROBE_TIME) { /* Never happens */
>      state = NUD_DELAY;
>  } else {
>      state = NUD_STALE;
>      notify = 1;
>  }
>
> This way the entry would remain in REACHABLE while normal communication
> occurs, then it would enter DELAY state to probe, and if that fails, it
> goes to STALE state.

No, that is not what REACHABLE and DELAY mean.

From RFC 2461:
| REACHABLE	Roughly speaking, the neighbor is known to have been
|		reachable recently (within tens of seconds ago).
:
| STALE		The neighbor is no longer known to be reachable but
|		until traffic is sent to the neighbor, no attempt
|		should be made to verify its reachability.
| DELAY		The neighbor is no longer known to be reachable, and
|		traffic has recently been sent to the neighbor.
|		Rather than probe the neighbor immediately, however,
|		delay sending probes for a short while in order to
|		give upper layer protocols a chance to provide
|		reachability confirmation.

--
Hideaki Yoshifuji
Technical Division, MIRACLE LINUX CORPORATION
[PATCH] dsa: mv88e6xxx: Fix error handling in mv88e6xxx_set_port_state
Return the correct error code if _mv88e6xxx_reg_read returns an error.

Fixes: facd95b2e0ec0 ("net: dsa: mv88e6xxx: Add Hardware bridging support")
Signed-off-by: Guenter Roeck
---
 drivers/net/dsa/mv88e6xxx.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index fc8d3b6ffe8e..f64186a5f634 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -908,8 +908,10 @@ static int mv88e6xxx_set_port_state(struct dsa_switch *ds, int port, u8 state)
 	mutex_lock(&ps->smi_mutex);

 	reg = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_CONTROL);
-	if (reg < 0)
+	if (reg < 0) {
+		ret = reg;
 		goto abort;
+	}

 	oldstate = reg & PORT_CONTROL_STATE_MASK;
 	if (oldstate != state) {
--
2.1.0
Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
David Miller wrote:
>
> Because then there is no ambiguity at all, you preserve on output
> exactly what you had on input. The same geometry, the same
> everything. No special checks, no max frag len, none of this crap.
> Those are all hacks trying to work around the _fundamental_ issue
> which is that we potentially change the thing when we refrag.

Agreed. Doing anything other than preserving the original geometry is simply wrong.

However, this doesn't mean that netfilter has to process each fragment. What we could do is to preserve the original fragments in frag_list and then process the overall skb as a unit in netfilter.

On output we simply fragment according to the original frag_list.

The only thing to watch out for is to eliminate anything in the middle that tries to linearise the skb.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: ioremap_uc() followed by set_memory_wc() - burying MTRR
On Thu, Apr 16, 2015 at 01:58:16AM +0200, Luis R. Rodriguez wrote:
>
> An alternative... is to just ioremap_wc() the entire region, including
> MMIO registers, for these old devices. I see one ethernet driver that does
> this, myri10ge, and am curious how and why they ended up deciding this
> and if they have run into any issues. I wonder if this is a reasonable
> compromise for these 2 remaining corner cases.

For myri10ge, it's a performance thing. Descriptor rings are in NIC memory BAR0, not in host memory. Say, to send a packet, the driver writes the send descriptor to the ioremap'd NIC memory. It is a multi-word descriptor. So, to send it as one PCIe MWr transaction, the driver maps the whole BAR0 as WC and does "copy descriptor; wmb". Without WC, descriptors would end up as multiple 4B or 8B MWr packets to the NIC, which has a pretty big performance impact on this particular NIC.

Most registers that do not want WC are actually in BAR2, which is not mapped as WC. For registers that are in BAR0, we do "write to the register; wmb". If we want to wait till the NIC has seen the write, we do "write; wmb; read".

This approach has worked for this device for many years. I cannot say whether it works for other devices, though.
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Thu, 2015-04-16 at 12:20 +0800, Herbert Xu wrote:
> Eric Dumazet wrote:
> >
> > We already have netdev->gso_max_size and netdev->gso_max_segs
> > which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs
>
> It is quite dangerous to attempt tricks like this because a
> tc redirection or netfilter nat could change the destination
> device, rendering such hints incorrect.

Right, but we are talking about performance hints on a quite basic VM setup. Here the guest would use xen and this hint would apply.
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
Eric Dumazet wrote:
>
> We already have netdev->gso_max_size and netdev->gso_max_segs
> which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs

It is quite dangerous to attempt tricks like this because a tc redirection or netfilter nat could change the destination device, rendering such hints incorrect.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: tg3 NIC driver bug in 3.14.x under Xen [and 3 more messages]
On 4/15/2015 3:54 AM, Ian Jackson wrote:
> Prashant writes ("Re: tg3 NIC driver bug in 3.14.x under Xen [and 3 more messages]"):
>> I tried to reproduce the problem on a 32-bit 3.14.34 stable kernel
>> baremetal, with iommu=soft swiotlb=force, but no luck, no drops or
>> errors. I did not try with Xen 64-bit yet. Btw I need a PCIe analyzer
>> trace to confirm the problem. Is it feasible to capture at your end?
>
> In private correspondence with Prashant we have established that
> Prashant was using a different kernel configuration. Prashant provided
> me with their kernel and module binaries, which work in my environment.
>
> I have also established that I can reproduce the problem with 3.14.37
> (`iommu=soft swiotlb=force' baremetal => all rx wholly corrupted).
>
> The kernel config I was using is here:
> http://logs.test-lab.xenproject.org/osstest/logs/50387/build-i386-pvops/build/config
>
> As far as I am aware no-one at Broadcom has attempted a build and test
> using my kernel config.

Ian, using your config we are able to recreate the problem that you are seeing. The driver finds the RX data buffer to be all zero; with an analyzer trace we are seeing the chip is DMA'ing valid RX data buffer contents to the host, but once the driver tries to read this DMA area, it sees all zeros, which is the reason for the corruption. This is only for the RX data buffer; the RX descriptor and status block update DMA regions have valid contents.

This is unlikely to be a chip or driver issue, as the chip is doing the correct DMA but the corruption occurs before the driver reads it. We would request the iommu experts to take a look and suggest what can be done next. We are more than willing to try any changes in the driver; I have added a few more team members who will work with you if needed.

Thanks.
Re: ioremap_uc() followed by set_memory_wc() - burying MTRR
On Thu, 2015-04-16 at 01:58 +0200, Luis R. Rodriguez wrote:
> Hey Andy, thanks for your review; adding Hyong-Youb Kim for review of the
> full-range ioremap_wc() idea below.
>
> On Wed, Apr 15, 2015 at 06:38:51PM -0400, Andy Walls wrote:
> > Hi All,
> >
> > On Mon, 2015-04-13 at 19:49 +0200, Luis R. Rodriguez wrote:
> > > From the beginning it seems only framebuffer devices used MTRR/WC,
> > [snip]
> > > The ivtv device is a good example of the worst type of
> > > situations these days. So perhaps __arch_phys_wc_add() and an
> > > ioremap_ucminus() might be something more than transient unless
> > > hardware folks get a good memo or already know how to just Do The
> > > Right Thing (TM).
> >
> > Just to reiterate a subtle point, use of the ivtvfb is *optional*. A
> > user may or may not load it. When the user does load the ivtvfb driver,
> > the ivtv driver has already been initialized and may have functions of
> > the card already in use by userspace.
>
> I suspected this, and it's why I note that a rewrite to address a clean
> split with separate ioremaps seems rather difficult in this case.
>
> > Hopefully no one is trying to use the OSD as framebuffer and the video
> > decoder/output engine for video display at the same time.
>
> The worst-case concern I have is the implications of having overlapping
> ioremap() calls (as proposed in my last reply) for different memory types
> and having the different virtual memory addresses used by different parts
> of the driver. It's not clear to me what the hardware implications of this
> are.
>
> > But the video
> > decoder/output device nodes may already be open for performing ioctl()
> > functions, so unmapping the decoder IO space out from under them, when
> > loading the ivtvfb driver module, might not be a good thing.
>
> Using overlapping ioremap() calls with different memory types would address
> this concern provided hardware won't barf both on the device and CPU.
> Hardware folks could provide feedback, or an ivtvfb user could test the
> patch supplied on both non-PAT and PAT systems. Even so, who knows, this
> might work on some systems while not on others; only hardware folks would
> know.

The CX2341[56] firmware+hardware has a track record for being really picky about system hardware. Its primary symptoms are for the DMA engine or Mailbox protocol to get hung up. So yeah, it could barf easily on some users.

> An alternative... is to just ioremap_wc() the entire region, including
> MMIO registers, for these old devices.

That's my thought; as long as implementing PCI write-then-read can force writes to be posted, and setting that many pages as WC doesn't cause some sort of PAT resource exhaustion. (I know very little about PAT.)

> I see one ethernet driver that does
> this, myri10ge, and am curious how and why they ended up deciding this
> and if they have run into any issues. I wonder if this is a reasonable
> compromise for these 2 remaining corner cases.
>
>   Luis

Regards,
Andy
Re: [net-next 03/14] i40e: Add support to program FDir SB rules for VF from PF through ethtool
On Wed, 2015-04-15 at 11:58 +0300, Or Gerlitz wrote:
> On Wed, Apr 15, 2015 at 7:51 AM, Jeff Kirsher wrote:
> > From: Anjali Singhai Jain
> >
> > With this patch we can now add Flow Director sideband rules for a VF from
> > its PF. Here is an example of how it can be done when VF id = 5 and
> > queue = 2:
> >
> > "ethtool -N ethx flow-type udp4 src-ip x.x.x.x dst-ip y.y.y.y src-port p1 dst-port p2 action 2 user-def 5"
> >
> > User-def specifies the VF id and action specifies the queue.
>
> Hi Jeff,
>
> We're too eager to have support for PF-controlled ACL / over-ruling of VF
> traffic.
>
> I don't think the way to go here is just to use the user-defined
> "data" field of struct ethtool_flow_ext to represent a VF, since a VF
> is a well-defined, multiple-vendor-supported construct. At least on the
> command line it would be better to see "vf 5" and not "user-def 5", and
> maybe also, down in the user-kernel API, somehow enhance the ethtool
> flow-related structures such that the VF ID is specified?

Thanks for the feedback Or, I will work with Anjali and Jesse to get your suggestions implemented in follow-on patches.
Re: CodingStyle parenthesis alignment: was: Re: align to open paren
On Wed, 2015-04-15 at 21:54 +0000, Hubbe, Allen wrote:
> I ran with the --fix option, and it changed every rejected indent to
> match the column of the open paren. That's probably what you want,
> since it's the most consistent with the previous behavior. The
> difference is that it does not fix lines that are no longer rejected,
> like if they are indented by two per paren. The test file and output
> are attached.

There might be some utility in adding a --style= switch so checkpatch can appropriately warn on K&R vs linux vs bsd vs gnu vs otb, etc...
[v3] skbuff: Do not scrub skb mark within the same name space
On Wed, Apr 15, 2015 at 05:41:26PM +0200, Nicolas Dichtel wrote:
> On 15/04/2015 15:57, Herbert Xu wrote:
> > On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote:
> [snip]
> > Subject: skbuff: Do not scrub skb mark within the same name space
> >
> > The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels:
>
> Maybe add a Fixes tag?
> Fixes: ea23192e8e57 ("tunnels: harmonize cleanup done on skb on rx path")
>
> > harmonize cleanup done on skb on rx path") broke anyone trying to
> > use netfilter marking across IPv4 tunnels. While most of the
> > fields that are cleared by skb_scrub_packet don't matter, the
> > netfilter mark must be preserved.
> >
> > This patch rearranges skb_scurb_packet to preserve the mark field.
>
> nit: s/scurb/scrub
>
> Else it's fine for me.

Sure.

PS I used the wrong email for James the first time around. So let me repeat the question here. Should secmark be preserved or cleared across tunnels within the same name space? In fact, do our security models even support name spaces?

---8<---
The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels:
harmonize cleanup done on skb on rx path") broke anyone trying to
use netfilter marking across IPv4 tunnels. While most of the
fields that are cleared by skb_scrub_packet don't matter, the
netfilter mark must be preserved.

This patch rearranges skb_scrub_packet to preserve the mark field.
Fixes: ea23192e8e57 ("tunnels: harmonize cleanup done on skb on rx path")
Signed-off-by: Herbert Xu

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3b6e583..a185427 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4124,19 +4124,22 @@ EXPORT_SYMBOL(skb_try_coalesce);
  */
 void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 {
-	if (xnet)
-		skb_orphan(skb);
 	skb->tstamp.tv64 = 0;
 	skb->pkt_type = PACKET_HOST;
 	skb->skb_iif = 0;
 	skb->ignore_df = 0;
 	skb_dst_drop(skb);
-	skb->mark = 0;
 	skb_sender_cpu_clear(skb);
 	skb_init_secmark(skb);
 	secpath_reset(skb);
 	nf_reset(skb);
 	nf_reset_trace(skb);
+
+	if (!xnet)
+		return;
+
+	skb_orphan(skb);
+	skb->mark = 0;
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);

--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: ioremap_uc() followed by set_memory_wc() - burying MTRR
Hey Andy, thanks for your review, adding Hyong-Youb Kim for review of the full range ioremap_wc() idea below. On Wed, Apr 15, 2015 at 06:38:51PM -0400, Andy Walls wrote: > Hi All, > > On Mon, 2015-04-13 at 19:49 +0200, Luis R. Rodriguez wrote: > > From the beginning it seems only framebuffer devices used MTRR/WC, > [snip] > > The ivtv device is a good example of the worst type of > > situations and these days. So perhap __arch_phys_wc_add() and a > > ioremap_ucminus() might be something more than transient unless hardware > > folks > > get a good memo or already know how to just Do The Right Thing (TM). > > Just to reiterate a subtle point, use of the ivtvfb is *optional*. A > user may or may not load it. When the user does load the ivtvfb driver, > the ivtv driver has already been initialized and may have functions of > the card already in use by userspace. I suspected this and its why I note that a rewrite to address a clean split with separate ioremap seems rather difficult in this case. > Hopefully no one is trying to use the OSD as framebuffer and the video > decoder/output engine for video display at the same time. Worst case concern I have also is the implications of having overlapping ioremap() calls (as proposed in my last reply) for different memory types and having the different virtual memory addresse used by different parts of the driver. Its not clear to me what the hardware implications of this are. > But the video > decoder/output device nodes may already be open for performing ioctl() > functions so unmapping the decoder IO space out from under them, when > loading the ivtvfb driver module, might not be a good thing. Using overlapping ioremap() calls with different memory types would address this concern provided hardware won't barf both on the device and CPU. Hardware folks could provide feedback or an ivtvfb user could test the patch supplied on both non-PAT and PAT systems. 
Even so, who knows, this might work on some systems while not on others; only hardware folks would know.

An alternative... is to just ioremap_wc() the entire region, including MMIO registers, for these old devices. I see one ethernet driver that does this, myri10ge, and am curious how and why they ended up deciding this and whether they have run into any issues. I wonder if this is a reasonable compromise for these 2 remaining corner cases.

Luis
Re: make mandocs build failure with next-20150407
On Wed, Apr 8, 2015 at 7:41 AM, Jim Davis wrote:
> DOCPROC Documentation/DocBook/80211.xml
> Error(.//include/net/mac80211.h:329): Cannot parse enum!
> Error(.//include/net/mac80211.h:367): Cannot parse enum!
> Warning(.//include/net/mac80211.h:381): No description found for
> parameter 'type'
> --

This error started to appear in Linus's tree; make xmldocs stops after this error. Running it again lets make xmldocs process the rest of the files.

Adding To: netdev and the patch authors to report the issue.

Masanari

> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
[PATCH net] bpf: fix two bugs in verification logic when accessing 'ctx' pointer
1. The first bug is a silly mistake. It broke tracing examples and prevented simple bpf programs from loading. In the following code:

	if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
	} else if (...) {
		// this part should have been executed when
		// insn->code == BPF_W and insn->imm != 0
	}

Obviously it's not doing that. So simple instructions like:

	r2 = *(u64 *)(r1 + 8)

will be rejected. Note the comments in the code around these branches were and still are valid and indicate the true intent. Replace it with:

	if (BPF_SIZE(insn->code) != BPF_W)
		continue;
	if (insn->imm == 0) {
	} else if (...) {
		// now this code will be executed when
		// insn->code == BPF_W and insn->imm != 0
	}

2. The second bug is more subtle. If malicious code is using the same dest register as source register, the checks designed to prevent the same instruction from being used with different pointer types will fail to trigger, since we were assigning src_reg_type when it was already overwritten by check_mem_access(). The fix is trivial: just move the line

	src_reg_type = regs[insn->src_reg].type;

before check_mem_access(). Add a new 'access skb fields bad4' test to check this case.

Fixes: 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields")
Signed-off-by: Alexei Starovoitov
---
I would love to add a testcase for bug #1 as well, but it needs bigger refactoring of test_verifier, so will do it after net-next reopens.
 kernel/bpf/verifier.c       |  9 +++--
 samples/bpf/test_verifier.c | 22 ++
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 66bec36ec1ec..47dcd3aa6e23 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1637,6 +1637,8 @@ static int do_check(struct verifier_env *env)
 			if (err)
 				return err;
 
+			src_reg_type = regs[insn->src_reg].type;
+
 			/* check that memory (src_reg + off) is readable,
 			 * the state of dst_reg will be updated by this func
 			 */
@@ -1646,9 +1648,12 @@
 			if (err)
 				return err;
 
-			src_reg_type = regs[insn->src_reg].type;
+			if (BPF_SIZE(insn->code) != BPF_W) {
+				insn_idx++;
+				continue;
+			}
 
-			if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
+			if (insn->imm == 0) {
 				/* saw a valid insn
 				 * dst_reg = *(u32 *)(src_reg + off)
 				 * use reserved 'imm' field to mark this insn
diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
index 9ab645698ffb..12f3780af73f 100644
--- a/samples/bpf/test_verifier.c
+++ b/samples/bpf/test_verifier.c
@@ -721,6 +721,28 @@ static struct bpf_test tests[] = {
 		.errstr = "different pointers",
 		.result = REJECT,
 	},
+	{
+		"access skb fields bad4",
+		.insns = {
+			BPF_JMP_IMM(BPF_JGE, BPF_REG_1, 0, 3),
+			BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1,
+				    offsetof(struct __sk_buff, len)),
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_EXIT_INSN(),
+			BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+			BPF_LD_MAP_FD(BPF_REG_1, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+			BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+			BPF_EXIT_INSN(),
+			BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+			BPF_JMP_IMM(BPF_JA, 0, 0, -13),
+		},
+		.fixup = {7},
+		.errstr = "different pointers",
+		.result = REJECT,
+	},
 };
 
 static int probe_filter_length(struct bpf_insn *fp)
-- 
1.7.9.5
Re: [Question] TCP stack performance decrease since 3.14
On Wed, 2015-04-15 at 17:38 -0400, rapier wrote:
> I believe I properly disabled CPU power management in the bios (the
> lenovo bios isn't terribly clear on this). I then booted with
> processor.max_cstate=1 idle=poll (also tried with
> intel_idle.max_cstate=0 and combinations thereof). Still seeing reduced
> performance in comparison to 3.14. I'll try using /dev/cpu_dma_latency
> instead when I get in tomorrow. If you have other suggestions to verify
> c-state I'd be happy to hear them.
>
> As a note, 3.2 tests as being more than 18% faster in the above categories.

Wait a minute, are you testing tcp on a single laptop over loopback?

Multiple netperf consume a lot of ram, and any change in kernel vmlinux size can impact the performance simply because your cpu cache is too small to cope with the increase.

Between 3.2 and 4.0 kernel, TCP networking changes are maybe 2% of overall changes.

2.6.1 kernel in 32bit mode will be much faster you know. Maybe 40% faster.

# ls -l /boot/vmlinuz-*
-rwxr-xr-x 1 root root 3719696 2015-04-15 14:09 /boot/vmlinuz-3.14.0-smp
-rwxr-xr-x 1 root root 3911520 2015-04-15 14:30 /boot/vmlinuz-4.0.0-smp
RE: CodingStyle parenthesis alignment: was: Re: align to open paren
> -Original Message-
> From: Joe Perches [mailto:j...@perches.com]
> Sent: Wednesday, April 15, 2015 5:07 PM
> To: Hubbe, Allen
> Cc: a...@canonical.com; LKML; netdev
> Subject: CodingStyle parenthesis alignment: was: Re: align to open paren
>
> On Wed, 2015-04-15 at 20:34 +, Hubbe, Allen wrote:
> > Hello Andy, Joe,
>
> Hello Allen.
>
> As this is a discussion better suited for linux
> development lists I've cc'd LKML and netdev.
>
> > I would like to find the origin of the decision to align to the open
> > paren in Linux.
>
> Mostly it's a style decision shared by net/ and
> drivers/net/ and a few other directories.
>
> It's a checkpatch --strict only test so it's not on
> by default except in net/ and drivers/net/.

Thanks.

> > I found the commit where this was added a few years ago, d1fe9c0.
> > The email thread just says the style should be that way, but not why
> > or how the decision was made. The how and the why is what I'm after,
> > since I have a critique of the chosen style.
> >
> > I realize there is a lot of code written using this style, and
> > changing it would be disruptive. I certainly would not ask for that.
> >
> > Indenting to the open paren might cause ambiguous indentation between
> > the parenthesized expression and the next logical thing. In the
> > following, next_thing aligns to the same column as baz, even though
> > baz is part of the condition expression, and next_thing is the
> > continued statement.
> >
> > = if (foo(bar,
> > =         baz))
> > =         next_thing();
> >
> > It would be necessary to reindent to maintain style, even though the
> > code of the next lines is the same. This has consequences like
> > changing the blame, for instance. In the following, 4 + 5 is the bug,
> > but whoever replaced the global with an instance variable gets the
> > blame.
>
> blame is overrated.
> git blame -w ignores most of the whitespace noise.

> > = global_variable = foo(bar,
> > =                       baz(1 + 3),
> > =                       baz(4 + 5) + 6);
> >
> > with
> >
> > = obj->var = foo(bar,
> > =                baz(1 + 3),
> > =                baz(4 + 5) + 6);
> >
> > I'm used to the default in many editors, which is to indent twice
> > following each open paren, as opposed to once following each open
> > brace or continued statement. It is a simpler rule for automatic
> > formatting to implement. It also avoids mixing tabs and spaces, the
> > spacing is unambiguous, and to maintain style, there is no need to
> > re-indent lines following an edit if the position of the open paren
> > changes.
> >
> > It's interesting to me that this style is only enforced by
> > checkpatch.pl --strict. It is not in Documents/CodingStyle. That
> > being the case, would it be acceptable to relax the rule in
> > checkpatch.pl to accept either style? If not, well, I'll just accept
> > the chosen style and fix my code.
>
> I personally don't care much either way.
>
> > If the following looks acceptable to you, I'll submit the patch to the
> > list.
>
> The patch most likely wouldn't be appropriate for
> net/ and drivers/net/ where the developers seem to
> strongly prefer alignment to open parenthesis.

I don't want to annoy the net developers. Is it that the style is the same across the kernel, just more strongly enforced in net, or do different subsystem maintainers enforce different styles? Again, I'll be happy to just accept the kernel style if that's the case.

> Also I think the open parenthesis count isn't right
> as it would ask for multiple indents for code like:
>
> 	if ((foo(bar)) &&
> 	    (baz(bar))) {

As this is properly indented to the open paren, it is accepted.

> I think checkpatch would now want:
>
> 	if ((foo(bar)) &&
> 			(baz(bar))) {

CHECK: Alignment should match open parenthesis, or be twice indented for each open parenthesis
#51: FILE: foo.c:51:
+	if ((foo(bar)) &&
+	    (baz(bar))) {

I count three opens and two closes in the first line, leaving one open paren, but the second line is further indented by four instead of two. That is rejected, but it accepts this:

	if ((foo(bar)) &&
			(baz(bar))) {

> and the --fix option would be wrong too.

I ran with the --fix option, and it changed every rejected indent to match the column of the open paren. That's probably what you want, since it's the most consistent with the previous behavior. The difference is that it does not fix lines that are no longer rejected, like if they are indented by two per paren. The test file and output are attached.

> cheers, Joe
>
> > Thanks,
> > Allen
> >
> > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> > index d124359..8e49125 100755
> > --- a/scripts/checkpatch.pl
> > +++ b/scripts/checkpatch.pl
> > @@ -1834,6 +1834,15 @@ sub pos_last_openparen {
> > 	return length(expand_tabs(substr($lin
Re: [net-next 00/14][pull request] Intel Wired LAN Driver Updates 2015-04-14
From: Jeff Kirsher
Date: Tue, 14 Apr 2015 21:51:37 -0700

> This series contains updates to i40e and i40evf.

Pulled, thanks Jeff.
Re: [Question] TCP stack performance decrease since 3.14
On 4/15/15 5:01 PM, Eric Dumazet wrote:

On Wed, 2015-04-15 at 15:31 -0400, rapier wrote:

All,

First, my apologies if this came up previously but I couldn't find anything using a keyword search of the mailing list archive.

As part of the ongoing work with web10g I need to come up with baseline TCP stack performance for various kernel revisions. Using netperf and super_netperf* I've found that performance for TCP_CC, TCP_RR, and TCP_CRR has decreased since 3.14.

         3.14     3.18     4.0      decrease %
TCP_CC   183945   179222   175793   4.4%
TCP_RR   594495   585484   561365   5.6%
TCP_CRR  98677    96726    93026    5.7%

Stream tests have remained the same from 3.14 through 4.0.

All tests were conducted on the same platform from clean boot with stock kernels.

So my questions are:

Has anyone else seen this or is this a result of some weirdness on my system or artifact of my tests?

If others have seen this or is just simply to be expected (from new features and the like), is it due to the TCP stack itself or other changes in the kernel?

If so, is there any way to mitigate the effect of this via stack tuning, kernel configuration, etc?

Thanks!

Chris

* The above results are the average of 10 iterations of super_netperf for each test. I can run more iterations to verify the results but it seems consistent. The number of parallel processes for each test was tuned to produce the maximum test result. In other words, enough to push things but not enough to cause performance hits due to being cpu/memory/etc bound. If anyone wants the full results and test scripts just let me know.

--

Make sure you do not hit a c-state issue. I've seen improvements in the stack translate to longer wait times, and cpu takes longer to exit deep c-state.

I believe I properly disabled CPU power management in the bios (the lenovo bios isn't terribly clear on this). I then booted with processor.max_cstate=1 idle=poll (also tried with intel_idle.max_cstate=0 and combinations thereof). Still seeing reduced performance in comparison to 3.14.
I'll try using /dev/cpu_dma_latency instead when I get in tomorrow. If you have other suggestions to verify c-state I'd be happy to hear them.

As a note, 3.2 tests as being more than 18% faster in the above categories.

Chris
Re: [Question] TCP stack performance decrease since 3.14
On 04/15/2015 12:31 PM, rapier wrote:

All,

First, my apologies if this came up previously but I couldn't find anything using a keyword search of the mailing list archive.

As part of the ongoing work with web10g I need to come up with baseline TCP stack performance for various kernel revisions. Using netperf and super_netperf* I've found that performance for TCP_CC, TCP_RR, and TCP_CRR has decreased since 3.14.

         3.14     3.18     4.0      decrease %
TCP_CC   183945   179222   175793   4.4%
TCP_RR   594495   585484   561365   5.6%
TCP_CRR  98677    96726    93026    5.7%

Stream tests have remained the same from 3.14 through 4.0.

Have the service demands (usec of CPU consumed per KB) remained the same on the stream tests? Even then, stateless offloads can help hide a multitude of path-length sins.

All tests were conducted on the same platform from clean boot with stock kernels.

So my questions are:

Has anyone else seen this or is this a result of some weirdness on my system or artifact of my tests?

I've wondered if such a thing might be taking place but never had a chance to check. One thing you might consider is "perf" profiling to see how the CPU consumption break-down has changed.

happy benchmarking,

rick jones

If others have seen this or is just simply to be expected (from new features and the like) is it due to the TCP stack itself or other changes in the kernel? If so, is there any way to mitigate the effect of this via stack tuning, kernel configuration, etc?

Thanks!

Chris

* The above results are the average of 10 iterations of super_netperf for each test. I can run more iterations to verify the results but it seems consistent. The number of parallel processes for each test was tuned to produce the maximum test result. In other words, enough to push things but not enough to cause performance hits due to being cpu/memory/etc bound. If anyone wants the full results and test scripts just let me know.
Re: [PATCH v2 net] bnx2x: Fix busy_poll vs netpoll
From: Eric Dumazet
Date: Tue, 14 Apr 2015 18:45:00 -0700

> From: Eric Dumazet
>
> Commit 9a2620c877454 ("bnx2x: prevent WARN during driver unload")
> switched the napi/busy_lock locking mechanism from spin_lock() into
> spin_lock_bh(), breaking inter-operability with netconsole, as netpoll
> disables interrupts prior to calling our napi mechanism.
>
> This switches the driver into using atomic assignments instead of the
> spinlock mechanisms previously employed.
>
> Based on initial patch from Yuval Mintz & Ariel Elior
>
> I basically added softirq starvation avoidance, and mixture
> of atomic operations, plain writes and barriers.
>
> Note this slightly reduces the overhead for this driver when no
> busy_poll sockets are in use.
>
> Fixes: 9a2620c877454 ("bnx2x: prevent WARN during driver unload")
> Signed-off-by: Eric Dumazet

Applied and queued up for -stable, thanks Eric.
Re: [patch v2] net: hip04: Make tx coalesce timer actually work
From: Thomas Gleixner
Date: Tue, 14 Apr 2015 21:42:42 +0200 (CEST)

> The code sets the expiry value of the timer to a relative value and
> starts it with hrtimer_start_expires. That's fine, but that only works
> once. The timer is started in relative mode, so the expiry value gets
> overwritten with the absolute expiry time (now + expiry).
>
> So once the timer expired, a new call to hrtimer_start_expires results
> in an immediately expired timer, because the expiry value is
> already in the past.
>
> Use the proper mechanisms to (re)start the timer in the intended way.
>
> Signed-off-by: Thomas Gleixner

Applied, thanks Thomas.
Re: [RFC PATCH] bgmac: Fix build error seen if BCM47XX is not configured
From: Rafał Miłecki
Date: Wed, 15 Apr 2015 22:21:49 +0200

> I guess the decision depends on
> a) time needed for David to revert fc300dc & send pull request
> vs.
> b) time needed for Ralf to send pull request

Let's just wait for Ralf's stuff to hit Linus's tree.
CodingStyle parenthesis alignment: was: Re: align to open paren
On Wed, 2015-04-15 at 20:34 +, Hubbe, Allen wrote:
> Hello Andy, Joe,

Hello Allen.

As this is a discussion better suited for linux development lists I've cc'd LKML and netdev.

> I would like to find the origin of the decision to align to the open
> paren in Linux.

Mostly it's a style decision shared by net/ and drivers/net/ and a few other directories.

It's a checkpatch --strict only test so it's not on by default except in net/ and drivers/net/.

> I found the commit where this was added a few years ago, d1fe9c0.
> The email thread just says the style should be that way, but not why
> or how the decision was made. The how and the why is what I'm after,
> since I have a critique of the chosen style.
>
> I realize there is a lot of code written using this style, and
> changing it would be disruptive. I certainly would not ask for that.
>
> Indenting to the open paren might cause ambiguous indentation between
> the parenthesized expression and the next logical thing. In the
> following, next_thing aligns to the same column as baz, even though
> baz is part of the condition expression, and next_thing is the
> continued statement.
>
> = if (foo(bar,
> =         baz))
> =         next_thing();
>
> It would be necessary to reindent to maintain style, even though the
> code of the next lines is the same. This has consequences like
> changing the blame, for instance. In the following, 4 + 5 is the bug,
> but whoever replaced the global with an instance variable gets the
> blame.

blame is overrated. git blame -w ignores most of the whitespace noise.

> = global_variable = foo(bar,
> =                       baz(1 + 3),
> =                       baz(4 + 5) + 6);
>
> with
>
> = obj->var = foo(bar,
> =                baz(1 + 3),
> =                baz(4 + 5) + 6);
>
> I'm used to the default in many editors, which is to indent twice
> following each open paren, as opposed to once following each open
> brace or continued statement. It is a simpler rule for automatic
> formatting to implement. It also avoids mixing tabs and spaces, the
> spacing is unambiguous, and to maintain style, there is no need to
> re-indent lines following an edit if the position of the open paren
> changes.
>
> It's interesting to me that this style is only enforced by
> checkpatch.pl --strict. It is not in Documents/CodingStyle. That
> being the case, would it be acceptable to relax the rule in
> checkpatch.pl to accept either style? If not, well, I'll just accept
> the chosen style and fix my code.

I personally don't care much either way.

> If the following looks acceptable to you, I'll submit the patch to the
> list.

The patch most likely wouldn't be appropriate for net/ and drivers/net/ where the developers seem to strongly prefer alignment to open parenthesis.

Also I think the open parenthesis count isn't right as it would ask for multiple indents for code like:

	if ((foo(bar)) &&
	    (baz(bar))) {

I think checkpatch would now want:

	if ((foo(bar)) &&
			(baz(bar))) {

and the --fix option would be wrong too.

cheers, Joe

> Thanks,
> Allen
>
> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> index d124359..8e49125 100755
> --- a/scripts/checkpatch.pl
> +++ b/scripts/checkpatch.pl
> @@ -1834,6 +1834,15 @@ sub pos_last_openparen {
>  	return length(expand_tabs(substr($line, 0, $last_openparen))) + 1;
>  }
>
> +sub count_openparen {
> +	my ($line) = @_;
> +
> +	my $opens = $line =~ tr/\(/\(/;
> +	my $closes = $line =~ tr/\)/\)/;
> +
> +	return $opens - $closes;
> +}
> +
>  sub process {
>  	my $filename = shift;
>
> @@ -2539,11 +2548,16 @@ sub process {
>  						" " x ($pos % 8);
>  					my $goodspaceindent = $oldindent . " " x $pos;
>
> +					my $goodtwotabindent = $oldindent .
> +						"\t\t" x count_openparen($rest);
> +
>  					if ($newindent ne $goodtabindent &&
> -					    $newindent ne $goodspaceindent) {
> +					    $newindent ne $goodspaceindent &&
> +					    $newindent ne $goodtwotabindent) {
>
>  						if (CHK("PARENTHESIS_ALIGNMENT",
> -							"Alignment should match open parenthesis\n" . $hereprev) &&
> +							"Alignment should match open parenthesis, " .
> +							"or be twice indented for each open parenthesis\n" . $hereprev) &&
>  							$fix && $line =~ /^\+/) {
>  							$fixed[$fixlinenr] =~ s/^\+[ \t]*/\+$goodtabindent/;
Re: [Question] TCP stack performance decrease since 3.14
On Wed, 2015-04-15 at 15:31 -0400, rapier wrote:
> All,
>
> First, my apologies if this came up previously but I couldn't find
> anything using a keyword search of the mailing list archive.
>
> As part of the ongoing work with web10g I need to come up with baseline
> TCP stack performance for various kernel revisions. Using netperf and
> super_netperf* I've found that performance for TCP_CC, TCP_RR, and
> TCP_CRR has decreased since 3.14.
>
>          3.14     3.18     4.0      decrease %
> TCP_CC   183945   179222   175793   4.4%
> TCP_RR   594495   585484   561365   5.6%
> TCP_CRR  98677    96726    93026    5.7%
>
> Stream tests have remained the same from 3.14 through 4.0.
>
> All tests were conducted on the same platform from clean boot with stock
> kernels.
>
> So my questions are:
>
> Has anyone else seen this or is this a result of some weirdness on my
> system or artifact of my tests?
>
> If others have seen this or is just simply to be expected (from new
> features and the like) is it due to the TCP stack itself or other
> changes in the kernel?
>
> If so, is there any way to mitigate the effect of this via stack tuning,
> kernel configuration, etc?
>
> Thanks!
>
> Chris
>
> * The above results are the average of 10 iterations of super_netperf
> for each test. I can run more iterations to verify the results but it
> seems consistent. The number of parallel processes for each test was
> tuned to produce the maximum test result. In other words, enough to push
> things but not enough to cause performance hits due to being
> cpu/memory/etc bound. If anyone wants the full results and test scripts
> just let me know.

Make sure you do not hit a c-state issue.

I've seen improvements in the stack translate to longer wait times, and cpu takes longer to exit deep c-state.
[PATCH net 2/5] stmmac: Add defines and documentation for enabling flow control
Add defines and documentation for enabling flow control on the stmmac. Flow control was not implemented correctly on the stmmac driver and is currently non-functional as a result. This is the first in a series of small patches to correctly implement this feature.

Signed-off-by: Vince Bridgers
---
 drivers/net/ethernet/stmicro/stmmac/dwmac1000.h | 51 +
 1 file changed, 51 insertions(+)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
index 64d8f56..b3fe057 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
@@ -172,6 +172,7 @@ enum inter_frame_gap {
 /* GMAC FLOW CTRL defines */
 #define GMAC_FLOW_CTRL_PT_MASK	0x	/* Pause Time Mask */
 #define GMAC_FLOW_CTRL_PT_SHIFT	16
+#define GMAC_FLOW_CTRL_UP	0x0008	/* Unicast pause frame enable */
 #define GMAC_FLOW_CTRL_RFE	0x0004	/* Rx Flow Control Enable */
 #define GMAC_FLOW_CTRL_TFE	0x0002	/* Tx Flow Control Enable */
 #define GMAC_FLOW_CTRL_FCB_BPA	0x0001	/* Flow Control Busy ... */
@@ -246,6 +247,56 @@ enum ttc_control {
 #define DMA_CONTROL_FEF	0x0080
 #define DMA_CONTROL_FUF	0x0040
 
+/* Receive flow control activation field
+ * RFA field in DMA control register, bits 23,10:9
+ */
+#define DMA_CONTROL_RFA_MASK	0x00800600
+
+/* Receive flow control deactivation field
+ * RFD field in DMA control register, bits 22,12:11
+ */
+#define DMA_CONTROL_RFD_MASK	0x00401800
+
+/* RFD and RFA fields are encoded as follows
+ *
+ *   Bit Field
+ *   0,00 - Full minus 1KB (only valid when rxfifo >= 4KB and EFC enabled)
+ *   0,01 - Full minus 2KB (only valid when rxfifo >= 4KB and EFC enabled)
+ *   0,10 - Full minus 3KB (only valid when rxfifo >= 4KB and EFC enabled)
+ *   0,11 - Full minus 4KB (only valid when rxfifo > 4KB and EFC enabled)
+ *   1,00 - Full minus 5KB (only valid when rxfifo > 8KB and EFC enabled)
+ *   1,01 - Full minus 6KB (only valid when rxfifo > 8KB and EFC enabled)
+ *   1,10 - Full minus 7KB (only valid when rxfifo > 8KB and EFC enabled)
+ *   1,11 - Reserved
+ *
+ * RFD should always be > RFA for a given FIFO size. RFD == RFA may work,
+ * but packet throughput performance may not be as expected.
+ *
+ * Be sure that bit 3 in GMAC Register 6 is set for Unicast Pause frame
+ * detection (IEEE Specification Requirement, Annex 31B, 31B.1, Pause
+ * Description).
+ *
+ * Be sure that DZPA (bit 7 in Flow Control Register, GMAC Register 6),
+ * is set to 0. This allows pause frames with a quanta of 0 to be sent
+ * as an XOFF message to the link peer.
+ */
+
+#define RFA_FULL_MINUS_1K	0x
+#define RFA_FULL_MINUS_2K	0x0200
+#define RFA_FULL_MINUS_3K	0x0400
+#define RFA_FULL_MINUS_4K	0x0600
+#define RFA_FULL_MINUS_5K	0x0080
+#define RFA_FULL_MINUS_6K	0x00800200
+#define RFA_FULL_MINUS_7K	0x00800400
+
+#define RFD_FULL_MINUS_1K	0x
+#define RFD_FULL_MINUS_2K	0x0800
+#define RFD_FULL_MINUS_3K	0x1000
+#define RFD_FULL_MINUS_4K	0x1800
+#define RFD_FULL_MINUS_5K	0x0040
+#define RFD_FULL_MINUS_6K	0x00400800
+#define RFD_FULL_MINUS_7K	0x00401000
+
 enum rtc_control {
 	DMA_CONTROL_RTC_64	= 0x,
 	DMA_CONTROL_RTC_32	= 0x0008,
-- 
1.9.1
[PATCH net 1/5] stmmac: Add properties for transmit and receive fifo sizes
The Synopsys stmmac fifo sizes are configurable, and need to be known in order to configure certain controller features. This patch adds tx-fifo-depth and rx-fifo-depth properties to the stmmac document file.

Signed-off-by: Vince Bridgers
---
 Documentation/devicetree/bindings/net/ethernet.txt | 6 ++
 Documentation/devicetree/bindings/net/stmmac.txt   | 4
 2 files changed, 10 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/ethernet.txt b/Documentation/devicetree/bindings/net/ethernet.txt
index 3fc3605..41b3f3f 100644
--- a/Documentation/devicetree/bindings/net/ethernet.txt
+++ b/Documentation/devicetree/bindings/net/ethernet.txt
@@ -19,6 +19,12 @@ The following properties are common to the Ethernet controllers:
 - phy: the same as "phy-handle" property, not recommended for new bindings.
 - phy-device: the same as "phy-handle" property, not recommended for new
   bindings.
+- rx-fifo-depth: the size of the controller's receive fifo in bytes. This
+  is used for components that can have configurable receive fifo sizes,
+  and is useful for determining certain configuration settings such as
+  flow control thresholds.
+- tx-fifo-depth: the size of the controller's transmit fifo in bytes. This
+  is used for components that can have configurable fifo sizes.
 
 Child nodes of the Ethernet controller are typically the individual PHY devices
 connected via the MDIO bus (sometimes the MDIO bus controller is separate).
diff --git a/Documentation/devicetree/bindings/net/stmmac.txt b/Documentation/devicetree/bindings/net/stmmac.txt
index 8ca65ce..524c185 100644
--- a/Documentation/devicetree/bindings/net/stmmac.txt
+++ b/Documentation/devicetree/bindings/net/stmmac.txt
@@ -44,6 +44,8 @@ Optional properties:
   If not passed then the system clock will be used and this is fine on some
   platforms.
 - snps,burst_len: The AXI burst lenth value of the AXI BUS MODE register.
+- tx-fifo-depth: See ethernet.txt file in the same directory
+- rx-fifo-depth: See ethernet.txt file in the same directory
 
 Examples:
 
@@ -58,6 +60,8 @@ Examples:
 		phy-mode = "gmii";
 		snps,multicast-filter-bins = <256>;
 		snps,perfect-filter-entries = <128>;
+		rx-fifo-depth = <16384>;
+		tx-fifo-depth = <16384>;
 		clocks = <&clock>;
 		clock-names = "stmmaceth";
 	};
-- 
1.9.1
[PATCH net 4/5] stmmac: Enable unicast pause frame detect in GMAC Register 6
Unicast pause frame detect was not being enabled for the Synopsys stmmac. This patch sets Unicast pause frame detect in MAC register 6 so that pause frame detection by the stmmac conforms to IEEE 802.3, Annex 31B.3.3 Receive Operation - specifically, a MAC shall respond to pause frames containing either the reserved multicast address or the unique physical address associated with this station.

Signed-off-by: Vince Bridgers
---
 drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
index 0adcf73..371a669 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
@@ -201,7 +201,10 @@ static void dwmac1000_flow_ctrl(struct mac_device_info *hw, unsigned int duplex,
 				unsigned int fc, unsigned int pause_time)
 {
 	void __iomem *ioaddr = hw->pcsr;
-	unsigned int flow = 0;
+	/* Set flow such that DZPQ in Mac Register 6 is 0,
+	 * and unicast pause detect is enabled.
+	 */
+	unsigned int flow = GMAC_FLOW_CTRL_UP;
 
 	pr_debug("GMAC Flow-Control:\n");
 	if (fc & FLOW_RX) {
-- 
1.9.1
Re: [RFC PATCH] bgmac: Fix build error seen if BCM47XX is not configured
On Wed, Apr 15, 2015 at 10:21:49PM +0200, Rafał Miłecki wrote: > On 15 April 2015 at 22:05, Guenter Roeck wrote: > > arm:allmodconfig fails to build as follows since ARCH_BCM_5301X > > is configured but not BCM47XX. > > > > drivers/net/ethernet/broadcom/bgmac.c: In function 'bgmac_probe': > > drivers/net/ethernet/broadcom/bgmac.c:1643:2: error: > > implicit declaration of function 'bcm47xx_nvram_getenv' > > > > Fixes: fc300dc3733f ("bgmac: allow enabling on ARCH_BCM_5301X") > > Cc: Rafał Miłecki > > Signed-off-by: Guenter Roeck > > --- > > Seen in today's upstream kernel. > > > > I don't like this fix too much (I think it is quite kludgy), > > so I marked it RFC (and please don't beat the messenger ;-). > > Ooh great, I totally forgot about this :| > > The problem is that fc300dc (bgmac: allow enabling on ARCH_BCM_5301X) > [0] shouldn't really be sent (by me) in the first place. This is > because it depends on 138173d (MIPS: BCM47xx: Move NVRAM header to the > include/linux/.) [1] which isn't in Linus's tree yet. > > So there are two solutions: > 1) Revert fc300dc, wait for 138173d and re-apply fc300dc > 2) Wait for 138173d with this build breakage > > I guess the decisions depends on > a) time needed for David to revert fc300dc & send pull request > vs. > b) time needed for Ralf to send pull request > > David, Ralf, what do you think about this? > Since the fix is in the queue, maybe we can live with the breakage for a few days ? Your solution is definitely _much_ better than my hack. Guenter -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 3/5] stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree
Read the tx-fifo-depth and rx-fifo-depth from the devicetree. The Synopsys stmmac controller fifos are configurable per product instance, and the fifo sizes are needed to configure certain features correctly such as flow control. Signed-off-by: Vince Bridgers --- drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 4 include/linux/stmmac.h| 2 ++ 2 files changed, 6 insertions(+) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c index f9b42f1..705bbdf 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c @@ -181,6 +181,10 @@ static int stmmac_probe_config_dt(struct platform_device *pdev, sizeof(struct stmmac_mdio_bus_data), GFP_KERNEL); + of_property_read_u32(np, "tx-fifo-depth", &plat->tx_fifo_size); + + of_property_read_u32(np, "rx-fifo-depth", &plat->rx_fifo_size); + plat->force_sf_dma_mode = of_property_read_bool(np, "snps,force_sf_dma_mode"); diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h index cd63851..7f484a2 100644 --- a/include/linux/stmmac.h +++ b/include/linux/stmmac.h @@ -114,6 +114,8 @@ struct plat_stmmacenet_data { int maxmtu; int multicast_filter_bins; int unicast_filter_entries; + int tx_fifo_size; + int rx_fifo_size; void (*fix_mac_speed)(void *priv, unsigned int speed); void (*bus_setup)(void __iomem *ioaddr); void *(*setup)(struct platform_device *pdev); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 5/5] stmmac: Configure Flow Control to work correctly based on rxfifo size
Configure flow control correctly, and based on the receive fifo size read as a property from the devicetree since the Synopsys stmmac fifo sizes are configurable based on a particular chip's implementation. This patch maintains the previous incorrect behavior unless the receive fifo size is found in the devicetree. Signed-off-by: Vince Bridgers --- drivers/net/ethernet/stmicro/stmmac/common.h | 5 +++-- .../net/ethernet/stmicro/stmmac/dwmac1000_dma.c| 26 +- drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c | 2 +- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 16 - 4 files changed, 40 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h index cd77289..623c6ed 100644 --- a/drivers/net/ethernet/stmicro/stmmac/common.h +++ b/drivers/net/ethernet/stmicro/stmmac/common.h @@ -150,7 +150,7 @@ struct stmmac_extra_stats { #defineMAC_CSR_H_FRQ_MASK 0x20 #define HASH_TABLE_SIZE 64 -#define PAUSE_TIME 0x200 +#define PAUSE_TIME 0x /* Flow Control defines */ #define FLOW_OFF 0 @@ -357,7 +357,8 @@ struct stmmac_dma_ops { void (*dump_regs) (void __iomem *ioaddr); /* Set tx/rx threshold in the csr6 register * An invalid value enables the store-and-forward mode */ - void (*dma_mode) (void __iomem *ioaddr, int txmode, int rxmode); + void (*dma_mode)(void __iomem *ioaddr, int txmode, int rxmode, +int rxfifosz); /* To track extra statistic (if supported) */ void (*dma_diagnostic_fr) (void *data, struct stmmac_extra_stats *x, void __iomem *ioaddr); diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c index 59d92e8..0e8937c 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c @@ -106,8 +106,29 @@ static int dwmac1000_dma_init(void __iomem *ioaddr, int pbl, int fb, int mb, return 0; } +static u32 dwmac1000_configure_fc(u32 csr6, int rxfifosz) +{ + csr6 &= 
~DMA_CONTROL_RFA_MASK; + csr6 &= ~DMA_CONTROL_RFD_MASK; + + /* Leave flow control disabled if receive fifo size is less than +* 4K or 0. Otherwise, send XOFF when fifo is 1K less than full, +* and send XON when 2K less than full. +*/ + if (rxfifosz < 4096) { + csr6 &= ~DMA_CONTROL_EFC; + pr_debug("GMAC: disabling flow control, rxfifo too small(%d)\n", +rxfifosz); + } else { + csr6 |= DMA_CONTROL_EFC; + csr6 |= RFA_FULL_MINUS_1K; + csr6 |= RFD_FULL_MINUS_2K; + } + return csr6; +} + static void dwmac1000_dma_operation_mode(void __iomem *ioaddr, int txmode, -int rxmode) +int rxmode, int rxfifosz) { u32 csr6 = readl(ioaddr + DMA_CONTROL); @@ -153,6 +174,9 @@ static void dwmac1000_dma_operation_mode(void __iomem *ioaddr, int txmode, csr6 |= DMA_CONTROL_RTC_128; } + /* Configure flow control based on rx fifo size */ + csr6 = dwmac1000_configure_fc(csr6, rxfifosz); + writel(csr6, ioaddr + DMA_CONTROL); } diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c index 7d1dce9..9d0971c 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c @@ -72,7 +72,7 @@ static int dwmac100_dma_init(void __iomem *ioaddr, int pbl, int fb, int mb, * control register. 
*/ static void dwmac100_dma_operation_mode(void __iomem *ioaddr, int txmode, - int rxmode) + int rxmode, int rxfifosz) { u32 csr6 = readl(ioaddr + DMA_CONTROL); diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index a0ea84f..73d104f 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -1277,8 +1277,10 @@ static void free_dma_desc_resources(struct stmmac_priv *priv) */ static void stmmac_dma_operation_mode(struct stmmac_priv *priv) { + int rxfifosz = priv->plat->rx_fifo_size; + if (priv->plat->force_thresh_dma_mode) - priv->hw->dma->dma_mode(priv->ioaddr, tc, tc); + priv->hw->dma->dma_mode(priv->ioaddr, tc, tc, rxfifosz); else if (priv->plat->force_sf_dma_mode || priv->plat->tx_coe) { /* * In case of GMAC, SF mode can be enabled @@ -1287,10 +1289,12 @@ static void stmmac_dma_operation_mode(struct stmmac_priv *priv) * 2) There is no bugged Jumbo frame support
[PATCH net 0/5] stmmac: Correct flow control configuration
This series of patches corrects flow control configuration for the Synopsys GMAC driver. Flow control is configured based on a configurable receive fifo size. If less than 4Kbytes flow control is left disabled and a warning is presented. If a receive fifo size is not specified, flow control is left disabled to maintain current behavior. Unicast pause detection was disabled, but is now enabled. The pause time was modified to be maximum time per a XON/XOFF flow control mode of operation. This patch was tested on an Altera Cyclone 5 and an Altera Arria 10 devkit, and verified that flow control operates as expected when enabled. Please consider this series for inclusion so that flow control will function as expected for the Synopsys GMAC controller. Vince Bridgers (5): stmmac: Add properties for transmit and receive fifo sizes stmmac: Add defines and documentation for enabling flow control stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree stmmac: Enable unicast pause frame detect in GMAC Register 6 stmmac: Configure Flow Control to work correctly based on rxfifo size Documentation/devicetree/bindings/net/ethernet.txt | 6 +++ Documentation/devicetree/bindings/net/stmmac.txt | 4 ++ drivers/net/ethernet/stmicro/stmmac/common.h | 5 ++- drivers/net/ethernet/stmicro/stmmac/dwmac1000.h| 51 ++ .../net/ethernet/stmicro/stmmac/dwmac1000_core.c | 5 ++- .../net/ethernet/stmicro/stmmac/dwmac1000_dma.c| 26 ++- drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c | 2 +- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 16 --- .../net/ethernet/stmicro/stmmac/stmmac_platform.c | 4 ++ include/linux/stmmac.h | 2 + 10 files changed, 111 insertions(+), 10 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Question] TCP stack performance decrease since 3.14
All,

First, my apologies if this came up previously, but I couldn't find anything using a keyword search of the mailing list archive.

As part of the ongoing work with web10g I need to come up with baseline TCP stack performance for various kernel revisions. Using netperf and super_netperf* I've found that performance for TCP_CC, TCP_RR, and TCP_CRR has decreased since 3.14.

          3.14     3.18     4.0      decrease %
TCP_CC    183945   179222   175793   4.4%
TCP_RR    594495   585484   561365   5.6%
TCP_CRR   98677    96726    93026    5.7%

Stream tests have remained the same from 3.14 through 4.0. All tests were conducted on the same platform from clean boot with stock kernels.

So my questions are: Has anyone else seen this, or is this a result of some weirdness on my system or an artifact of my tests? If others have seen this, or it is simply to be expected (from new features and the like), is it due to the TCP stack itself or other changes in the kernel? If so, is there any way to mitigate the effect of this via stack tuning, kernel configuration, etc?

Thanks!

Chris

* The above results are the average of 10 iterations of super_netperf for each test. I can run more iterations to verify the results but it seems consistent. The number of parallel processes for each test was tuned to produce the maximum test result. In other words, enough to push things but not enough to cause performance hits due to being cpu/memory/etc bound. If anyone wants the full results and test scripts just let me know.
Re: [RFC PATCH] bgmac: Fix build error seen if BCM47XX is not configured
On 15 April 2015 at 22:05, Guenter Roeck wrote: > arm:allmodconfig fails to build as follows since ARCH_BCM_5301X > is configured but not BCM47XX. > > drivers/net/ethernet/broadcom/bgmac.c: In function 'bgmac_probe': > drivers/net/ethernet/broadcom/bgmac.c:1643:2: error: > implicit declaration of function 'bcm47xx_nvram_getenv' > > Fixes: fc300dc3733f ("bgmac: allow enabling on ARCH_BCM_5301X") > Cc: Rafał Miłecki > Signed-off-by: Guenter Roeck > --- > Seen in today's upstream kernel. > > I don't like this fix too much (I think it is quite kludgy), > so I marked it RFC (and please don't beat the messenger ;-). Ooh great, I totally forgot about this :| The problem is that fc300dc (bgmac: allow enabling on ARCH_BCM_5301X) [0] shouldn't really be sent (by me) in the first place. This is because it depends on 138173d (MIPS: BCM47xx: Move NVRAM header to the include/linux/.) [1] which isn't in Linus's tree yet. So there are two solutions: 1) Revert fc300dc, wait for 138173d and re-apply fc300dc 2) Wait for 138173d with this build breakage I guess the decisions depends on a) time needed for David to revert fc300dc & send pull request vs. b) time needed for Ralf to send pull request David, Ralf, what do you think about this? Sorry for causing this problem :| [0] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=fc300dc3733fdc328e6e10c7b8379b60c26cd648 [1] http://git.linux-mips.org/cgit/ralf/upstream-sfr.git/commit/?id=138173d4e826587da66c7d321da1a91283222536 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 11:32 AM, Eric Dumazet wrote: On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote: Well, I'm not sure that it is George and Jonathan themselves who don't want to change a sysctl, but the customers who would have to tweak that in their VMs? Keep in mind some VM users install custom qdisc, or even custom TCP sysctls. That could very well be, though I confess I've not seen that happening in my little corner of the cloud. They tend to want to launch the VM and go. Some of the more advanced/sophisticated ones might tweak a few things but my (admittedly limited) experience has been they are few in number. They just expect it to work "out of the box" (to the extent one can use that phrase still). It's kind of ironic - go back to the (early) 1990s when NICs generated a completion interrupt for every individual tx completion (and incoming packet) and all everyone wanted to do was coalesce/avoid interrupts. I guess that has gone rather far. And today to fight bufferbloat TCP gets tweaked to favor quick tx completions. Call it cycles, or pendulums or whatever I guess. I wonder just how consistent tx completion timings are for a VM so a virtio_net or whatnot in the VM can pick a per-device setting to advertise to TCP? Hopefully, full NIC emulation is no longer a thing and VMs "universally" use a virtual NIC interface. At least in my little corner of the cloud, emulated NICs are gone, and good riddance. rick -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH] bgmac: Fix build error seen if BCM47XX is not configured
arm:allmodconfig fails to build as follows since ARCH_BCM_5301X is configured but not BCM47XX. drivers/net/ethernet/broadcom/bgmac.c: In function 'bgmac_probe': drivers/net/ethernet/broadcom/bgmac.c:1643:2: error: implicit declaration of function 'bcm47xx_nvram_getenv' Fixes: fc300dc3733f ("bgmac: allow enabling on ARCH_BCM_5301X") Cc: Rafał Miłecki Signed-off-by: Guenter Roeck --- Seen in today's upstream kernel. I don't like this fix too much (I think it is quite kludgy), so I marked it RFC (and please don't beat the messenger ;-). drivers/net/ethernet/broadcom/bgmac.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c index 5cb93d1f50a4..dcf27a7c1836 100644 --- a/drivers/net/ethernet/broadcom/bgmac.c +++ b/drivers/net/ethernet/broadcom/bgmac.c @@ -17,8 +17,21 @@ #include #include #include + +#ifdef CONFIG_BCM47XX #include +static int __bcm47xx_nvram_getenv(const char *name, char *val, size_t len) +{ + return bcm47xx_nvram_getenv(name, val, len); +} +#else +static int __bcm47xx_nvram_getenv(const char *name, char *val, size_t len) +{ + return -ENOENT; +} +#endif + static const struct bcma_device_id bgmac_bcma_tbl[] = { BCMA_CORE(BCMA_MANUF_BCM, BCMA_CORE_4706_MAC_GBIT, BCMA_ANY_REV, BCMA_ANY_CLASS), BCMA_CORE(BCMA_MANUF_BCM, BCMA_CORE_MAC_GBIT, BCMA_ANY_REV, BCMA_ANY_CLASS), @@ -1070,7 +1083,7 @@ static void bgmac_chip_reset(struct bgmac *bgmac) BGMAC_CHIPCTL_1_IF_TYPE_MII; char buf[4]; - if (bcm47xx_nvram_getenv("et_swtype", buf, sizeof(buf)) > 0) { + if (__bcm47xx_nvram_getenv("et_swtype", buf, sizeof(buf)) > 0) { if (kstrtou8(buf, 0, &et_swtype)) bgmac_err(bgmac, "Failed to parse et_swtype (%s)\n", buf); @@ -1635,7 +1648,7 @@ static int bgmac_probe(struct bcma_device *core) } bgmac->int_mask = BGMAC_IS_ERRMASK | BGMAC_IS_RX | BGMAC_IS_TX_MASK; - if (bcm47xx_nvram_getenv("et0_no_txint", NULL, 0) == 0) + if (__bcm47xx_nvram_getenv("et0_no_txint", NULL, 0) == 0) 
bgmac->int_mask &= ~BGMAC_IS_TX_MASK; /* TODO: reset the external phy. Specs are needed */ -- 2.1.0
[PATCH net] bpf: fix bpf helpers to use skb->mac_header relative offsets
For the short-term solution, lets fix bpf helper functions to use skb->mac_header relative offsets instead of skb->data in order to get the same eBPF programs with cls_bpf and act_bpf work on ingress and egress qdisc path. We need to ensure that mac_header is set before calling into programs. This is effectively the first option from below referenced discussion. More long term solution for LD_ABS|LD_IND instructions will be more intrusive but also more beneficial than this, and implemented later as it's too risky at this point in time. I.e., we plan to look into the option of moving skb_pull() out of eth_type_trans() and into netif_receive_skb() as has been suggested as second option. Meanwhile, this solution ensures ingress can be used with eBPF, too, and that we won't run into ABI troubles later. For dealing with negative offsets inside eBPF helper functions, we've implemented bpf_skb_clone_unwritable() to test for unwriteable headers. Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694 Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action") Fixes: 91bc4822c3d6 ("tc: bpf: add checksum helpers") Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann --- include/uapi/linux/bpf.h|2 +- include/uapi/linux/filter.h |7 +-- net/core/filter.c | 41 - net/sched/act_bpf.c |3 +++ net/sched/cls_bpf.c |3 +++ samples/bpf/tcbpf1_kern.c | 16 ++-- 6 files changed, 50 insertions(+), 22 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 5c1cee11f777..a9ebdf5701e8 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -177,7 +177,7 @@ enum bpf_func_id { /** * skb_store_bytes(skb, offset, from, len, flags) - store bytes into packet * @skb: pointer to skb -* @offset: offset within packet from skb->data +* @offset: offset within packet from skb->mac_header * @from: pointer where to copy bytes from * @len: number of bytes to store into packet * @flags: bit 0 - if true, recompute skb->csum diff --git 
a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h index 34c7936ca114..c97340e43dd6 100644 --- a/include/uapi/linux/filter.h +++ b/include/uapi/linux/filter.h @@ -79,8 +79,11 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ #define SKF_AD_RANDOM 56 #define SKF_AD_VLAN_TPID 60 #define SKF_AD_MAX 64 -#define SKF_NET_OFF (-0x10) -#define SKF_LL_OFF(-0x20) +#define SKF_NET_OFF(-0x10) +#define SKF_LL_OFF (-0x20) + +#define BPF_NET_OFFSKF_NET_OFF +#define BPF_LL_OFF SKF_LL_OFF #endif /* _UAPI__LINUX_FILTER_H__ */ diff --git a/net/core/filter.c b/net/core/filter.c index b669e75d2b36..bf831a85c315 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1175,12 +1175,27 @@ int sk_attach_bpf(u32 ufd, struct sock *sk) return 0; } +/** + * bpf_skb_clone_not_writable - is the header of a clone not writable + * @skb: buffer to check + * @len: length up to which to write, can be negative + * + * Returns true if modifying the header part of the cloned buffer + * does require the data to be copied. I.e. this version works with + * negative lengths needed for eBPF case! 
+ */ +static bool bpf_skb_clone_unwritable(const struct sk_buff *skb, int len) +{ + return skb_header_cloned(skb) || + (int) skb_headroom(skb) + len > skb->hdr_len; +} + #define BPF_RECOMPUTE_CSUM(flags) ((flags) & 1) static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, u64 r4, u64 flags) { struct sk_buff *skb = (struct sk_buff *) (long) r1; - unsigned int offset = (unsigned int) r2; + int offset = (int) r2; void *from = (void *) (long) r3; unsigned int len = (unsigned int) r4; char buf[16]; @@ -1194,10 +1209,12 @@ static u64 bpf_skb_store_bytes(u64 r1, u64 r2, u64 r3, u64 r4, u64 flags) * * so check for invalid 'offset' and too large 'len' */ - if (unlikely(offset > 0x || len > sizeof(buf))) + if (unlikely((u32) offset > 0x || len > sizeof(buf))) return -EFAULT; - if (skb_cloned(skb) && !skb_clone_writable(skb, offset + len)) + offset -= skb->data - skb_mac_header(skb); + if (unlikely(skb_cloned(skb) && +bpf_skb_clone_unwritable(skb, offset + len))) return -EFAULT; ptr = skb_header_pointer(skb, offset, len, buf); @@ -1232,15 +1249,18 @@ const struct bpf_func_proto bpf_skb_store_bytes_proto = { #define BPF_HEADER_FIELD_SIZE(flags) ((flags) & 0x0f) #define BPF_IS_PSEUDO_HEADER(flags)((flags) & 0x10) -static u64 bpf_l3_csum_replace(u64 r1, u64 offset, u64 from, u64 to, u64 flags) +static u64 bpf_l3_csum_replace(u64 r1, u64 r2, u64 from, u64 to, u64 flags) { struct sk_buff *skb = (stru
Re: [PATCH iproute2 -next] tc: built-in eBPF exec proxy
On Wed, Apr 15, 2015, at 16:52, Daniel Borkmann wrote: > This work follows upon commit 6256f8c9e45f ("tc, bpf: finalize eBPF > support for cls and act front-end") and takes up the idea proposed by > Hannes Frederic Sowa to spawn a shell (or any other command) that holds > generated eBPF map file descriptors. > > File descriptors, based on their id, are being fetched from the same > unix domain socket as demonstrated in the bpf_agent, the shell spawned > via execvpe(2) and the map fds passed over the environment, and thus > are made available to applications in the fashion of std{in,out,err} > for read/write access, for example in case of iproute2's examples/bpf/: > > # env | grep BPF > BPF_NUM_MAPS=3 > BPF_MAP1=6<- BPF_MAP_ID_QUEUE (id 1) > BPF_MAP0=5<- BPF_MAP_ID_PROTO (id 0) > BPF_MAP2=7<- BPF_MAP_ID_DROPS (id 2) > > # ls -la /proc/self/fd > [...] > lrwx--. 1 root root 64 Apr 14 16:46 0 -> /dev/pts/4 > lrwx--. 1 root root 64 Apr 14 16:46 1 -> /dev/pts/4 > lrwx--. 1 root root 64 Apr 14 16:46 2 -> /dev/pts/4 > [...] > lrwx--. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map > lrwx--. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map > lrwx--. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map > > The advantage (as opposed to the direct/native usage) is that now the > shell is map fd owner and applications can terminate and easily reattach > to descriptors w/o any kernel changes. Moreover, multiple applications > can easily read/write eBPF maps simultaneously. > > To further allow users for experimenting with that, next step is to add > a small helper that can get along with simple data types, so that also > shell scripts can make use of bpf syscall, f.e to read/write into maps. > > Generally, this allows for prepopulating maps, or any runtime altering > which could influence eBPF program behaviour (f.e. different run-time > classifications, skb modifications, ...), dumping of statistics, etc. 
>
> Reference:
> http://thread.gmane.org/gmane.linux.network/357471/focus=357860
> Suggested-by: Hannes Frederic Sowa
> Signed-off-by: Daniel Borkmann

Great work!

Reviewed-by: Hannes Frederic Sowa
[Patch net-next v2] fou: avoid missing unlock in failure path
Fixes: 7a6c8c34e5b7 ("fou: implement FOU_CMD_GET")
Reported-by: Dan Carpenter
Cc: Dan Carpenter
Signed-off-by: Cong Wang
---
 net/ipv4/fou.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index af150b4..34968cd 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -711,11 +711,10 @@ static int fou_nl_dump(struct sk_buff *skb, struct netlink_callback *cb)
 					    cb->nlh->nlmsg_seq, NLM_F_MULTI,
 					    skb, FOU_CMD_GET);
 		if (ret)
-			goto done;
+			break;
 	}
 	mutex_unlock(&fn->fou_lock);
-done:
 	cb->args[0] = idx;
 	return skb->len;
 }
--
1.8.3.1
Re: [patch net-next v2 0/2] switchdev: unify naming prefix
Wed, Apr 15, 2015 at 08:42:12PM CEST, da...@davemloft.net wrote: >From: Jiri Pirko >Date: Wed, 15 Apr 2015 20:38:15 +0200 > >> Wed, Apr 15, 2015 at 07:54:56PM CEST, go...@cumulusnetworks.com wrote: >>>On Mon, Apr 13, 2015 at 11:25:02PM +0200, Jiri Pirko wrote: Turned out that "switchdev" sticks. So just unify all releated terms to use this prefix. Jiri Pirko (2): switchdev: s/netdev_switch_/switchdev_/ and s/NETDEV_SWITCH_/SWITCHDEV_/ switchdev: s/swdev_/switchdev_/ drivers/net/bonding/bond_main.c | 4 +- drivers/net/ethernet/rocker/rocker.c | 47 +- drivers/net/team/team.c | 4 +- include/linux/netdevice.h| 2 +- include/net/switchdev.h | 140 ++-- net/bridge/br.c | 22 ++--- net/bridge/br_netlink.c | 6 +- net/bridge/br_stp.c | 2 +- net/core/net-sysfs.c | 2 +- net/core/rtnetlink.c | 2 +- net/dsa/slave.c | 8 +- net/ipv4/fib_trie.c | 40 net/switchdev/switchdev.c| 174 +-- 13 files changed, 224 insertions(+), 229 deletions(-) >>> >>>Acked-by: Andy Gospodarek >>> >> >> Unfortunatelly, this patchset did not make it into this net-next. Will >> resubmit with the acks once the merge-window closes. > >Scott said he would integrate the name change into his series, since it's >now deferred as well. Oh, ok, I missed this info. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch net-next v2 0/2] switchdev: unify naming prefix
From: Jiri Pirko Date: Wed, 15 Apr 2015 20:38:15 +0200 > Wed, Apr 15, 2015 at 07:54:56PM CEST, go...@cumulusnetworks.com wrote: >>On Mon, Apr 13, 2015 at 11:25:02PM +0200, Jiri Pirko wrote: >>> Turned out that "switchdev" sticks. So just unify all releated terms to >>> use this prefix. >>> >>> Jiri Pirko (2): >>> switchdev: s/netdev_switch_/switchdev_/ and >>> s/NETDEV_SWITCH_/SWITCHDEV_/ >>> switchdev: s/swdev_/switchdev_/ >>> >>> drivers/net/bonding/bond_main.c | 4 +- >>> drivers/net/ethernet/rocker/rocker.c | 47 +- >>> drivers/net/team/team.c | 4 +- >>> include/linux/netdevice.h| 2 +- >>> include/net/switchdev.h | 140 ++-- >>> net/bridge/br.c | 22 ++--- >>> net/bridge/br_netlink.c | 6 +- >>> net/bridge/br_stp.c | 2 +- >>> net/core/net-sysfs.c | 2 +- >>> net/core/rtnetlink.c | 2 +- >>> net/dsa/slave.c | 8 +- >>> net/ipv4/fib_trie.c | 40 >>> net/switchdev/switchdev.c| 174 >>> +-- >>> 13 files changed, 224 insertions(+), 229 deletions(-) >> >>Acked-by: Andy Gospodarek >> > > Unfortunatelly, this patchset did not make it into this net-next. Will > resubmit with the acks once the merge-window closes. Scott said he would integrate the name change into his series, since it's now deferred as well. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch net-next v2 0/2] switchdev: unify naming prefix
Wed, Apr 15, 2015 at 07:54:56PM CEST, go...@cumulusnetworks.com wrote:
>On Mon, Apr 13, 2015 at 11:25:02PM +0200, Jiri Pirko wrote:
>> Turned out that "switchdev" sticks. So just unify all related terms to
>> use this prefix.
>>
>> Jiri Pirko (2):
>>   switchdev: s/netdev_switch_/switchdev_/ and
>>     s/NETDEV_SWITCH_/SWITCHDEV_/
>>   switchdev: s/swdev_/switchdev_/
>>
>>  drivers/net/bonding/bond_main.c      |   4 +-
>>  drivers/net/ethernet/rocker/rocker.c |  47 +-
>>  drivers/net/team/team.c              |   4 +-
>>  include/linux/netdevice.h            |   2 +-
>>  include/net/switchdev.h              | 140 ++--
>>  net/bridge/br.c                      |  22 ++---
>>  net/bridge/br_netlink.c              |   6 +-
>>  net/bridge/br_stp.c                  |   2 +-
>>  net/core/net-sysfs.c                 |   2 +-
>>  net/core/rtnetlink.c                 |   2 +-
>>  net/dsa/slave.c                      |   8 +-
>>  net/ipv4/fib_trie.c                  |  40
>>  net/switchdev/switchdev.c            | 174 +--
>>  13 files changed, 224 insertions(+), 229 deletions(-)
>
>Acked-by: Andy Gospodarek
>

Unfortunately, this patchset did not make it into this net-next. Will
resubmit with the acks once the merge-window closes.

Thanks.
Re: [PATCH stable 3.10-3.16] tcp: Fix crash in TCP Fast Open
From: Eric Dumazet
Date: Wed, 15 Apr 2015 11:22:44 -0700

> On Wed, 2015-04-15 at 19:00 +0100, Ben Hutchings wrote:
>> Commit 355a901e6cf1 ("tcp: make connect() mem charging friendly")
>> changed tcp_send_syn_data() to perform an open-coded copy of the 'syn'
>> skb rather than using skb_copy_expand().
>>
>> The open-coded copy does not cover the skb_shared_info::gso_segs
>> field, so in the new skb it is left set to 0. When this commit was
>> backported into stable branches between 3.10.y and 3.16.7-ckty
>> inclusive, it triggered the BUG() in tcp_transmit_skb().
>>
>> Since Linux 3.18 the GSO segment count is kept in the
>> tcp_skb_cb::tcp_gso_segs field and tcp_send_syn_data() does copy the
>> tcp_skb_cb structure to the new skb, so mainline and newer stable
>> branches are not affected.
>>
>> Set skb_shared_info::gso_segs to the correct value of 1.
>>
>> Signed-off-by: Ben Hutchings
 ...
> Looks good to me, thanks Ben!
>
> Acked-by: Eric Dumazet

Ben, thanks for taking care of this.
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:
> Well, I'm not sure that it is George and Jonathan themselves who don't
> want to change a sysctl, but the customers who would have to tweak that
> in their VMs?

Keep in mind some VM users install custom qdisc, or even custom TCP sysctls.
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
> Maybe you should stop wasting all of our time and just tell us what
> you're thinking.

I think you are making me waste my time. I already gave all the hints in
prior discussions.

Rome was not built in one day.
Re: [PATCH stable 3.10-3.16] tcp: Fix crash in TCP Fast Open
On Wed, 2015-04-15 at 19:00 +0100, Ben Hutchings wrote:
> Commit 355a901e6cf1 ("tcp: make connect() mem charging friendly")
> changed tcp_send_syn_data() to perform an open-coded copy of the 'syn'
> skb rather than using skb_copy_expand().
>
> The open-coded copy does not cover the skb_shared_info::gso_segs
> field, so in the new skb it is left set to 0. When this commit was
> backported into stable branches between 3.10.y and 3.16.7-ckty
> inclusive, it triggered the BUG() in tcp_transmit_skb().
>
> Since Linux 3.18 the GSO segment count is kept in the
> tcp_skb_cb::tcp_gso_segs field and tcp_send_syn_data() does copy the
> tcp_skb_cb structure to the new skb, so mainline and newer stable
> branches are not affected.
>
> Set skb_shared_info::gso_segs to the correct value of 1.
>
> Signed-off-by: Ben Hutchings
> ---
>  net/ipv4/tcp_output.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index d5457e4..1ea0a07 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2992,6 +2992,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
> 		goto fallback;
> 	syn_data->ip_summed = CHECKSUM_PARTIAL;
> 	memcpy(syn_data->cb, syn->cb, sizeof(syn->cb));
> +	skb_shinfo(syn_data)->gso_segs = 1;
> 	if (unlikely(memcpy_fromiovecend(skb_put(syn_data, space),
> 					 fo->data->msg_iov, 0, space))) {
> 		kfree_skb(syn_data);
>

Looks good to me, thanks Ben!

Acked-by: Eric Dumazet
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 11:08 AM, Eric Dumazet wrote: > On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote: > > > Have you tested this patch on a NIC without GSO/TSO ? > > > This would allow more than 500 packets for a single flow. > > > Hello bufferbloat. > > Wouldn't the fq_codel qdisc on that interface address that problem? > Last time I checked, default qdisc was pfifo_fast. Bummer. > These guys do not want to change a sysctl, how pfifo_fast will magically become fq_codel ? Well, I'm not sure that it is George and Jonathan themselves who don't want to change a sysctl, but the customers who would have to tweak that in their VMs? rick -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 18:58 +0100, Stefano Stabellini wrote: > On Wed, 15 Apr 2015, Eric Dumazet wrote: > > On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote: > > > > > Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is > > > *already* larger for Xen; that calculation mentioned in the comment is > > > *already* doing the right thing. > > > > Sigh. > > > > 1ms of traffic at 40Gbit is 5 MBytes > > > > The reason for the cap to /proc/sys/net/ipv4/tcp_limit_output_bytes is > > to provide the limitation of ~2 TSO packets, which _also_ is documented. > > > > Without this limitation, 5 MBytes could translate to : Fill the queue, > > do not limit. > > > > If a particular driver needs to extend the limit, fine, document it and > > take actions. > > What actions do you have in mind exactly? It would be great if you > could suggest how to move forward from here, beside documentation. > > I don't think we can really expect every user that spawns a new VM in > the cloud to manually echo blah > > /proc/sys/net/ipv4/tcp_limit_output_bytes to an init script. I cannot > imagine that would work well. I already pointed a discussion on the same topic for wireless adapters. Some adapters have a ~3 ms TX completion delay, so the 1ms assumption in TCP stack is limiting the max throughput. All I hear here are unreasonable requests, marketing driven. If a global sysctl is not good enough, make it a per device value. We already have netdev->gso_max_size and netdev->gso_max_segs which are cached into sk->sk_gso_max_size & sk->sk_gso_max_segs What about you guys provide a new netdev->I_need_to_have_big_buffers_to_cope_with_my_latencies. Do not expect me to fight bufferbloat alone. Be part of the challenge, instead of trying to get back to proven bad solutions. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote: > > > > Have you tested this patch on a NIC without GSO/TSO ? > > > > This would allow more than 500 packets for a single flow. > > > > Hello bufferbloat. > > Wouldn't the fq_codel qdisc on that interface address that problem? Last time I checked, default qdisc was pfifo_fast. These guys do not want to change a sysctl, how pfifo_fast will magically become fq_codel ? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 06:52 PM, Eric Dumazet wrote: > On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote: > >> So you'd be OK with a patch like this? (With perhaps a better changelog?) >> >> -George >> >> --- >> TSQ: Raise default static TSQ limit >> >> A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the >> size of actual packets and the amount of data being transmitted. >> Raise the default static limit to allow that new limit to actually >> come into effect. >> >> This fixes a regression where NICs with large transmit completion >> times (such as xennet) had a 30% hit unless the user manually tweaked >> the value in /proc. >> >> Signed-off-by: George Dunlap >> >> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c >> index 1db253e..8ad7cdf 100644 >> --- a/net/ipv4/tcp_output.c >> +++ b/net/ipv4/tcp_output.c >> @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1; >> */ >> int sysctl_tcp_workaround_signed_windows __read_mostly = 0; >> >> -/* Default TSQ limit of two TSO segments */ >> -int sysctl_tcp_limit_output_bytes __read_mostly = 131072; >> +/* Static TSQ limit. A more dynamic limit is calculated in >> tcp_write_xmit. */ >> +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576; >> >> /* This limits the percentage of the congestion window which we >> * will allow a single TSO frame to consume. Building TSO frames >> > > Have you tested this patch on a NIC without GSO/TSO ? > > This would allow more than 500 packets for a single flow. > > Hello bufferbloat. > > So my answer to this patch is a no. You said: "I asked you guys to make a test by increasing sysctl_tcp_limit_output_bytes You have no need to explain me the code I wrote, thank you." Which implies to me that you think you've already pointed us to the answer you want and we're just not getting it. Maybe you should stop wasting all of our time and just tell us what you're thinking. 
-George -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 15 Apr 2015, Eric Dumazet wrote: > On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote: > > > Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is > > *already* larger for Xen; that calculation mentioned in the comment is > > *already* doing the right thing. > > Sigh. > > 1ms of traffic at 40Gbit is 5 MBytes > > The reason for the cap to /proc/sys/net/ipv4/tcp_limit_output_bytes is > to provide the limitation of ~2 TSO packets, which _also_ is documented. > > Without this limitation, 5 MBytes could translate to : Fill the queue, > do not limit. > > If a particular driver needs to extend the limit, fine, document it and > take actions. What actions do you have in mind exactly? It would be great if you could suggest how to move forward from here, beside documentation. I don't think we can really expect every user that spawns a new VM in the cloud to manually echo blah > /proc/sys/net/ipv4/tcp_limit_output_bytes to an init script. I cannot imagine that would work well. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH stable 3.10-3.16] tcp: Fix crash in TCP Fast Open
Commit 355a901e6cf1 ("tcp: make connect() mem charging friendly") changed tcp_send_syn_data() to perform an open-coded copy of the 'syn' skb rather than using skb_copy_expand(). The open-coded copy does not cover the skb_shared_info::gso_segs field, so in the new skb it is left set to 0. When this commit was backported into stable branches between 3.10.y and 3.16.7-ckty inclusive, it triggered the BUG() in tcp_transmit_skb(). Since Linux 3.18 the GSO segment count is kept in the tcp_skb_cb::tcp_gso_segs field and tcp_send_syn_data() does copy the tcp_skb_cb structure to the new skb, so mainline and newer stable branches are not affected. Set skb_shared_info::gso_segs to the correct value of 1. Signed-off-by: Ben Hutchings --- net/ipv4/tcp_output.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index d5457e4..1ea0a07 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2992,6 +2992,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn) goto fallback; syn_data->ip_summed = CHECKSUM_PARTIAL; memcpy(syn_data->cb, syn->cb, sizeof(syn->cb)); + skb_shinfo(syn_data)->gso_segs = 1; if (unlikely(memcpy_fromiovecend(skb_put(syn_data, space), fo->data->msg_iov, 0, space))) { kfree_skb(syn_data); -- Ben Hutchings Editing code like this is akin to sticking plasters on the bleeding stump of a severed limb. - me, 29 June 1999
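The failure mode described above is easy to reproduce in a tiny user-space model: an open-coded copy that duplicates the control-block bytes but forgets a shared-info field leaves that field at its freshly-allocated value of zero. The struct and function names below are hypothetical stand-ins for sk_buff/skb_shared_info, not kernel code:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for sk_buff / skb_shared_info. */
struct fake_shinfo {
	unsigned short gso_segs;
};

struct fake_skb {
	unsigned char cb[48];		/* models tcp_skb_cb storage */
	struct fake_shinfo shinfo;
};

/* The open-coded copy from the buggy backport: cb[] is copied, but the
 * shared-info area of the new skb keeps its initial value of 0. */
static void open_coded_copy(struct fake_skb *dst, const struct fake_skb *src)
{
	memcpy(dst->cb, src->cb, sizeof(src->cb));
	/* bug: dst->shinfo.gso_segs is never set */
}

/* With the one-line fix from the patch applied. */
static void open_coded_copy_fixed(struct fake_skb *dst,
				  const struct fake_skb *src)
{
	open_coded_copy(dst, src);
	dst->shinfo.gso_segs = 1;	/* the SYN is a single segment */
}
```

A gso_segs of 0 is exactly the inconsistent state that tripped the BUG() in tcp_transmit_skb() on the affected stable branches.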
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
> Have you tested this patch on a NIC without GSO/TSO ? > This would allow more than 500 packets for a single flow. > Hello bufferbloat. Wouldn't the fq_codel qdisc on that interface address that problem? rick -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch net-next v2 0/2] switchdev: unify naming prefix
On Mon, Apr 13, 2015 at 11:25:02PM +0200, Jiri Pirko wrote: > Turned out that "switchdev" sticks. So just unify all releated terms to > use this prefix. > > Jiri Pirko (2): > switchdev: s/netdev_switch_/switchdev_/ and > s/NETDEV_SWITCH_/SWITCHDEV_/ > switchdev: s/swdev_/switchdev_/ > > drivers/net/bonding/bond_main.c | 4 +- > drivers/net/ethernet/rocker/rocker.c | 47 +- > drivers/net/team/team.c | 4 +- > include/linux/netdevice.h| 2 +- > include/net/switchdev.h | 140 ++-- > net/bridge/br.c | 22 ++--- > net/bridge/br_netlink.c | 6 +- > net/bridge/br_stp.c | 2 +- > net/core/net-sysfs.c | 2 +- > net/core/rtnetlink.c | 2 +- > net/dsa/slave.c | 8 +- > net/ipv4/fib_trie.c | 40 > net/switchdev/switchdev.c| 174 > +-- > 13 files changed, 224 insertions(+), 229 deletions(-) Acked-by: Andy Gospodarek -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote: > So you'd be OK with a patch like this? (With perhaps a better changelog?) > > -George > > --- > TSQ: Raise default static TSQ limit > > A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the > size of actual packets and the amount of data being transmitted. > Raise the default static limit to allow that new limit to actually > come into effect. > > This fixes a regression where NICs with large transmit completion > times (such as xennet) had a 30% hit unless the user manually tweaked > the value in /proc. > > Signed-off-by: George Dunlap > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 1db253e..8ad7cdf 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1; > */ > int sysctl_tcp_workaround_signed_windows __read_mostly = 0; > > -/* Default TSQ limit of two TSO segments */ > -int sysctl_tcp_limit_output_bytes __read_mostly = 131072; > +/* Static TSQ limit. A more dynamic limit is calculated in > tcp_write_xmit. */ > +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576; > > /* This limits the percentage of the congestion window which we > * will allow a single TSO frame to consume. Building TSO frames > Have you tested this patch on a NIC without GSO/TSO ? This would allow more than 500 packets for a single flow. Hello bufferbloat. So my answer to this patch is a no. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 06:29 PM, Eric Dumazet wrote: > On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote: >> On 04/15/2015 05:38 PM, Eric Dumazet wrote: >>> My thoughts that instead of these long talks you should guys read the >>> code : >>> >>> /* TCP Small Queues : >>> * Control number of packets in qdisc/devices to two >>> packets / or ~1 ms. >>> * This allows for : >>> * - better RTT estimation and ACK scheduling >>> * - faster recovery >>> * - high rates >>> * Alas, some drivers / subsystems require a fair amount >>> * of queued bytes to ensure line rate. >>> * One example is wifi aggregation (802.11 AMPDU) >>> */ >>> limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10); >>> limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes); >>> >>> >>> Then you'll see that most of your questions are already answered. >>> >>> Feel free to try to improve the behavior, if it does not hurt critical >>> workloads >>> like TCP_RR, where we we send very small messages, millions times per >>> second. >> >> First of all, with regard to critical workloads, once this patch gets >> into distros, *normal TCP streams* on every VM running on Amazon, >> Rackspace, Linode, &c will get a 30% hit in performance *by default*. >> Normal TCP streams on xennet *are* a critical workload, and deserve the >> same kind of accommodation as TCP_RR (if not more). The same goes for >> virtio_net. >> >> Secondly, according to Stefano's and Jonathan's tests, >> tcp_limit_output_bytes completely fixes the problem for Xen. >> >> Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is >> *already* larger for Xen; that calculation mentioned in the comment is >> *already* doing the right thing. >> >> As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an >> automatic TSQ calculation which is actually choosing an effective value >> for xennet. >> >> It certainly makes sense for sysctl_tcp_limit_output_bytes to be an >> actual maximum limit. 
I went back and looked at the original patch >> which introduced it (46d3ceabd), and it looks to me like it was designed >> to be a rough, quick estimate of "two packets outstanding" (by choosing >> the maximum size of the packet, 64k, and multiplying it by two). >> >> Now that you have a better algorithm -- the size of 2 actual packets or >> the amount transmitted in 1ms -- it seems like the default >> sysctl_tcp_limit_output_bytes should be higher, and let the automatic >> TSQ you have on the first line throttle things down when necessary. > > > I asked you guys to make a test by increasing > sysctl_tcp_limit_output_bytes So you'd be OK with a patch like this? (With perhaps a better changelog?) -George --- TSQ: Raise default static TSQ limit A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the size of actual packets and the amount of data being transmitted. Raise the default static limit to allow that new limit to actually come into effect. This fixes a regression where NICs with large transmit completion times (such as xennet) had a 30% hit unless the user manually tweaked the value in /proc. Signed-off-by: George Dunlap diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 1db253e..8ad7cdf 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1; */ int sysctl_tcp_workaround_signed_windows __read_mostly = 0; -/* Default TSQ limit of two TSO segments */ -int sysctl_tcp_limit_output_bytes __read_mostly = 131072; +/* Static TSQ limit. A more dynamic limit is calculated in tcp_write_xmit. */ +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576; /* This limits the percentage of the congestion window which we * will allow a single TSO frame to consume. Building TSO frames
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote: > Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is > *already* larger for Xen; that calculation mentioned in the comment is > *already* doing the right thing. Sigh. 1ms of traffic at 40Gbit is 5 MBytes The reason for the cap to /proc/sys/net/ipv4/tcp_limit_output_bytes is to provide the limitation of ~2 TSO packets, which _also_ is documented. Without this limitation, 5 MBytes could translate to : Fill the queue, do not limit. If a particular driver needs to extend the limit, fine, document it and take actions. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
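The figures Eric cites check out arithmetically: at 40 Gbit/s, 1 ms of traffic is 40e9 / 8 * 1e-3 = 5,000,000 bytes, while the sysctl default of 131072 is exactly two maximum-size (64 KB) TSO packets. A small helper (hypothetical, not kernel code) makes the calculation explicit:

```c
#include <assert.h>

/* Bytes transferred in interval_us microseconds at rate_bps bits/s. */
static unsigned long long bytes_per_interval(unsigned long long rate_bps,
					     unsigned int interval_us)
{
	/* divide by 8 first to keep intermediate values small */
	return rate_bps / 8 * interval_us / 1000000ULL;
}
```

This is why Eric objects to letting the 1 ms pacing term stand alone: without the static cap, a single fast flow could legitimately queue megabytes.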
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote: > On 04/15/2015 05:38 PM, Eric Dumazet wrote: > > My thoughts that instead of these long talks you should guys read the > > code : > > > > /* TCP Small Queues : > > * Control number of packets in qdisc/devices to two > > packets / or ~1 ms. > > * This allows for : > > * - better RTT estimation and ACK scheduling > > * - faster recovery > > * - high rates > > * Alas, some drivers / subsystems require a fair amount > > * of queued bytes to ensure line rate. > > * One example is wifi aggregation (802.11 AMPDU) > > */ > > limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10); > > limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes); > > > > > > Then you'll see that most of your questions are already answered. > > > > Feel free to try to improve the behavior, if it does not hurt critical > > workloads > > like TCP_RR, where we we send very small messages, millions times per > > second. > > First of all, with regard to critical workloads, once this patch gets > into distros, *normal TCP streams* on every VM running on Amazon, > Rackspace, Linode, &c will get a 30% hit in performance *by default*. > Normal TCP streams on xennet *are* a critical workload, and deserve the > same kind of accommodation as TCP_RR (if not more). The same goes for > virtio_net. > > Secondly, according to Stefano's and Jonathan's tests, > tcp_limit_output_bytes completely fixes the problem for Xen. > > Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is > *already* larger for Xen; that calculation mentioned in the comment is > *already* doing the right thing. > > As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an > automatic TSQ calculation which is actually choosing an effective value > for xennet. > > It certainly makes sense for sysctl_tcp_limit_output_bytes to be an > actual maximum limit. 
I went back and looked at the original patch > which introduced it (46d3ceabd), and it looks to me like it was designed > to be a rough, quick estimate of "two packets outstanding" (by choosing > the maximum size of the packet, 64k, and multiplying it by two). > > Now that you have a better algorithm -- the size of 2 actual packets or > the amount transmitted in 1ms -- it seems like the default > sysctl_tcp_limit_output_bytes should be higher, and let the automatic > TSQ you have on the first line throttle things down when necessary. I asked you guys to make a test by increasing sysctl_tcp_limit_output_bytes You have no need to explain me the code I wrote, thank you. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On 04/15/2015 05:38 PM, Eric Dumazet wrote: > My thoughts that instead of these long talks you should guys read the > code : > > /* TCP Small Queues : > * Control number of packets in qdisc/devices to two packets > / or ~1 ms. > * This allows for : > * - better RTT estimation and ACK scheduling > * - faster recovery > * - high rates > * Alas, some drivers / subsystems require a fair amount > * of queued bytes to ensure line rate. > * One example is wifi aggregation (802.11 AMPDU) > */ > limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10); > limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes); > > > Then you'll see that most of your questions are already answered. > > Feel free to try to improve the behavior, if it does not hurt critical > workloads > like TCP_RR, where we we send very small messages, millions times per second. First of all, with regard to critical workloads, once this patch gets into distros, *normal TCP streams* on every VM running on Amazon, Rackspace, Linode, &c will get a 30% hit in performance *by default*. Normal TCP streams on xennet *are* a critical workload, and deserve the same kind of accommodation as TCP_RR (if not more). The same goes for virtio_net. Secondly, according to Stefano's and Jonathan's tests, tcp_limit_output_bytes completely fixes the problem for Xen. Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is *already* larger for Xen; that calculation mentioned in the comment is *already* doing the right thing. As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an automatic TSQ calculation which is actually choosing an effective value for xennet. It certainly makes sense for sysctl_tcp_limit_output_bytes to be an actual maximum limit. I went back and looked at the original patch which introduced it (46d3ceabd), and it looks to me like it was designed to be a rough, quick estimate of "two packets outstanding" (by choosing the maximum size of the packet, 64k, and multiplying it by two). 
Now that you have a better algorithm -- the size of 2 actual packets or the amount transmitted in 1ms -- it seems like the default sysctl_tcp_limit_output_bytes should be higher, and let the automatic TSQ you have on the first line throttle things down when necessary. -George -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
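The interaction George describes can be seen by transcribing the two quoted kernel lines into a standalone helper (a sketch with made-up example values; the real code lives in tcp_write_xmit()):

```c
#include <assert.h>

/* Mirror of:
 *   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
 *   limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 */
static unsigned int tsq_limit(unsigned int skb_truesize,
			      unsigned long long pacing_rate, /* bytes/s */
			      unsigned int sysctl_limit)
{
	unsigned long long limit = 2ULL * skb_truesize;

	if ((pacing_rate >> 10) > limit)	/* ~1 ms worth of data */
		limit = pacing_rate >> 10;
	if (limit > sysctl_limit)		/* static cap under debate */
		limit = sysctl_limit;
	return (unsigned int)limit;
}
```

With a pacing rate of 5 GB/s (40 Gbit/s) the dynamic term is ~4.8 MB, so the static cap is what actually limits the flow; raising the sysctl from 131072 to 1048576 therefore changes the effective limit for fast flows while leaving slow flows (limited by 2 * truesize) untouched, which is the crux of the disagreement.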
Re: [PATCH v2 net] bnx2x: Fix busy_poll vs netpoll
On Wed, 2015-04-15 at 14:03 +, Yuval Mintz wrote: > > > > > > BTW, this looks quite generic - isn't it possible to take it out of > > > the driver and push it into the networking core, uniforming it in the > > > process? > > > > Yes, this is the plan I have in mind, once net-next is opened again ;) > > > > Thanks ! > > > > I'm not familiar with any real strong test-suite for the busy poll, but I did > try > running all kinds of netperf connections with various busy_{poll,read} values > and performance seemed reasonable, and verified that netconsole manages > to work. > > So for all it's worth, > Tested-by: Yuval Mintz Thanks a lot ! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] bpf: fix verifier memory corruption
On 04/15/2015 12:57 AM, Alexei Starovoitov wrote: Due to missing bounds check the DAG pass of the BPF verifier can corrupt the memory which can cause random crashes during program loading: [8.449451] BUG: unable to handle kernel paging request at [8.451293] IP: [] kmem_cache_alloc_trace+0x8d/0x2f0 [8.452329] Oops: [#1] SMP [8.452329] Call Trace: [8.452329] [] bpf_check+0x852/0x2000 [8.452329] [] bpf_prog_load+0x1e4/0x310 [8.452329] [] ? might_fault+0x5f/0xb0 [8.452329] [] SyS_bpf+0x806/0xa30 Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier") Signed-off-by: Alexei Starovoitov As far as I can tell, looks good to me. Any other access to a next instruction elsewhere would be blocked from push_insn() with an error. Acked-by: Daniel Borkmann -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
On Wed, 2015-04-15 at 15:36 +0100, Ian Campbell wrote: > On Wed, 2015-04-15 at 15:19 +0100, George Dunlap wrote: > > On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley > [...] > > > From a networking point of view, the backend is a switch. Is it OK to > > > consider the packet to have been transmitted from the guest point of > > > view once the backend is aware of the packet? > > > > > > This would help justify the skb_orphan() in the frontend. > > > > This sounds sensible to me, particularly if virtio_net is already doing it. > > I also find Malcolm's argument above pretty compelling. Yes, and then you'll have to help the virtio ongoing effort trying to get rid of this skb_orphan() Basically you're adding head of line blocking, as a single flow is able to fill your queue. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Wed, 2015-04-15 at 14:43 +0100, George Dunlap wrote: > On Mon, Apr 13, 2015 at 2:49 PM, Eric Dumazet wrote: > > On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote: > > > >> Is the problem perhaps that netback/netfront delays TX completion? >> Would it be better to see if that can be addressed properly, so that > >> the original purpose of the patch (fighting bufferbloat) can be > >> achieved while not degrading performance for Xen? Or at least, so > >> that people get decent performance out of the box without having to > >> tweak TCP parameters? > > > > Sure, please provide a patch, that does not break back pressure. > > > > But just in case, if Xen performance relied on bufferbloat, it might be > > very difficult to reach a stable equilibrium : Any small change in stack > > or scheduling might introduce a significant difference in 'raw > > performance'. > > So help me understand this a little bit here. tcp_limit_output_bytes > limits the amount of data allowed to be "in-transit" between a send() > and the wire, is that right? > > And so the "bufferbloat" problem you're talking about here are TCP > buffers inside the kernel, and/or buffers in the NIC, is that right? > > So ideally, you want this to be large enough to fill the "pipeline" > all the way from send() down to actually getting out on the wire; > otherwise, you'll have gaps in the pipeline, and the machinery won't > be working at full throttle. > > And the reason it's a problem is that many NICs now come with large > send buffers; and effectively what happens then is that this makes the > "pipeline" longer -- as the buffer fills up, the time between send() > and the wire is increased. This increased latency causes delays in > round-trip-times and interferes with the mechanisms TCP uses to try to > determine what the actual sustainable rate of data transmission is. 
> > By limiting the number of "in-transit" bytes, you make sure that > neither the kernel nor the NIC are going to have packets queues up for > long lengths of time in buffers, and you keep this "pipeline" as close > to the actual minimal length of the pipeline as possible. And it > sounds like for your 40G NIC, 128k is big enough to fill the pipeline > without unduly making it longer by introducing buffering. > > Is that an accurate picture of what you're trying to achieve? > > But the problem for xennet (and a number of other drivers), as I > understand it, is that at the moment the "pipeline" itself is just > longer -- it just takes a longer time from the time you send a packet > to the time it actually gets out on the wire. > > So it's not actually accurate to say that "Xen performance relies on > bufferbloat". There's no buffering involved -- the pipeline is just > longer, and so to fill up the pipeline you need more data. > > Basically, to maximize throughput while minimizing buffering, for > *any* connection, tcp_limit_output_bytes should ideally be around > (min_tx_latency * max_bandwidth). For physical NICs, the minimum > latency is really small, but for xennet -- and I'm guessing for a lot > of virtualized cards -- the min_tx_latency will be a lot higher, > requiring a much higher ideal tcp_limit_output value. > > Rather than trying to pick a single value which will be good for all > NICs, it seems like it would make more sense to have this vary > depending on the parameters of the NIC. After all, for NICs that have > low throughput -- say, old 100MiB NICs -- even 128k may still > introduce a significant amount of buffering. > > Obviously one solution would be to allow the drivers themselves to set > the tcp_limit_output_bytes, but that seems like a maintenance > nightmare. > > Another simple solution would be to allow drivers to indicate whether > they have a high transmit latency, and have the kernel use a higher > value by default when that's the case. 
> > Probably the most sustainable solution would be to have the networking > layer keep track of the average and minimum transmit latencies, and > automatically adjust tcp_limit_output_bytes based on that. (Keeping > the minimum as well as the average because the whole problem with > bufferbloat is that the more data you give it, the longer the apparent > "pipeline" becomes.) > > Thoughts? My thought is that instead of these long talks you guys should read the code : /* TCP Small Queues : * Control number of packets in qdisc/devices to two packets / or ~1 ms. * This allows for : * - better RTT estimation and ACK scheduling * - faster recovery * - high rates * Alas, some drivers / subsystems require a fair amount * of queued bytes to ensure line rate. * One example is wifi aggregation (802.11 AMPDU) */ limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10); limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes); Then you'll see that most of your questions are already answered. Feel free to try to improve the behavior, if it does not hurt critical workloads like TCP_RR, where we send very small messages, millions of times per second. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] mac80211-hwsim: Pass rate-ctrl flags and tx-power to user-space
On 04/14/2015 01:15 AM, Johannes Berg wrote: > On Fri, 2015-04-03 at 14:12 -0700, gree...@candelatech.com wrote: > >> +/* Auxilary info to allow user-space to better understand the rate */ >> +struct hwsim_tx_rate2 { >> +u16 rc_flags; /* rate-ctrl flags (see mac80211_rate_control_flags) */ > > I really don't like this - it ties internal API to userspace API. You > may argue that this is userspace that is for debug purposes only, but > I'm sure you'll also scratch your head very confused when I change the > rate control flags for any reason and your code breaks :) Is this a documentation only issue (at this point, while code matches)? Then later, we can add translation to keep user-space API the same as needed? And in that case, maybe I'll make it u32 to give room down the road? Thanks, Ben -- Ben Greear Candela Technologies Inc http://www.candelatech.com -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
On Wed, Apr 15, 2015 at 12:35:16AM -0700, John Fastabend wrote: > > I'll dig up my scripts and post them to github this weekend. They > are a bit disorganized and all over the place at the moment. > > Maybe we can build a master repository. I know there are a lot of different > scripts running around, for example I already collected a few from > Jamal and I think Cong must have some as well. great. let's start building a proper testsuite. > diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c > index 4cdbfb8..a2542ac 100644 > --- a/net/sched/sch_ingress.c > +++ b/net/sched/sch_ingress.c > @@ -69,7 +69,7 @@ static int ingress_enqueue(struct sk_buff *skb, struct > Qdisc *sch) > switch (result) { > case TC_ACT_SHOT: > result = TC_ACT_SHOT; > - qdisc_qstats_drop(sch); > + qdisc_qstats_drop_cpu(sch); Your quick patch missed updating: - qdisc_bstats_update(sch, skb); + qdisc_bstats_update_cpu(sch, skb); I've tried the same :) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] bpf: fix verifier memory corruption
On Wed, Apr 15, 2015, at 18:07, Alexei Starovoitov wrote: > On 4/15/15 8:59 AM, Hannes Frederic Sowa wrote: > > On Di, 2015-04-14 at 15:57 -0700, Alexei Starovoitov wrote: > >> Due to missing bounds check the DAG pass of the BPF verifier can corrupt > >> the memory which can cause random crashes during program loading: > >> > >> [8.449451] BUG: unable to handle kernel paging request at > >> [8.451293] IP: [] kmem_cache_alloc_trace+0x8d/0x2f0 > >> [8.452329] Oops: [#1] SMP > >> [8.452329] Call Trace: > >> [8.452329] [] bpf_check+0x852/0x2000 > >> [8.452329] [] bpf_prog_load+0x1e4/0x310 > >> [8.452329] [] ? might_fault+0x5f/0xb0 > >> [8.452329] [] SyS_bpf+0x806/0xa30 > >> > >> Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier") > >> Signed-off-by: Alexei Starovoitov > >> --- > >> Many things need to align for this crash to be seen, yet I managed to hit > >> it. > >> In my case JA was last insn, 't' was 255 and explored_states array > >> had 256 elements. I've double checked other similar paths and all seems > >> clean. 
> >> > >> kernel/bpf/verifier.c |3 ++- > >> 1 file changed, 2 insertions(+), 1 deletion(-) > >> > >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > >> index a28e09c7825d..36508e69e92a 100644 > >> --- a/kernel/bpf/verifier.c > >> +++ b/kernel/bpf/verifier.c > >> @@ -1380,7 +1380,8 @@ peek_stack: > >>/* tell verifier to check for equivalent states > >> * after every call and jump > >> */ > >> - env->explored_states[t + 1] = STATE_LIST_MARK; > >> + if (t + 1 < insn_cnt) > >> + env->explored_states[t + 1] = STATE_LIST_MARK; > >>} else { > >>/* conditional jump with two edges */ > >>ret = push_insn(t, t + 1, FALLTHROUGH, env); > > > > Quick review: > > > > We have env->explored_states[t+1] access in the > > > > } else { > > /* conditional jump with two edges */ > > ret = push_insn(t, t + 1, FALLTHROUGH, env); > > if (ret == 1) > > goto peek_stack; > > else if (ret < 0) > > goto err_free; > > > ret = push_insn(t, t + insns[t].off + 1, BRANCH, > env); > > if (ret == 1) > > goto peek_stack; > > else if (ret < 0) > > goto err_free; > > } > > } else { > > > > > > push_insn call. At this point insn[t].off could be 0, no? > > insn[t].off can be anything, but the first thing that push_insn() > checks is: > if (w < 0 || w >= env->prog->len) > only then it does: > env->explored_states[w] = STATE_LIST_MARK; > so we're good there. > Though thanks for triple checking :) Sorry, yes. That check was too obvious to me. ;) Acked-by: Hannes Frederic Sowa Bye, Hannes
Re: [PATCH net-next v2 1/4] cxgb4/iw_cxgb4/cxgb4i: remove duplicate definitions
On Wed, Apr 15, 2015 at 06:34:27PM +0530, Varun Prakash wrote: > On Mon, Apr 13, 2015 at 10:38:08AM -0600, Jason Gunthorpe wrote: > > On Mon, Apr 13, 2015 at 07:34:23PM +0530, Varun Prakash wrote: > > > define struct ulptx_idata in common header file t4_msg.h > > > to remove duplicate definitions. > > > > The Infiniband side of this patch looks OK. > > > > Reviewed-By: Jason Gunthorpe > > > > Just some random thoughts on the other patches: > > - Try and use 'if (IS_ENABLED(CONFIG_XX))' over #ifdef > >to improve compile test coverage. This would drop a fair number > >of ifdefs. > > FCoE specific structures and functions are defined only if > CONFIG_CHELSIO_T4_FCOE is enabled. I saw a number of places that could be switched, you should look through. > #ifdef CONFIG_CHELSIO_T4_FCOE > void cxgb_fcoe_init_ddp(struct adapter *adap) > { > ... > } > #endif > > If CONFIG_CHELSIO_T4_FCOE is disabled then following code will > result in build error "implicit declaration of function cxgb_fcoe_init_ddp" > > if (IS_ENABLED(CONFIG_CHELSIO_T4_FCOE)) > cxgb_fcoe_init_ddp(adap); Commonly the ifdef would be moved to the function header and a static inline cxgb_fcoe_init_ddp(..) {} declared for the ifndef case. For the case of static function calls, just drop the ifdefing entirely. The compiler will silently drop unused static functions. Think carefully about ifdefing out structure members - I don't think saving a few bytes of heap in a driver like cxgb4 is worthwhile. I also just noticed that drivers/net/ethernet/chelsio/cxgb4/cxgb4_fcoe.c Is wrapped in a whole file #ifdef CONFIG_CHELSIO_T4_FCOE - please use the makefile instead for this. Jason
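The stub pattern Jason describes can be sketched in plain userspace C. All names below are illustrative stand-ins, not the real cxgb4 symbols: when the config switch is off, an empty static inline replaces the real function, so call sites stay completely free of #ifdef.

```c
#define CONFIG_FCOE 0   /* pretend the Kconfig option is off */

#if CONFIG_FCOE
int fcoe_init_ddp(int adap)
{
	return adap + 1;        /* "real" initialization */
}
#else
/* Empty static inline stub: compiles to nothing, so the call
 * site below needs no #ifdef of its own. */
static inline int fcoe_init_ddp(int adap)
{
	(void)adap;
	return 0;
}
#endif

int probe(int adap)
{
	return fcoe_init_ddp(adap);     /* ifdef-free call site */
}
```

With CONFIG_FCOE flipped to 1 the same call site picks up the real implementation, which is exactly why this is preferred over IS_ENABLED() when the function body itself cannot compile in the disabled configuration.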
Re: [PATCH 1/7] net: refactor __netif_receive_skb_core
On Fri, 10 Apr 2015 15:47:34 +0200 Daniel Borkmann wrote: > On 04/10/2015 02:15 PM, Pablo Neira Ayuso wrote: > > This patch splits __netif_receive_skb_core() in smaller functions to improve > > maintainability. > > > > The function __netif_receive_skb_core() has been split in two: > > > > * __netif_receive_skb_ingress(), to perform all actions up to > >ingress filtering. > > > > * __netif_receive_skb_finish(), if the ingress filter accepts this > >packet, pass it to the corresponding packet_type function handler for > > further > > processing. > > > > This patch also adds __NET_RX_ANOTHER_ROUND that is used when the packet is > > stripped off from the vlan header or in case the rx_handler needs it. > > > > This also prepares the introduction of the netfilter ingress hook. > > Out of curiosity, what is actually the performance impact on all > of this? We were just arguing on a different matter on two more > instructions in the fast-path, here it's refactoring the whole > function into several ones, I presume gcc won't inline it. Pablo asked me to performance test this change. Full test report below. The performance effect (of this patch) depends on the GCC compiler version. Two tests: 1. IP-forwarding (unloaded netfilter modules) 2. Early drop in iptables "raw" table With GCC 4.4.7, which does not inline the new functions (__netif_receive_skb_ingress and __netif_receive_skb_finish) the performance impact/regression is definitely measurable. With GCC 4.4.7: 1. IP-forwarding: +25.18 ns (slower) (-27776 pps) 2. Early-drop : +7.55 ns (slower) (-66577 pps) With GCC 4.9.1, the new functions get inlined, thus the refactor split-up of __netif_receive_skb_core() is basically "cancelled". Strangely there is a small improvement for forwarding, likely due to some lucky assembler reordering that gives fewer icache/fetch-misses. The early-drop improvement is below accuracy levels and cannot be trusted. With GCC 4.9.1: 1. IP-forwarding: -10.05ns (faster) (+17532 pps) 2.
Early-drop : -1.54ns (faster) (+16216 pps) below accuracy levels I don't know what to conclude, as the results depend on the compiler version... but these kinds of changes do affect performance, and should be tested/measured. - -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Network Kernel Developer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer Quick eval of Pablo's refactor of __netif_receive_skb_core == :Version: 0.1 :Author: Jesper Dangaard Brouer Summary === Pablo is refactoring __netif_receive_skb_core() into several functions, to allow for some other upcoming changes. Daniel Borkmann questioned if this will affect performance. Jesper tests this. The performance effect (of this patch) depends on the GCC compiler version. Two tests: 1. IP-forwarding (unloaded netfilter modules) 2. Early drop in iptables "raw" table With GCC 4.4.7, which does not inline the new functions (__netif_receive_skb_ingress and __netif_receive_skb_finish) the performance impact/regression is definitely measurable. With GCC 4.4.7: 1. IP-forwarding: +25.18 ns (slower) (-27776 pps) 2. Early-drop : +7.55 ns (slower) (-66577 pps) With GCC 4.9.1, the new functions get inlined, thus the refactor split-up of __netif_receive_skb_core() is basically "cancelled". Strangely there is a small improvement for forwarding, likely due to some lucky assembler reordering that gives fewer icache/fetch-misses. The early-drop improvement is below accuracy levels and cannot be trusted. With GCC 4.9.1: 1. IP-forwarding: -10.05ns (faster) (+17532 pps) 2. Early-drop : -1.54ns (faster) (+16216 pps) below accuracy levels Setup = On host: ivy Host ivy is the "sink" or DUT (Device Under Test). 
* CPU E5-2695ES @ 2.80GHz netfilter_unload_modules.sh netfilter_unload_modules.sh sudo rmmod nf_reject_ipv4 nf_reject_ipv6 base_device_setup.sh eth4 # 10G sink/receiving interface (ixgbe) base_device_setup.sh eth5 sudo ethtool --coalesce eth4 rx-usecs 30 Make a fake route to 198.18.0.0/15 out via eth5 sudo ip neigh add 192.168.21.66 dev eth5 lladdr 00:00:ba:d0:ba:d0 sudo ip route add 198.18.0.0/15 via 192.168.21.66 dev eth5 Disable power saving to get more accurate measurements (see blogpost):: $ sudo tuned-adm active Current active profile: latency-performance Service tuned: enabled, running Service ktune: enabled, running Early drop in raw - alias iptables='sudo iptables' iptables -t raw -N simple || iptables -t raw -F simple iptables -t raw -I simple -d 198.18.0.0/15 -j DROP iptables -t raw -D PREROUTING -j simple iptables -t raw -I PREROUTING -j simple On host: dragon --- Host dragon is the packet generator. * 2x CPU E5-2630 0 @ 2.30GHz Generator NIC: eth8 - ixgbe 10G netfilter_unload_modules.sh netfilter_unload_modules.sh sudo rmmod nf_reject_ipv4 nf_reject_ipv6 base_device_setup.sh eth8 #
Re: [PATCH linux-next 1/4] infiniband/ipoib: fix possible NULL pointer dereference in ipoib_get_iflink
On Wed, Apr 15, 2015 at 09:17:14AM +0300, Erez Shitrit wrote: > >>+ /* parent interface */ > >>+ if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) > >>+ return dev->ifindex; > >>+ > >>+ /* child/vlan interface */ > >>+ if (!priv->parent) > >>+ return -1; > >Like was said for other drivers, I can't see how parent can be null > >while IPOIB_FLAG_SUBINTERFACE is set. Drop the last if. > It can, at least for ipoib child interface (AKA "vlan"), you can't > control the call for that ndo and it can be called before the parent > was set. If the ndo can be called before the netdev private structures are fully prepared then we have another bug, and returning -1 or 0 is not the right answer anyhow. For safety, fold this into your patch. diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c index 9fad7b5ac8b9..e62b007adf5d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c @@ -58,6 +58,7 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv, /* MTU will be reset when mcast join happens */ priv->dev->mtu = IPOIB_UD_MTU(priv->max_ib_mtu); priv->mcast_mtu = priv->admin_mtu = priv->dev->mtu; + priv->parent = ppriv->dev; set_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags); result = ipoib_set_dev_features(priv, ppriv->ca); @@ -84,8 +85,6 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv, goto register_failed; } - priv->parent = ppriv->dev; - ipoib_create_debug_files(priv->dev); /* RTNL childs don't need proprietary sysfs entries */
Re: [PATCH net] bpf: fix verifier memory corruption
On 4/15/15 8:59 AM, Hannes Frederic Sowa wrote: On Di, 2015-04-14 at 15:57 -0700, Alexei Starovoitov wrote: Due to missing bounds check the DAG pass of the BPF verifier can corrupt the memory which can cause random crashes during program loading: [8.449451] BUG: unable to handle kernel paging request at [8.451293] IP: [] kmem_cache_alloc_trace+0x8d/0x2f0 [8.452329] Oops: [#1] SMP [8.452329] Call Trace: [8.452329] [] bpf_check+0x852/0x2000 [8.452329] [] bpf_prog_load+0x1e4/0x310 [8.452329] [] ? might_fault+0x5f/0xb0 [8.452329] [] SyS_bpf+0x806/0xa30 Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier") Signed-off-by: Alexei Starovoitov --- Many things need to align for this crash to be seen, yet I managed to hit it. In my case JA was last insn, 't' was 255 and explored_states array had 256 elements. I've double checked other similar paths and all seems clean. kernel/bpf/verifier.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index a28e09c7825d..36508e69e92a 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1380,7 +1380,8 @@ peek_stack: /* tell verifier to check for equivalent states * after every call and jump */ - env->explored_states[t + 1] = STATE_LIST_MARK; + if (t + 1 < insn_cnt) + env->explored_states[t + 1] = STATE_LIST_MARK; } else { /* conditional jump with two edges */ ret = push_insn(t, t + 1, FALLTHROUGH, env); Quick review: We have env->explored_states[t+1] access in the } else { /* conditional jump with two edges */ ret = push_insn(t, t + 1, FALLTHROUGH, env); if (ret == 1) goto peek_stack; else if (ret < 0) goto err_free; ret = push_insn(t, t + insns[t].off + 1, BRANCH, env); if (ret == 1) goto peek_stack; else if (ret < 0) goto err_free; } } else { push_insn call. At this point insn[t].off could be 0, no? 
insn[t].off can be anything, but the first thing that push_insn() checks is: if (w < 0 || w >= env->prog->len) only then it does: env->explored_states[w] = STATE_LIST_MARK; so we're good there. Though thanks for triple checking :)
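The guard Alexei points to is the classic bounds-check-before-write pattern. A minimal userspace mock (illustrative names, not the actual verifier code) shows why the range test must come first: an out-of-range target must produce an error before anything touches the array.

```c
#include <stddef.h>

/* Mock of the guard push_insn() applies: validate the target index
 * before writing STATE_LIST_MARK (represented here as 1) into the
 * explored_states array. */
static int mark_explored(int *explored_states, size_t len, long w)
{
	if (w < 0 || (size_t)w >= len)
		return -1;              /* out of range: nothing written */
	explored_states[w] = 1;         /* stands in for STATE_LIST_MARK */
	return 0;
}
```

The bug being fixed was exactly the other code path writing `explored_states[t + 1]` without this check, which for `t == insn_cnt - 1` lands one element past the end of the allocation.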
Re: [PATCH 16/17] get rid of the size argument of sock_sendmsg()
From: David Laight Date: Wed, 15 Apr 2015 08:37:01 + > From: David Miller >> Sent: 14 April 2015 18:56 >> > On Tue, Apr 14, 2015 at 04:25:24PM +, David Laight wrote: >> >> From: Al Viro >> >> > Sent: 11 April 2015 22:18 >> >> > it's equal to iov_iter_count(&msg->msg_iter) in all cases >> >> >> >> I don't know whether this is guaranteed for iov[] that come from >> >> userspace. >> >> >> >> In any case iov_iter_count() is non-trivial and you don't >> >> really want to call it when unnecessary. >> > >> > Really? >> > >> > static inline size_t iov_iter_count(struct iov_iter *i) >> > { >> > return i->count; >> > } >> >> This just made my day. > > I hate accessor functions I hate people who jump to conclusions and don't actually read the code in question before commenting. :-)
Re: [PATCH net] bpf: fix verifier memory corruption
On Di, 2015-04-14 at 15:57 -0700, Alexei Starovoitov wrote: > Due to missing bounds check the DAG pass of the BPF verifier can corrupt > the memory which can cause random crashes during program loading: > > [8.449451] BUG: unable to handle kernel paging request at > [8.451293] IP: [] kmem_cache_alloc_trace+0x8d/0x2f0 > [8.452329] Oops: [#1] SMP > [8.452329] Call Trace: > [8.452329] [] bpf_check+0x852/0x2000 > [8.452329] [] bpf_prog_load+0x1e4/0x310 > [8.452329] [] ? might_fault+0x5f/0xb0 > [8.452329] [] SyS_bpf+0x806/0xa30 > > Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier") > Signed-off-by: Alexei Starovoitov > --- > Many things need to align for this crash to be seen, yet I managed to hit it. > In my case JA was last insn, 't' was 255 and explored_states array > had 256 elements. I've double checked other similar paths and all seems clean. > > kernel/bpf/verifier.c |3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index a28e09c7825d..36508e69e92a 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -1380,7 +1380,8 @@ peek_stack: > /* tell verifier to check for equivalent states >* after every call and jump >*/ > - env->explored_states[t + 1] = STATE_LIST_MARK; > + if (t + 1 < insn_cnt) > + env->explored_states[t + 1] = STATE_LIST_MARK; > } else { > /* conditional jump with two edges */ > ret = push_insn(t, t + 1, FALLTHROUGH, env); Quick review: We have env->explored_states[t+1] access in the } else { /* conditional jump with two edges */ ret = push_insn(t, t + 1, FALLTHROUGH, env); if (ret == 1) goto peek_stack; else if (ret < 0) goto err_free; >>> ret = push_insn(t, t + insns[t].off + 1, BRANCH, env); if (ret == 1) goto peek_stack; else if (ret < 0) goto err_free; } } else { push_insn call. At this point insn[t].off could be 0, no? 
Thanks, Hannes
Re: ip_tunnel: Remove gratuitous skb scrubbing
On 15/04/2015 15:57, Herbert Xu wrote: On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote: [snip] Subject: skbuff: Do not scrub skb mark within the same name space The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels: Maybe add a Fixes tag? Fixes: ea23192e8e57 ("tunnels: harmonize cleanup done on skb on rx path") harmonize cleanup done on skb on rx path") broke anyone trying to use netfilter marking across IPv4 tunnels. While most of the fields that are cleared by skb_scrub_packet don't matter, the netfilter mark must be preserved. This patch rearranges skb_scurb_packet to preserve the mark field. nit: s/scurb/scrub Else it's fine for me.
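The save/restore idea behind the rearranged skb_scrub_packet() can be shown with a small userspace analog. The struct and the blanket memset below are illustrative only (the real function clears individual sk_buff fields, not the whole structure); the point is that the mark is saved before scrubbing and put back when the packet stays in the same namespace.

```c
#include <string.h>

/* Illustrative packet state, not the real sk_buff. */
struct pkt {
	unsigned int mark;      /* netfilter mark */
	unsigned int tc_index;
	unsigned int priority;
};

static void pkt_scrub(struct pkt *p, int xnet)
{
	unsigned int mark = p->mark;    /* save before clearing */

	memset(p, 0, sizeof(*p));
	if (!xnet)                      /* same namespace: mark survives */
		p->mark = mark;
}
```

Crossing a namespace boundary (xnet true) still clears the mark, which preserves the isolation property the original scrub was added for.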
[PATCH iproute2 -next] tc: built-in eBPF exec proxy
This work follows upon commit 6256f8c9e45f ("tc, bpf: finalize eBPF support for cls and act front-end") and takes up the idea proposed by Hannes Frederic Sowa to spawn a shell (or any other command) that holds generated eBPF map file descriptors. File descriptors, based on their id, are being fetched from the same unix domain socket as demonstrated in the bpf_agent, the shell spawned via execvpe(2) and the map fds passed over the environment, and thus are made available to applications in the fashion of std{in,out,err} for read/write access, for example in case of iproute2's examples/bpf/: # env | grep BPF BPF_NUM_MAPS=3 BPF_MAP1=6<- BPF_MAP_ID_QUEUE (id 1) BPF_MAP0=5<- BPF_MAP_ID_PROTO (id 0) BPF_MAP2=7<- BPF_MAP_ID_DROPS (id 2) # ls -la /proc/self/fd [...] lrwx--. 1 root root 64 Apr 14 16:46 0 -> /dev/pts/4 lrwx--. 1 root root 64 Apr 14 16:46 1 -> /dev/pts/4 lrwx--. 1 root root 64 Apr 14 16:46 2 -> /dev/pts/4 [...] lrwx--. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map lrwx--. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map lrwx--. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map The advantage (as opposed to the direct/native usage) is that now the shell is map fd owner and applications can terminate and easily reattach to descriptors w/o any kernel changes. Moreover, multiple applications can easily read/write eBPF maps simultaneously. To further allow users for experimenting with that, next step is to add a small helper that can get along with simple data types, so that also shell scripts can make use of bpf syscall, f.e to read/write into maps. Generally, this allows for prepopulating maps, or any runtime altering which could influence eBPF program behaviour (f.e. different run-time classifications, skb modifications, ...), dumping of statistics, etc. 
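The fd hand-off described above can be exercised from the consumer side roughly like this. The sketch mirrors the patch's bpf_map_set_env() but uses plain getenv() and adds minimal error handling; the BPF_NUM_MAPS bound check is an assumption of this sketch, not a guarantee of the proxy.

```c
#include <stdio.h>
#include <stdlib.h>

/* Recover the map fds the exec proxy exported via the environment.
 * Returns the number of fds stored into tfd[], or -1 on error. */
static int bpf_fds_from_env(int *tfd, int max)
{
	char key[64];
	char *val;
	int i, num;

	val = getenv("BPF_NUM_MAPS");
	if (!val)
		return -1;
	num = atoi(val);
	if (num < 0 || num > max)
		return -1;
	for (i = 0; i < num; i++) {
		snprintf(key, sizeof(key), "BPF_MAP%d", i);
		val = getenv(key);
		if (!val)
			return -1;
		tfd[i] = atoi(val);     /* fd number, e.g. 5, 6, 7 */
	}
	return num;
}
```

An application spawned under the proxy shell can call this at startup and then use the recovered fds with the bpf(2) lookup/update commands, which is exactly the reattach-without-kernel-changes property the commit message advertises.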
Reference: http://thread.gmane.org/gmane.linux.network/357471/focus=357860 Suggested-by: Hannes Frederic Sowa Signed-off-by: Daniel Borkmann --- ( Stephen, this applies to current net-next branch of iproute2. ) examples/bpf/bpf_agent.c | 55 ++ examples/bpf/bpf_prog.c | 27 + tc/Makefile | 7 ++- tc/e_bpf.c | 147 +++ tc/f_bpf.c | 2 +- tc/m_action.c| 2 +- tc/m_bpf.c | 2 +- tc/tc.c | 9 +-- tc/tc_bpf.c | 98 --- tc/tc_bpf.h | 15 - tc/tc_common.h | 3 + tc/tc_exec.c | 109 +++ tc/tc_filter.c | 2 +- tc/tc_util.h | 16 -- 14 files changed, 454 insertions(+), 40 deletions(-) create mode 100644 tc/e_bpf.c create mode 100644 tc/tc_exec.c diff --git a/examples/bpf/bpf_agent.c b/examples/bpf/bpf_agent.c index 0f481b1..df09e0f 100644 --- a/examples/bpf/bpf_agent.c +++ b/examples/bpf/bpf_agent.c @@ -24,6 +24,8 @@ * -- Happy eBPF hacking! ;) */ +#define _GNU_SOURCE + #include #include #include @@ -31,6 +33,7 @@ #include #include #include + #include #include #include @@ -102,6 +105,23 @@ static void bpf_dump_proto(int fd) printf("\n"); } +static void bpf_dump_map_data(int *tfd) +{ + int i; + + for (i = 0; i < 30; i++) { + const int period = 5; + + printf("data, period: %dsec\n", period); + + bpf_dump_drops(tfd[BPF_MAP_ID_DROPS]); + bpf_dump_queue(tfd[BPF_MAP_ID_QUEUE]); + bpf_dump_proto(tfd[BPF_MAP_ID_PROTO]); + + sleep(period); + } +} + static void bpf_info_loop(int *fds, struct bpf_map_aux *aux) { int i, tfd[BPF_MAP_ID_MAX]; @@ -122,16 +142,22 @@ static void bpf_info_loop(int *fds, struct bpf_map_aux *aux) tfd[aux->ent[i].id] = fds[i]; } - for (i = 0; i < 30; i++) { - int period = 5; + bpf_dump_map_data(tfd); +} - printf("data, period: %dsec\n", period); +static void bpf_map_set_env(int *tfd) +{ + char key[64], *val; + int i; - bpf_dump_drops(tfd[BPF_MAP_ID_DROPS]); - bpf_dump_queue(tfd[BPF_MAP_ID_QUEUE]); - bpf_dump_proto(tfd[BPF_MAP_ID_PROTO]); + for (i = 0; i < BPF_MAP_ID_MAX; i++) { + memset(key, 0, sizeof(key)); + snprintf(key, sizeof(key), "BPF_MAP%d", i); - sleep(period); + val 
= secure_getenv(key); + assert(val != NULL); + + tfd[i] = atoi(val); } } @@ -186,9 +212,17 @@ int main(int argc, char **argv) struct sockaddr_un addr; int fd, ret, i; - if (argc < 2) { - fprintf(stderr, "Usage: %s \n", argv[0]); - exit(1); + /* When arguments are being passed, we take it as a path +* to a Unix domain socket, otherwise we gr
Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
On Wed, 2015-04-15 at 15:19 +0100, George Dunlap wrote: > On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley [...] > > From a networking point of view, the backend is a switch. Is it OK to > > consider the packet to have been transmitted from the guest point of > > view once the backend is aware of the packet? > > > > This would help justify the skb_orphan() in the frontend. > > This sounds sensible to me, particularly if virtio_net is already doing it. I also find Malcolm's argument above pretty compelling. Ian.
Re: [PATCH net-next 0/6] net: hip04: fix some problem for hip04 drivers
On Wednesday 15 April 2015 20:30:02 Ding Tianhong wrote: > The patches series was used to fix the issues of the hip04 driver, and added > some optimizations according to some good suggestion. > > Thanks, that looks much better, except for patch 4 that I commented on. I had at some point sent a patch that is archived at http://lists.infradead.org/pipermail/linux-arm-kernel/2015-January/318120.html I believe that one is also still needed, but I have not tested whether it is correct. Can you have a look at the patch from back then and see if it works, of if you find something wrong about it? I'm sending the unmodified patch from then here again for you to apply or comment. It will have to be rebased on top of your current changes. Arnd 8< Subject: [PATCH] net/hip04: refactor interrupt masking The hip04 ethernet driver currently acknowledges all interrupts directly in the interrupt handler, and leaves all interrupts except the RX data enabled the whole time. This causes multiple problems: - When more packets come in between the original interrupt and the NAPI poll function, we will get an extraneous RX interrupt as soon as interrupts are enabled again. - The two error interrupts are by definition combining all errors that may have happened since the last time they were handled, but just acknowledging the irq without dealing with the cause of the condition makes it come back immediately. In particular, when NAPI is intentionally stalling the rx queue, this causes a storm of "rx dropped" messages. - The access to the 'reg_inten' field in hip_priv is used for serializing access, but is in fact racy itself. To deal with these issues, the driver is changed to only acknowledge the IRQ at the point when it is being handled. The RX interrupts now get acked right before reading the number of received frames to keep spurious interrupts to the absolute minimum without losing interrupts. 
As there is no tx-complete interrupt, only an informational tx-dropped one, we now ack that in the tx reclaim handler, hoping that clearing the descriptors has also eliminated the error condition. The only place that reads the reg_inten now just relies on napi_schedule_prep() to return whether NAPI is active or not, and the poll() function is then going to ack and reenabled the IRQ if necessary. Finally, when we disable interrupts, they are now all disabled together, in order to simplify the logic and to avoid getting rx-dropped interrupts when NAPI is in control of the rx queue. Signed-off-by: Arnd Bergmann diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c index 525214ef5984..83773247a368 100644 --- a/drivers/net/ethernet/hisilicon/hip04_eth.c +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c @@ -56,6 +56,8 @@ #define RCV_DROP BIT(7) #define TX_DROPBIT(6) #define DEF_INT_ERR(RCV_NOBUF | RCV_DROP | TX_DROP) +#define DEF_INT_RX (RCV_INT | RCV_NOBUF | RCV_DROP) +#define DEF_INT_TX (TX_DROP) #define DEF_INT_MASK (RCV_INT | DEF_INT_ERR) /* TX descriptor config */ @@ -154,7 +156,6 @@ struct hip04_priv { unsigned int port; unsigned int speed; unsigned int duplex; - unsigned int reg_inten; struct napi_struct napi; struct net_device *ndev; @@ -303,17 +304,15 @@ static void hip04_mac_enable(struct net_device *ndev) val |= GE_RX_PORT_EN | GE_TX_PORT_EN; writel_relaxed(val, priv->base + GE_PORT_EN); - /* clear rx int */ - val = RCV_INT; - writel_relaxed(val, priv->base + PPE_RINT); + /* clear prior interrupts */ + writel_relaxed(DEF_INT_MASK, priv->base + PPE_RINT); /* config recv int */ val = GE_RX_INT_THRESHOLD | GE_RX_TIMEOUT; writel_relaxed(val, priv->base + PPE_CFG_RX_PKT_INT); /* enable interrupt */ - priv->reg_inten = DEF_INT_MASK; - writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN); + writel_relaxed(DEF_INT_MASK, priv->base + PPE_INTEN); } static void hip04_mac_disable(struct net_device *ndev) @@ -322,8 +321,7 @@ 
static void hip04_mac_disable(struct net_device *ndev) u32 val; /* disable int */ - priv->reg_inten &= ~(DEF_INT_MASK); - writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN); + writel_relaxed(0, priv->base + PPE_INTEN); /* disable tx & rx */ val = readl_relaxed(priv->base + GE_PORT_EN); @@ -403,6 +401,8 @@ static int hip04_tx_reclaim(struct net_device *ndev, bool force) priv->tx_tail = tx_tail; smp_wmb(); /* Ensure tx_tail visible to xmit */ + writel_relaxed(DEF_INT_TX, priv->base + PPE_RINT); + out: if (pkts_compl || bytes_compl) netdev_completed_queue(ndev, pkts_compl, bytes_compl); @@ -458,9 +458,7 @@ static int hip04_mac_start_xmit(
Re: [PATCH net-next 6/6] net: hip04: add ratelimit for rx/tx drops skb
On Wednesday 15 April 2015 20:30:08 Ding Tianhong wrote: > There can be quite a lot of rx/tx drops message and affect > useful message, so need to ratelimit them to not overwhelm > logging. > > Signed-off-by: Ding Tianhong > Cc: "David S. Miller" > Cc: Eric Dumazet > Cc: Arnd Bergmann > Cc: Zhangfei Gao > Cc: Dan Carpenter > Cc: Joe Perches > Acked-by: Arnd Bergmann
Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley wrote: >> >> But the main concern here is it basically breaks back pressure. >> >> And you do not want this, unless there is no other choice. >> > > virtio_net already use's skb_orphan() in it's transmit path. It seems > only fair that other virtual network drivers behave in the same way. > > There are no easy solutions to decrease the transmit latency for > netback/netfront. We map the guest memory through to the backend to > avoid memory copies. The frontend memory can only be freed once the > network driver has completed transmitting the packet in the backend. > > Modern network drivers can be quite slow at freeing the skb's once > transmitted (the packet is already on the wire as far as they are > concerned) and this delay is compounded by needing the signal the > completion of the transmit back to the frontend (by IPI in worst case). > > From a networking point of view, the backend is a switch. Is it OK to > consider the packet to have been transmitted from the guest point of > view once the backend is aware of the packet? > > This would help justify the skb_orphan() in the frontend. This sounds sensible to me, particularly if virtio_net is already doing it. -George
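Why skb_orphan() affects back pressure can be seen in a toy model (illustrative types, not the kernel's): orphaning uncharges the packet from the sending socket's write-memory budget and drops the ownership pointer, so the socket is free to queue more data even though this packet is still in flight through the backend.

```c
#include <stddef.h>

struct toy_sock { int wmem_alloc; };                    /* bytes charged */
struct toy_skb  { struct toy_sock *sk; int truesize; };

static void toy_skb_orphan(struct toy_skb *skb)
{
	if (!skb->sk)
		return;                         /* already orphaned */
	skb->sk->wmem_alloc -= skb->truesize;   /* destructor: uncharge */
	skb->sk = NULL;                         /* no owner any more */
}
```

In the real kernel the uncharge happens via skb->destructor (typically sock_wfree); the model just makes explicit that after orphaning, nothing ties the in-flight packet back to the sender's sk_wmem_alloc, which is the "breaks back pressure" concern quoted above.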
Re: [PATCH net-next 4/6] net: hip04: Don't free the tx skb when the buffer is not ready for xmit
On Wednesday 15 April 2015 20:30:06 Ding Tianhong wrote: > @@ -489,6 +487,8 @@ static int hip04_mac_start_xmit(struct sk_buff *skb, > struct net_device *ndev) > > /* Ensure tx_head update visible to tx reclaim */ > smp_wmb(); > + count++; > + priv->tx_head = TX_NEXT(tx_head); > > Something is wrong here, the comment does not match the code any more, because the smp_wmb() won't actually make the tx_head update visible. Arnd
Re: [PATCH net-next 3/6] net: hip04: Solve the problem of the skb memory allocation failure
On Wednesday 15 April 2015 20:30:05 Ding Tianhong wrote: > The driver will alloc some skb buffer for hareware queue, but without > considering the case of memory allocation failure, when memory is low, > the skb may be null and panic the system, so break the loop when skb is > null and try to alloc the memory again to fix this problem. > > Signed-off-by: Ding Tianhong > Cc: "David S. Miller" > Cc: Eric Dumazet > Cc: Arnd Bergmann > Cc: Zhangfei Gao > Cc: Dan Carpenter > Cc: Joe Perches > Looks reasonable to me, though I haven't done a deep analysis to see if there are remaining problems in this area. Arnd
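The shape of the fix described in the quoted commit message can be sketched in userspace: break out of the refill loop when an allocation fails instead of installing a NULL buffer, and report how many slots were actually filled so the caller can retry later. alloc_buf() is a stand-in for the skb allocator and fail_after simulates memory pressure; names are illustrative, not the hip04 code.

```c
#include <stdlib.h>

static void *alloc_buf(int fail_after, int i)
{
	return i < fail_after ? malloc(64) : NULL;
}

static int refill_ring(void **ring, int n, int fail_after)
{
	int i;

	for (i = 0; i < n; i++) {
		void *buf = alloc_buf(fail_after, i);

		if (!buf)
			break;          /* retry later; never queue NULL */
		ring[i] = buf;
	}
	return i;                       /* slots actually filled */
}
```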
Re: [PATCH net-next 5/6] net: hip04: remove the unnecessary check
On Wednesday 15 April 2015 20:30:07 Ding Tianhong wrote: > diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c > b/drivers/net/ethernet/hisilicon/hip04_eth.c > index ff9d19c..a7ab1d9 100644 > --- a/drivers/net/ethernet/hisilicon/hip04_eth.c > +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c > @@ -440,7 +440,7 @@ static int hip04_tx_reclaim(struct net_device *ndev, bool > force) > smp_wmb(); /* Ensure tx_tail visible to xmit */ > > out: > - if (pkts_compl || bytes_compl) > + if (bytes_compl) > netdev_completed_queue(ndev, pkts_compl, bytes_compl); > > if (unlikely(netif_queue_stopped(ndev)) && (count < (TX_DESC_NUM - > 1))) > -- > Acked-by: Arnd Bergmann
Re: [PATCH net-next 2/6] net: hip04: use the big endian for tx and rx desc
On Wednesday 15 April 2015 20:30:04 Ding Tianhong wrote: > The hip04 ethernet uses big endian for tx and rx, so set desc to > big endian and remove the unused next_addr. I don't understand: > diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c > b/drivers/net/ethernet/hisilicon/hip04_eth.c > index b0a7f03..6473462 100644 > --- a/drivers/net/ethernet/hisilicon/hip04_eth.c > +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c > @@ -132,19 +132,18 @@ > #define HIP04_MIN_TX_COALESCE_FRAMES 1 > > struct tx_desc { > - u32 send_addr; > - u32 send_size; > - u32 next_addr; > - u32 cfg; > - u32 wb_addr; > + __be32 send_addr; > + __be32 send_size; > + __be32 cfg; > + __be32 wb_addr; > } __aligned(64); I would think this is a hardware structure, does this not break access to the cfg and wb_addr fields if you remove next_addr? Arnd
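Arnd's point is that the descriptor describes device-visible memory, so deleting a field shifts every later field's offset. A quick `offsetof` check makes this concrete (the two structs below are userspace stand-ins using `uint32_t` instead of the kernel's `u32`/`__be32`, and the 64-byte alignment is omitted):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original hardware layout: the device expects cfg at byte 12. */
struct tx_desc_v1 {
    uint32_t send_addr;
    uint32_t send_size;
    uint32_t next_addr;
    uint32_t cfg;
    uint32_t wb_addr;
};

/* Layout with next_addr removed: cfg and wb_addr silently move up,
 * so the device would read the wrong words. */
struct tx_desc_v2 {
    uint32_t send_addr;
    uint32_t send_size;
    uint32_t cfg;
    uint32_t wb_addr;
};
```

If the field is unused by software but still part of the hardware layout, the usual fix is to keep it as a named reserved/padding member rather than delete it.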
Re: [PATCH net-next 1/6] net: hip04: Set more appropriate value for tx coalesce num
On Wednesday 15 April 2015 20:30:03 Ding Tianhong wrote: > According to Arnd's suggestion, the user might set the tx coalesce value > in a large range for different workloads, so we should define them to > appropriate values. > > Signed-off-by: Ding Tianhong > Cc: "David S. Miller" > Cc: Eric Dumazet > Cc: Arnd Bergmann > Cc: Zhangfei Gao > Cc: Dan Carpenter > Cc: Joe Perches > Acked-by: Arnd Bergmann
RE: [PATCH v2 net] bnx2x: Fix busy_poll vs netpoll
> > > > BTW, this looks quite generic - isn't it possible to take it out of > > the driver and push it into the networking core, uniforming it in the > > process? > > Yes, this is the plan I have in mind, once net-next is opened again ;) > > Thanks ! > I'm not familiar with any real strong test-suite for the busy poll, but I did try running all kinds of netperf connections with various busy_{poll,read} values and performance seemed reasonable, and verified that netconsole manages to work. So for all it's worth, Tested-by: Yuval Mintz
Re: ip_tunnel: Remove gratuitous skb scrubbing
On Wed, Apr 15, 2015 at 06:22:29PM +0800, Herbert Xu wrote: > > Yes this is better. I'm currently auditing all the other bits > that are cleared to see if there is anything else that we should > preserve for tunneling. OK the only other thing that we may wish to preserve is secmark. James, can you confirm whether secmark should be preserved or cleared for tunnels within the same name space? Up until December 2014 it was preserved but since then it has been cleared. For the mark here is my final tested patch. ---8<--- Subject: skbuff: Do not scrub skb mark within the same name space The commit ea23192e8e577dfc51e0f4fc5ca113af334edff9 ("tunnels: harmonize cleanup done on skb on rx path") broke anyone trying to use netfilter marking across IPv4 tunnels. While most of the fields that are cleared by skb_scrub_packet don't matter, the netfilter mark must be preserved. This patch rearranges skb_scrub_packet to preserve the mark field. Signed-off-by: Herbert Xu diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3b6e583..a185427 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4124,19 +4124,22 @@ EXPORT_SYMBOL(skb_try_coalesce); */ void skb_scrub_packet(struct sk_buff *skb, bool xnet) { - if (xnet) - skb_orphan(skb); skb->tstamp.tv64 = 0; skb->pkt_type = PACKET_HOST; skb->skb_iif = 0; skb->ignore_df = 0; skb_dst_drop(skb); - skb->mark = 0; skb_sender_cpu_clear(skb); skb_init_secmark(skb); secpath_reset(skb); nf_reset(skb); nf_reset_trace(skb); + + if (!xnet) + return; + + skb_orphan(skb); + skb->mark = 0; } EXPORT_SYMBOL_GPL(skb_scrub_packet); -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
On Mon, Apr 13, 2015 at 2:49 PM, Eric Dumazet wrote: > On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote: > >> Is the problem perhaps that netback/netfront delays TX completion? >> Would it be better to see if that can be addressed properly, so that >> the original purpose of the patch (fighting bufferbloat) can be >> achieved while not degrading performance for Xen? Or at least, so >> that people get decent performance out of the box without having to >> tweak TCP parameters? > > Sure, please provide a patch, that does not break back pressure. > > But just in case, if Xen performance relied on bufferbloat, it might be > very difficult to reach a stable equilibrium : Any small change in stack > or scheduling might introduce a significant difference in 'raw > performance'. So help me understand this a little bit here. tcp_limit_output_bytes limits the amount of data allowed to be "in-transit" between a send() and the wire, is that right? And so the "bufferbloat" problem you're talking about here are TCP buffers inside the kernel, and/or buffers in the NIC, is that right? So ideally, you want this to be large enough to fill the "pipeline" all the way from send() down to actually getting out on the wire; otherwise, you'll have gaps in the pipeline, and the machinery won't be working at full throttle. And the reason it's a problem is that many NICs now come with large send buffers; and effectively what happens then is that this makes the "pipeline" longer -- as the buffer fills up, the time between send() and the wire is increased. This increased latency causes delays in round-trip-times and interferes with the mechanisms TCP uses to try to determine what the actual sustainable rate of data transmission is.
By limiting the number of "in-transit" bytes, you make sure that neither the kernel nor the NIC are going to have packets queued up for long lengths of time in buffers, and you keep this "pipeline" as close to the actual minimal length of the pipeline as possible. And it sounds like for your 40G NIC, 128k is big enough to fill the pipeline without unduly making it longer by introducing buffering. Is that an accurate picture of what you're trying to achieve? But the problem for xennet (and a number of other drivers), as I understand it, is that at the moment the "pipeline" itself is just longer -- it just takes a longer time from the time you send a packet to the time it actually gets out on the wire. So it's not actually accurate to say that "Xen performance relies on bufferbloat". There's no buffering involved -- the pipeline is just longer, and so to fill up the pipeline you need more data. Basically, to maximize throughput while minimizing buffering, for *any* connection, tcp_limit_output_bytes should ideally be around (min_tx_latency * max_bandwidth). For physical NICs, the minimum latency is really small, but for xennet -- and I'm guessing for a lot of virtualized cards -- the min_tx_latency will be a lot higher, requiring a much higher ideal tcp_limit_output value. Rather than trying to pick a single value which will be good for all NICs, it seems like it would make more sense to have this vary depending on the parameters of the NIC. After all, for NICs that have low throughput -- say, old 100MiB NICs -- even 128k may still introduce a significant amount of buffering. Obviously one solution would be to allow the drivers themselves to set the tcp_limit_output_bytes, but that seems like a maintenance nightmare. Another simple solution would be to allow drivers to indicate whether they have a high transmit latency, and have the kernel use a higher value by default when that's the case.
Probably the most sustainable solution would be to have the networking layer keep track of the average and minimum transmit latencies, and automatically adjust tcp_limit_output_bytes based on that. (Keeping the minimum as well as the average because the whole problem with bufferbloat is that the more data you give it, the longer the apparent "pipeline" becomes.) Thoughts? -George
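George's rule of thumb, ideal limit ≈ min_tx_latency * max_bandwidth, is easy to check numerically. The helper below is purely illustrative (not a kernel API); latency is in microseconds, bandwidth in bits per second, and the result in bytes:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper for the bandwidth-delay estimate discussed
 * above: bytes in flight needed to keep the pipeline full. */
static uint64_t ideal_limit_bytes(uint64_t min_tx_latency_us,
                                  uint64_t max_bandwidth_bps)
{
    /* bytes/sec * seconds of latency */
    return max_bandwidth_bps / 8 * min_tx_latency_us / 1000000;
}
```

With an assumed 25 us minimum latency on a 40 Gbit/s NIC this gives 125000 bytes, in the same ballpark as the 128k value mentioned above; a virtual NIC with ten times the latency would want roughly ten times the limit at the same rate.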
Re: [PATCH] neighbour.c: Avoid GC directly after state change
No reply on this. In net/core/neighbour.c:neigh_timer_handler I see: if (state & NUD_REACHABLE) { if (time_before_eq(now, neigh->confirmed + neigh->parms->reachable_time)) { neigh_dbg(2, "neigh %p is still alive\n", neigh); next = neigh->confirmed + neigh->parms->reachable_time; } else if (time_before_eq(now, neigh->used + NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME))) { neigh_dbg(2, "neigh %p is delayed\n", neigh); neigh->nud_state = NUD_DELAY; neigh->updated = jiffies; neigh_suspect(neigh); next = now + NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME); } else { neigh_dbg(2, "neigh %p is suspected\n", neigh); neigh->nud_state = NUD_STALE; neigh->updated = jiffies; neigh_suspect(neigh); notify = 1; } } else ... Why is the second test there in the first place? In the normal case, "neigh->used" does not get updated until the entry is found STALE in the periodic work. Why not use "neigh->confirmed" and another sysctl variable? if (state & NUD_REACHABLE) { if (time_before_eq(now, neigh->confirmed + neigh->parms->reachable_time)) { neigh_dbg(2, "neigh %p is still alive\n", neigh); next = neigh->confirmed + neigh->parms->reachable_time; } else if (time_before_eq(now, - neigh->used + - NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME))) { + neigh->confirmed + + NEIGH_VAR(neigh->parms, DELAY_REPROBE_TIME))) { neigh_dbg(2, "neigh %p is delayed\n", neigh); neigh->nud_state = NUD_DELAY; neigh->updated = jiffies; neigh_suspect(neigh); next = now + NEIGH_VAR(neigh->parms, DELAY_PROBE_TIME); } else { neigh_dbg(2, "neigh %p is suspected\n", neigh); neigh->nud_state = NUD_STALE; neigh->updated = jiffies; neigh_suspect(neigh); notify = 1; } } else ... if DELAY_REPROBE_TIME is larger than "reachable_time", then the kernel will send out ARP probes when it is about to lose communication with a remote machine. This is what we need. If it is smaller, then it will go from REACHABLE to STALE. The initial value of DELAY_REPROBE_TIME needs to be settable in Kconfig to allow the selection of functionality.
I am told that setting stuff using sysctl has a performance penalty, when interfaces are dynamically created and deleted in hundreds. Best Regards, Ulf Samuelsson On 04/10/2015 10:26 AM, Ulf Samuelsson wrote: On 03/12/2015 07:26 PM, David Miller wrote: I hate changes like this. By making this a Kconfig option it cannot be dynamic, and every distribution is going to have to scratch their head and decide what to set this to. That's seriously suboptimal. If you want to change behavior based upon whether userspace is managing the damn table, make it so the user doing so has to ask for the new behavior at _RUN TIME_ via the netlink interface or similar. Picking the guard time itself at compile time is also undesirable. And you don't even want a damn timer, what you want is for the state of the entry to be frozen and for the user to "release" it by either adjusting the state to something else or marking in some other way to allow it to be unfrozen and released again. Why put it to chance with some timeout? Make things explicit. The desired functionality is that if communication stops, you want to send out ARP probes, before the entry is deleted. The current (pseudo) code of the neigh timer is: if (state & NUD_REACHABLE) { if (now <= "confirmed + "reachable_time")) { ... /* We are OK */ } else if (now < "used" + DELAY_PROBE_TIME) {/* Never happens */ state = NUD_DELAY; } else { state = NUD_STALE; notify = 1; } We never see the state being changed from REACHABLE to DELAY, so the probes are not being sent out, instead you always go from REACHABLE to STALE. DELAY_PROBE_TIME is set to (5 x HZ) and "used" seems to be only set by the periodic_work routine when the neigh entry is in STALE state, and then it is too late. It is also set by "arp_find" which is used by "broken" devices. In practice, the second condition: "(now < "used" + DELAY_PROBE_TIME)" is never used. What is the intention of this test?
By adding a new test + parameter, we would get the desired functionality, and no need to listen for notifications or doing ARP state updates from applications. if (now <= "confirmed + "reachable_time")) { ... /* We are OK */ + else if (now <= "confirmed + "reprobe_time")) { + state = NUD_DELAY; } else if (now < "used
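Ulf's proposed REACHABLE-state decision can be sketched as a self-contained function (times are plain tick counts standing in for jiffies, and DELAY_REPROBE_TIME is the proposed, hypothetical knob keyed off "confirmed" rather than "used"):

```c
#include <assert.h>

enum nud_state { NUD_REACHABLE, NUD_DELAY, NUD_STALE };

/* Userspace sketch of the proposed timer logic: stay REACHABLE within
 * reachable_time of the last confirmation, then go to DELAY (which
 * triggers unicast ARP probes) within the reprobe window, and only
 * after that fall back to STALE. */
static enum nud_state next_state(unsigned long now,
                                 unsigned long confirmed,
                                 unsigned long reachable_time,
                                 unsigned long delay_reprobe_time)
{
    if (now <= confirmed + reachable_time)
        return NUD_REACHABLE;            /* still alive */
    if (now <= confirmed + delay_reprobe_time)
        return NUD_DELAY;                /* probe before giving up */
    return NUD_STALE;
}
```

With delay_reprobe_time larger than reachable_time the entry goes REACHABLE to DELAY and probes are sent; with it smaller, the entry goes straight to STALE, which matches the behavioural choice described above.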
Re: [PATCH 1/7] net: refactor __netif_receive_skb_core
On 04/15/2015 05:44 AM, David Laight wrote: From: Alexander Duyck Sent: 10 April 2015 20:56 On 04/10/2015 05:15 AM, Pablo Neira Ayuso wrote: +another_round: + ret = __netif_receive_skb_ingress(skb, pfmemalloc, orig_dev); + switch (ret) { + case NET_RX_SUCCESS: + case NET_RX_DROP: + break; + case __NET_RX_ANOTHER_ROUND: + goto another_round; + } rcu_read_unlock(); + return ret; } Couldn't this just be done as a do while? It would probably be easier to read and there wouldn't be any need for the another_round label anymore. Or an infinite loop with a break at the bottom, as in: for (;;) { switch (...) { case again: continue; default: break; } break; } David That is even more complicated. What I was thinking was do { ret = __netif_receive_skb_ingress(skb, pfmemalloc, orig_dev); } while (ret == __NET_RX_ANOTHER_ROUND); Either that or the switch could just be replaced with an if statement since the only case that really goes anywhere is __NET_RX_ANOTHER_ROUND and everything else just exits anyway. I had just suggested a do/while since that lets the goto be dropped, but an if would allow for avoiding any unnecessary indentation on the call to __netif_receive_skb_ingress. - Alex
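Alexander's do/while shape can be shown end to end with a stub in place of __netif_receive_skb_ingress() (the stub, its return values, and the constant value of __NET_RX_ANOTHER_ROUND are all illustrative, not the kernel's):

```c
#include <assert.h>

#define NET_RX_SUCCESS          0
#define NET_RX_DROP             1
#define __NET_RX_ANOTHER_ROUND  2   /* value is illustrative */

/* Stub standing in for __netif_receive_skb_ingress(): asks for two
 * extra rounds, then succeeds. */
static int rounds_left = 2;
static int fake_ingress(void)
{
    return rounds_left-- > 0 ? __NET_RX_ANOTHER_ROUND : NET_RX_SUCCESS;
}

/* The suggested refactor: loop directly on the return value, so no
 * another_round label, no goto, no switch. */
static int receive_loop(void)
{
    int ret;

    do {
        ret = fake_ingress();
    } while (ret == __NET_RX_ANOTHER_ROUND);

    return ret;
}
```

Any terminal value (NET_RX_SUCCESS or NET_RX_DROP) falls out of the loop and is returned unchanged, which is exactly what the original switch did with more machinery.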
Re: [PATCH net-next v2 1/4] cxgb4/iw_cxgb4/cxgb4i: remove duplicate definitions
On Mon, Apr 13, 2015 at 10:38:08AM -0600, Jason Gunthorpe wrote: > On Mon, Apr 13, 2015 at 07:34:23PM +0530, Varun Prakash wrote: > > define struct ulptx_idata in common header file t4_msg.h > > to remove duplicate definitions. > > The Infiniband side of this patch looks OK. > > Reviewed-By: Jason Gunthorpe > > Just some random thoughts on the other patches: > - Try and use 'if (IS_ENABLED(CONFIG_XX))' over #ifdef > to improve compile test coverage. This would drop a fair number > of ifdefs. FCoE specific structures and functions are defined only if CONFIG_CHELSIO_T4_FCOE is enabled. #ifdef CONFIG_CHELSIO_T4_FCOE void cxgb_fcoe_init_ddp(struct adapter *adap) { ... } #endif If CONFIG_CHELSIO_T4_FCOE is disabled then the following code will result in the build error "implicit declaration of function cxgb_fcoe_init_ddp": if (IS_ENABLED(CONFIG_CHELSIO_T4_FCOE)) cxgb_fcoe_init_ddp(adap); > - Some of the commit messages are short, or non-existent (ie #4) > - Generally, no need for 'static inline' in a .c file, the compiler knows > what to do. > > Regards, > Jason > Thanks
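The build-error concern is usually answered with a stub header: when the option is off, declare an empty static inline so the `if (IS_ENABLED(...))` call still compiles and the dead call is optimized away. The sketch below is a userspace imitation of that pattern, not the real Kconfig machinery (FCOE_ENABLED stands in for IS_ENABLED(CONFIG_CHELSIO_T4_FCOE)):

```c
#include <assert.h>

struct adapter { int ddp_ready; };   /* illustrative stand-in */

#ifdef CONFIG_CHELSIO_T4_FCOE
#define FCOE_ENABLED 1
/* Real implementation, built only when the option is on. */
static void cxgb_fcoe_init_ddp(struct adapter *adap) { adap->ddp_ready = 1; }
#else
#define FCOE_ENABLED 0
/* No-op stub: keeps the caller compiling when the option is off. */
static inline void cxgb_fcoe_init_ddp(struct adapter *adap) { (void)adap; }
#endif

static void adap_init(struct adapter *adap)
{
    adap->ddp_ready = 0;
    if (FCOE_ENABLED)        /* stands in for IS_ENABLED(CONFIG_...) */
        cxgb_fcoe_init_ddp(adap);
}
```

Both branches of adap_init() get compile-tested regardless of the configuration, which is the coverage benefit Jason is pointing at.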
RE: [PATCH 1/7] net: refactor __netif_receive_skb_core
From: Alexander Duyck > Sent: 10 April 2015 20:56 > On 04/10/2015 05:15 AM, Pablo Neira Ayuso wrote: > > +another_round: > > + ret = __netif_receive_skb_ingress(skb, pfmemalloc, orig_dev); > > + switch (ret) { > > + case NET_RX_SUCCESS: > > + case NET_RX_DROP: > > + break; > > + case __NET_RX_ANOTHER_ROUND: > > + goto another_round; > > + } > > rcu_read_unlock(); > > + > > return ret; > > } > > > > > > Couldn't this just be done as a do while? It would probably be easier > to read and there wouldn't be any need for the another_round label anymore. Or an infinite loop with a break at the bottom, as in: for (;;) { switch (...) { case again: continue; default: break; } break; } David
[PATCH net-next 1/6] net: hip04: Set more appropriate value for tx coalesce num
According to Arnd's suggestion, the user might set the tx coalesce value in a large range for different workloads, so we should define them to appropriate values. Signed-off-by: Ding Tianhong Cc: "David S. Miller" Cc: Eric Dumazet Cc: Arnd Bergmann Cc: Zhangfei Gao Cc: Dan Carpenter Cc: Joe Perches --- drivers/net/ethernet/hisilicon/hip04_eth.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c index b72d238..b0a7f03 100644 --- a/drivers/net/ethernet/hisilicon/hip04_eth.c +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c @@ -126,10 +126,10 @@ #define DRV_NAME "hip04-ether" #define DRV_VERSION "v1.0" -#define HIP04_MAX_TX_COALESCE_USECS 200 -#define HIP04_MIN_TX_COALESCE_USECS 100 -#define HIP04_MAX_TX_COALESCE_FRAMES 200 -#define HIP04_MIN_TX_COALESCE_FRAMES 100 +#define HIP04_MAX_TX_COALESCE_USECS 10 +#define HIP04_MIN_TX_COALESCE_USECS 1 +#define HIP04_MAX_TX_COALESCE_FRAMES (TX_DESC_NUM - 1) +#define HIP04_MIN_TX_COALESCE_FRAMES 1 struct tx_desc { u32 send_addr; -- 1.8.0
[PATCH net-next 2/6] net: hip04: use the big endian for tx and rx desc
The hip04 ethernet uses big endian for tx and rx, so set desc to big endian and remove the unused next_addr. Signed-off-by: Ding Tianhong Cc: "David S. Miller" Cc: Eric Dumazet Cc: Arnd Bergmann Cc: Zhangfei Gao Cc: Dan Carpenter Cc: Joe Perches --- drivers/net/ethernet/hisilicon/hip04_eth.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c index b0a7f03..6473462 100644 --- a/drivers/net/ethernet/hisilicon/hip04_eth.c +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c @@ -132,19 +132,18 @@ #define HIP04_MIN_TX_COALESCE_FRAMES 1 struct tx_desc { - u32 send_addr; - u32 send_size; - u32 next_addr; - u32 cfg; - u32 wb_addr; + __be32 send_addr; + __be32 send_size; + __be32 cfg; + __be32 wb_addr; } __aligned(64); struct rx_desc { - u16 reserved_16; - u16 pkt_len; - u32 reserve1[3]; - u32 pkt_err; - u32 reserve2[4]; + __be16 reserved_16; + __be16 pkt_len; + __be32 reserve1[3]; + __be32 pkt_err; + __be32 reserve2[4]; }; struct hip04_priv { -- 1.8.0
[PATCH net-next 3/6] net: hip04: Solve the problem of the skb memory allocation failure
The driver will alloc some skb buffer for hardware queue, but without considering the case of memory allocation failure, when memory is low, the skb may be null and panic the system, so break the loop when skb is null and try to alloc the memory again to fix this problem. Signed-off-by: Ding Tianhong Cc: "David S. Miller" Cc: Eric Dumazet Cc: Arnd Bergmann Cc: Zhangfei Gao Cc: Dan Carpenter Cc: Joe Perches --- drivers/net/ethernet/hisilicon/hip04_eth.c | 68 -- 1 file changed, 54 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c index 6473462..7533858 100644 --- a/drivers/net/ethernet/hisilicon/hip04_eth.c +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c @@ -131,6 +131,8 @@ #define HIP04_MAX_TX_COALESCE_FRAMES (TX_DESC_NUM - 1) #define HIP04_MIN_TX_COALESCE_FRAMES 1 +#define HIP04_RX_BUFFER_WRITE 16 + struct tx_desc { __be32 send_addr; __be32 send_size; @@ -180,6 +182,7 @@ struct hip04_priv { /* written only by tx cleanup */ unsigned int tx_tail ____cacheline_aligned_in_smp; + unsigned int rx_tail ____cacheline_aligned_in_smp; }; static inline unsigned int tx_count(unsigned int head, unsigned int tail) @@ -187,6 +190,11 @@ static inline unsigned int tx_count(unsigned int head, unsigned int tail) return (head - tail) % (TX_DESC_NUM - 1); } +static inline unsigned int rx_count(unsigned int head, unsigned int tail) +{ + return (head - tail) % (RX_DESC_NUM - 1); +} + static void hip04_config_port(struct net_device *ndev, u32 speed, u32 duplex) { struct hip04_priv *priv = netdev_priv(ndev); @@ -363,6 +371,35 @@ static int hip04_set_mac_address(struct net_device *ndev, void *addr) return 0; } +static int hip04_alloc_rx_buffers(struct net_device *ndev, int cleaned_count) +{ + struct hip04_priv *priv = netdev_priv(ndev); + unsigned char *buf; + dma_addr_t phys; + int i = priv->rx_tail; + + while (cleaned_count) { + buf = netdev_alloc_frag(priv->rx_buf_size); + if (!buf) + break; + + phys =
dma_map_single(&ndev->dev, buf, + RX_BUF_SIZE, DMA_FROM_DEVICE); + if (dma_mapping_error(&ndev->dev, phys)) + break; + + priv->rx_buf[i] = buf; + priv->rx_phys[i] = phys; + hip04_set_recv_desc(priv, phys); + i = RX_NEXT(i); + cleaned_count--; + } + + priv->rx_tail = i; + + return 0; +} + static int hip04_tx_reclaim(struct net_device *ndev, bool force) { struct hip04_priv *priv = netdev_priv(ndev); @@ -482,8 +519,7 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget) struct sk_buff *skb; unsigned char *buf; bool last = false; - dma_addr_t phys; - int rx = 0; + int rx = 0, cleaned_count = 0; int tx_remaining; u16 len; u32 err; @@ -491,8 +527,10 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget) while (cnt && !last) { buf = priv->rx_buf[priv->rx_head]; skb = build_skb(buf, priv->rx_buf_size); - if (unlikely(!skb)) + if (unlikely(!skb)) { net_dbg_ratelimited("build_skb failed\n"); + goto done; + } dma_unmap_single(&ndev->dev, priv->rx_phys[priv->rx_head], RX_BUF_SIZE, DMA_FROM_DEVICE); @@ -519,18 +557,15 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget) rx++; } - buf = netdev_alloc_frag(priv->rx_buf_size); - if (!buf) - goto done; - phys = dma_map_single(&ndev->dev, buf, - RX_BUF_SIZE, DMA_FROM_DEVICE); - if (dma_mapping_error(&ndev->dev, phys)) - goto done; - priv->rx_buf[priv->rx_head] = buf; - priv->rx_phys[priv->rx_head] = phys; - hip04_set_recv_desc(priv, phys); - priv->rx_head = RX_NEXT(priv->rx_head); + + cleaned_count = rx_count(priv->rx_head, priv->rx_tail); + /* return some buffers to hardware , one at a time is too slow */ + if (++cleaned_count >= HIP04_RX_BUFFER_WRITE) { + hip04_alloc_rx_buffers(ndev, cleaned_count); + cleaned_count = 0; + } + if (rx >= budget) goto done; @@ -545,6 +580,10 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget) } napi_complete(napi); done: + cleaned_count = rx_count(priv->rx_head, priv->rx_tail); + if (cleaned_co
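The batching idea behind the patch, only return buffers to hardware once enough have been consumed, reduces to a small sketch (the ring size below is illustrative, the threshold is the patch's HIP04_RX_BUFFER_WRITE, and the actual refill is stubbed out):

```c
#include <assert.h>

#define RX_DESC_NUM 64                   /* illustrative ring size */
#define HIP04_RX_BUFFER_WRITE 16         /* refill threshold from the patch */

/* Consumed-but-not-refilled descriptors, shaped like the patch's
 * rx_count() helper. */
static unsigned int rx_count(unsigned int head, unsigned int tail)
{
    return (head - tail) % (RX_DESC_NUM - 1);
}

/* Refill only once a batch has accumulated; as the patch comment says,
 * returning buffers to hardware one at a time is too slow. */
static int should_refill(unsigned int head, unsigned int tail)
{
    return rx_count(head, tail) >= HIP04_RX_BUFFER_WRITE;
}
```

In the driver, hip04_alloc_rx_buffers() would then be called with the accumulated count whenever should_refill() becomes true, and once more at poll completion to drain any remainder.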
[PATCH net-next 5/6] net: hip04: remove the unnecessary check
Testing bytes_compl should be enough, because there is no way that pkts_compl could be 0 if bytes_compl is not 0. Signed-off-by: Ding Tianhong Cc: "David S. Miller" Cc: Eric Dumazet Cc: Arnd Bergmann Cc: Zhangfei Gao Cc: Dan Carpenter Cc: Joe Perches --- drivers/net/ethernet/hisilicon/hip04_eth.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c index ff9d19c..a7ab1d9 100644 --- a/drivers/net/ethernet/hisilicon/hip04_eth.c +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c @@ -440,7 +440,7 @@ static int hip04_tx_reclaim(struct net_device *ndev, bool force) smp_wmb(); /* Ensure tx_tail visible to xmit */ out: - if (pkts_compl || bytes_compl) + if (bytes_compl) netdev_completed_queue(ndev, pkts_compl, bytes_compl); if (unlikely(netif_queue_stopped(ndev)) && (count < (TX_DESC_NUM - 1))) -- 1.8.0
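The claim can be illustrated with a trivial model of the reclaim counters (nothing here is driver API; the point is just that any real skb has a length of at least one byte):

```c
#include <assert.h>

struct reclaim_stats {
    unsigned int pkts_compl;
    unsigned int bytes_compl;
};

/* Every completed packet contributes at least one byte, so
 * bytes_compl == 0 exactly when pkts_compl == 0 -- which is why
 * testing bytes_compl alone suffices before calling
 * netdev_completed_queue(). */
static void complete_pkt(struct reclaim_stats *s, unsigned int len)
{
    s->pkts_compl += 1;
    s->bytes_compl += len;      /* len >= 1 for any real skb */
}
```

The invariant (bytes_compl != 0) == (pkts_compl != 0) holds before and after any sequence of completions, so the dropped half of the condition was indeed redundant.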