Re: [PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset
On 31.10.2018 09:42, Alexey Kodanev wrote: > cb->args[2] can store the pointer to the struct fib6_walker, > allocated in inet6_dump_fib(). On the next loop iteration in > rtnl_dump_all(), 'memset(&cb, 0, sizeof(cb->args))' can reset > that pointer, leaking the memory [1]. > On the second thought we could as well save the state of fib6_walker in inet6_dump_fib() with cb->data. That should fix the leak too. Is it sounds reasonable? Thanks, Alexey > Fix it by calling cb->done, if it is set, before filling 'cb->args' > with zeros. > > Looks like the recent changes in rtnl_dump_all() contributed to > the appearance of this kmemleak [1], commit c63586dc9b3e ("net: > rtnl_dump_all needs to propagate error from dumpit function") > breaks the loop only on an error now. > > [1]: > unreferenced object 0x88001322a200 (size 96): > comm "sshd", pid 1484, jiffies 4296032768 (age 1432.542s) > hex dump (first 32 bytes): > 00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de > 18 09 41 36 00 88 ff ff 18 09 41 36 00 88 ff ff ..A6..A6 > backtrace: > [<95846b39>] kmem_cache_alloc_trace+0x151/0x220 > [<7d12709f>] inet6_dump_fib+0x68d/0x940 > [<2775a316>] rtnl_dump_all+0x1d9/0x2d0 > [] netlink_dump+0x945/0x11a0 > [<2f43485f>] __netlink_dump_start+0x55d/0x800 > [ ] rtnetlink_rcv_msg+0x4fa/0xa00 > [<9b5761f3>] netlink_rcv_skb+0x29c/0x420 > [<87a1dae1>] rtnetlink_rcv+0x15/0x20 > [<691b703b>] netlink_unicast+0x4e3/0x6c0 > [ ] netlink_sendmsg+0x7f2/0xba0 > [<96d2aa60>] sock_sendmsg+0xba/0xf0 > [<8c1b786f>] __sys_sendto+0x1e4/0x330 > [<19587b3f>] __x64_sys_sendto+0xe1/0x1a0 > [<071f4d56>] do_syscall_64+0x9f/0x300 > [<2737577f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [<57587684>] 0x > > Fixes: 1b43af5480c3 ("[IPV6]: Increase number of possible routing tables to > 2^32") > Signed-off-by: Alexey Kodanev > --- > net/core/rtnetlink.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c > index f679c7a..314c683 100644 > --- a/net/core/rtnetlink.c > +++ b/net/core/rtnetlink.c > @@ -3362,6 +3362,8 @@ static int rtnl_dump_all(struct sk_buff *skb, struct > netlink_callback *cb) > continue; > > if (idx > s_idx) { > + if (cb->done) > + cb->done(cb); > memset(&cb->args[0], 0, sizeof(cb->args)); > cb->prev_seq = 0; > cb->seq = 0; >
[PATCH net] rtnetlink: invoke 'cb->done' destructor before 'cb->args' reset
cb->args[2] can store the pointer to the struct fib6_walker, allocated in inet6_dump_fib(). On the next loop iteration in rtnl_dump_all(), 'memset(&cb, 0, sizeof(cb->args))' can reset that pointer, leaking the memory [1]. Fix it by calling cb->done, if it is set, before filling 'cb->args' with zeros. Looks like the recent changes in rtnl_dump_all() contributed to the appearance of this kmemleak [1], commit c63586dc9b3e ("net: rtnl_dump_all needs to propagate error from dumpit function") breaks the loop only on an error now. [1]: unreferenced object 0x88001322a200 (size 96): comm "sshd", pid 1484, jiffies 4296032768 (age 1432.542s) hex dump (first 32 bytes): 00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de 18 09 41 36 00 88 ff ff 18 09 41 36 00 88 ff ff ..A6..A6 backtrace: [<95846b39>] kmem_cache_alloc_trace+0x151/0x220 [<7d12709f>] inet6_dump_fib+0x68d/0x940 [<2775a316>] rtnl_dump_all+0x1d9/0x2d0 [] netlink_dump+0x945/0x11a0 [<2f43485f>] __netlink_dump_start+0x55d/0x800 [ ] rtnetlink_rcv_msg+0x4fa/0xa00 [<9b5761f3>] netlink_rcv_skb+0x29c/0x420 [<87a1dae1>] rtnetlink_rcv+0x15/0x20 [<691b703b>] netlink_unicast+0x4e3/0x6c0 [ ] netlink_sendmsg+0x7f2/0xba0 [<96d2aa60>] sock_sendmsg+0xba/0xf0 [<8c1b786f>] __sys_sendto+0x1e4/0x330 [<19587b3f>] __x64_sys_sendto+0xe1/0x1a0 [<071f4d56>] do_syscall_64+0x9f/0x300 [<2737577f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [<57587684>] 0x Fixes: 1b43af5480c3 ("[IPV6]: Increase number of possible routing tables to 2^32") Signed-off-by: Alexey Kodanev --- net/core/rtnetlink.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index f679c7a..314c683 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -3362,6 +3362,8 @@ static int rtnl_dump_all(struct sk_buff *skb, struct netlink_callback *cb) continue; if (idx > s_idx) { + if (cb->done) + cb->done(cb); memset(&cb->args[0], 0, sizeof(cb->args)); cb->prev_seq = 0; cb->seq = 0; -- 1.8.3.1
Re: Fw: [Bug 201423] New: eth0: hw csum failure
On 30.10.2018 12:04, Andre Tomt wrote: On 30.10.2018 11:58, Andre Tomt wrote: On 27.10.2018 23:41, Andre Tomt wrote: On 26.10.2018 13:45, Andre Tomt wrote: On 25.10.2018 19:38, Eric Dumazet wrote: On 10/24/2018 12:41 PM, Andre Tomt wrote: It eventually showed up again with mlx4, on 4.18.16 + fix and also on 4.19. I still do not have a useful packet capture. It is running a torrent client serving up various linux distributions. Have you also applied this fix ? https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=db4f1be3ca9b0ef7330763d07bf4ace83ad6f913 No. I've applied it now to 4.19 and will report back if anything shows up. Just hit it on the simpler server; no VRF, no tunnels, no nat/conntrack. Only a basic stateless nftables ruleset and a vlan netdev (unlikely to be the one triggering this I guess; it has only v4 traffic). I'm currently testing 4.19 with the recomended commit added, plus these to sort out some GRO issues (on a hunch, unsure if related): https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=a8305bff685252e80b7c60f4f5e7dd2e63e38218 https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=992cba7e276d438ac8b0a8c17b147b37c8c286f7 https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=ece23711dd956cd5053c9cb03e9fe0668f9c8894 and I *think* it is behaving better now? it's not conclusive as it could take a while to trip in this environment but some of the test servers have not shown anything bad in almost 24h. Sorry, s/some of the/none of the I think it is fairly safe to say 4.19 + mlx4 + these 4 commits is OK. At least for my workload. Servers are now 51-61 hours in, no splats. I also added ntp pool traffic to one of them to make things a little more exciting. Not sure what is needed for 4.18, I dont have the mental bandwidth to test that right now. Also no idea about the similar looking mlx5 splats reported elsewhere.
Re: [RFC PATCH 1/4] ptp: add PTP_SYS_OFFSET_EXTENDED ioctl
On Tue, Oct 30, 2018 at 07:23:51PM -0700, Richard Cochran wrote: > This collection of automatic variables is getting ugly. May I ask you > to prefix a patch that puts them into reverse Christmas tree before > your changes? (Patch below) Forgot the diff. Here it is... --- diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c index 2012551d93e0..b54b8158ff8a 100644 --- a/drivers/ptp/ptp_chardev.c +++ b/drivers/ptp/ptp_chardev.c @@ -121,18 +121,18 @@ int ptp_open(struct posix_clock *pc, fmode_t fmode) long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg) { - struct ptp_clock_caps caps; - struct ptp_clock_request req; - struct ptp_sys_offset *sysoff = NULL; - struct ptp_sys_offset_precise precise_offset; - struct ptp_pin_desc pd; struct ptp_clock *ptp = container_of(pc, struct ptp_clock, clock); + struct ptp_sys_offset_precise precise_offset; + struct system_device_crosststamp xtstamp; struct ptp_clock_info *ops = ptp->info; + struct ptp_sys_offset *sysoff = NULL; + struct ptp_clock_request req; + struct ptp_clock_caps caps; struct ptp_clock_time *pct; + unsigned int i, pin_index; + struct ptp_pin_desc pd; struct timespec64 ts; - struct system_device_crosststamp xtstamp; int enable, err = 0; - unsigned int i, pin_index; switch (cmd) {
Re: [PATCH v2 net] igb: shorten maximum PHC timecounter update interval
On Fri, Oct 26, 2018 at 07:13:00PM +0200, Miroslav Lichvar wrote: > The timecounter needs to be updated at least once per ~550 seconds in > order to avoid a 40-bit SYSTIM timestamp to be misinterpreted as an old > timestamp. > > Since commit 500462a9de65 ("timers: Switch to a non-cascading wheel"), > scheduling of delayed work seems to be less accurate and a requested > delay of 540 seconds may actually be longer than 550 seconds. Also, the > PHC may be adjusted to run up to 6% faster than real time and the system > clock up to 10% slower. Shorten the delay to 360 seconds to be sure the > timecounter is updated in time. > > This fixes an issue with HW timestamps on 82580/I350/I354 being off by > ~1100 seconds for few seconds every ~9 minutes. Acked-by: Richard Cochran
Re: [RFC PATCH 3/4] igb: add support for extended PHC gettime
On Fri, Oct 26, 2018 at 06:27:41PM +0200, Miroslav Lichvar wrote: > +static int igb_ptp_gettimex(struct ptp_clock_info *ptp, > + struct ptp_system_timestamp *sts) > +{ > + struct igb_adapter *igb = container_of(ptp, struct igb_adapter, > +ptp_caps); > + struct e1000_hw *hw = &igb->hw; > + unsigned long flags; > + u32 lo, hi; > + u64 ns; > + > + spin_lock_irqsave(&igb->tmreg_lock, flags); > + > + /* 82576 doesn't have SYSTIMR */ > + if (igb->hw.mac.type == e1000_82576) { Instead of if/then/else, can't you follow the pattern of providing different function flavors ... > + ptp_read_system_prets(sts); > + lo = rd32(E1000_SYSTIML); > + ptp_read_system_postts(sts); > + hi = rd32(E1000_SYSTIMH); > + } else { > + ptp_read_system_prets(sts); > + rd32(E1000_SYSTIMR); > + ptp_read_system_postts(sts); > + lo = rd32(E1000_SYSTIML); > + hi = rd32(E1000_SYSTIMH); > + } > + > + /* SYSTIM on I210/I211 counts time in seconds and nanoseconds */ > + if (igb->hw.mac.type == e1000_i210 || igb->hw.mac.type == e1000_i211) { > + sts->phc_ts.tv_sec = hi; > + sts->phc_ts.tv_nsec = lo; > + } else { > + ns = timecounter_cyc2time(&igb->tc, ((u64)hi << 32) | lo); > + sts->phc_ts = ns_to_timespec64(ns); > + } > + > + spin_unlock_irqrestore(&igb->tmreg_lock, flags); > + > + return 0; > +} > + > static int igb_ptp_settime_82576(struct ptp_clock_info *ptp, >const struct timespec64 *ts) > { > @@ -1125,6 +1165,7 @@ void igb_ptp_init(struct igb_adapter *adapter) > adapter->ptp_caps.adjfreq = igb_ptp_adjfreq_82576; > adapter->ptp_caps.adjtime = igb_ptp_adjtime_82576; > adapter->ptp_caps.gettime64 = igb_ptp_gettime_82576; > + adapter->ptp_caps.gettimex64 = igb_ptp_gettimex; ... here? Thanks, Richard > adapter->ptp_caps.settime64 = igb_ptp_settime_82576; > adapter->ptp_caps.enable = igb_ptp_feature_enable; > adapter->cc.read = igb_ptp_read_82576; > @@ -1144,6 +1185,7 @@ void igb_ptp_init(struct igb_adapter *adapter) > adapter->ptp_caps.adjfine = igb_ptp_adjfine_82580; > adapter->ptp_caps.adjtime = igb_ptp_adjtime_82576; > adapter->ptp_caps.gettime64 = igb_ptp_gettime_82576; > + adapter->ptp_caps.gettimex64 = igb_ptp_gettimex; > adapter->ptp_caps.settime64 = igb_ptp_settime_82576; > adapter->ptp_caps.enable = igb_ptp_feature_enable; > adapter->cc.read = igb_ptp_read_82580; > @@ -1172,6 +1214,7 @@ void igb_ptp_init(struct igb_adapter *adapter) > adapter->ptp_caps.adjfine = igb_ptp_adjfine_82580; > adapter->ptp_caps.adjtime = igb_ptp_adjtime_i210; > adapter->ptp_caps.gettime64 = igb_ptp_gettime_i210; > + adapter->ptp_caps.gettimex64 = igb_ptp_gettimex; > adapter->ptp_caps.settime64 = igb_ptp_settime_i210; > adapter->ptp_caps.enable = igb_ptp_feature_enable_i210; > adapter->ptp_caps.verify = igb_ptp_verify_pin;
Re: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization
On Fri, Oct 26, 2018 at 06:27:38PM +0200, Miroslav Lichvar wrote: > The other patches add support for the new ioctl to the e1000e, igb, > and ixgbe driver. Tests with few different NICs in different machines > (and PCIe slots) show that: > - with an I219 (e1000e) the measured delay improved from 2500 to 1300 ns > and the error in the measured offset, when compared to cross > timestamping, was reduced by a factor of 5 > - with an I210 (igb) the delay improved from 5100 to 1700 ns > - with an I350 (igb) the delay improved from 2300 to 750 ns > - with an X550 (ixgbe) the delay improved from 1950 to 650 ns Nice work! This is a welcome improvement. Thanks, Richard
Re: [RFC PATCH 1/4] ptp: add PTP_SYS_OFFSET_EXTENDED ioctl
On Fri, Oct 26, 2018 at 06:27:39PM +0200, Miroslav Lichvar wrote: > diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c > index 2012551d93e0..1a04c437fd4f 100644 > --- a/drivers/ptp/ptp_chardev.c > +++ b/drivers/ptp/ptp_chardev.c > @@ -124,11 +124,13 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int > cmd, unsigned long arg) > struct ptp_clock_caps caps; > struct ptp_clock_request req; > struct ptp_sys_offset *sysoff = NULL; > + struct ptp_sys_offset_extended *sysoff_extended = NULL; How about a more succinct name, like 'extoff' ? > struct ptp_sys_offset_precise precise_offset; > struct ptp_pin_desc pd; > struct ptp_clock *ptp = container_of(pc, struct ptp_clock, clock); > struct ptp_clock_info *ops = ptp->info; > struct ptp_clock_time *pct; > + struct ptp_system_timestamp sts; > struct timespec64 ts; > struct system_device_crosststamp xtstamp; > int enable, err = 0; This collection of automatic variables is getting ugly. May I ask you to prefix a patch that puts them into reverse Christmas tree before your changes? (Patch below) > @@ -211,6 +213,43 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, > unsigned long arg) > err = -EFAULT; > break; > > + case PTP_SYS_OFFSET_EXTENDED: > + if (!ptp->info->gettimex64) { > + err = -EOPNOTSUPP; > + break; > + } > + sysoff_extended = memdup_user((void __user *)arg, > + sizeof(*sysoff_extended)); As pointed out before, this needs to be freed, and > + if (IS_ERR(sysoff_extended)) { > + err = PTR_ERR(sysoff_extended); > + sysoff = NULL; here you meant, sysoff_extended = NULL; > + break; > + } > + if (sysoff_extended->n_samples > PTP_MAX_SAMPLES) { > + err = -EINVAL; > + break; > + } > + > + pct = &sysoff_extended->ts[0]; > + for (i = 0; i < sysoff_extended->n_samples; i++) { > + err = ptp->info->gettimex64(ptp->info, &sts); > + if (err) > + break; > + pct->sec = sts.sys_ts1.tv_sec; > + pct->nsec = sts.sys_ts1.tv_nsec; > + pct++; > + pct->sec = sts.phc_ts.tv_sec; > + pct->nsec = sts.phc_ts.tv_nsec; > + pct++; > + pct->sec = sts.sys_ts2.tv_sec; > + pct->nsec = sts.sys_ts2.tv_nsec; > + pct++; > + } > + if (copy_to_user((void __user *)arg, sysoff_extended, > + sizeof(*sysoff_extended))) > + err = -EFAULT; > + break; > + > case PTP_SYS_OFFSET: > sysoff = memdup_user((void __user *)arg, sizeof(*sysoff)); > if (IS_ERR(sysoff)) { > diff --git a/include/linux/ptp_clock_kernel.h > b/include/linux/ptp_clock_kernel.h > index 51349d124ee5..79321d929925 100644 > --- a/include/linux/ptp_clock_kernel.h > +++ b/include/linux/ptp_clock_kernel.h > @@ -39,6 +39,13 @@ struct ptp_clock_request { > }; > > struct system_device_crosststamp; > + KernelDoc please for this: > +struct ptp_system_timestamp { > + struct timespec64 sys_ts1; > + struct timespec64 phc_ts; > + struct timespec64 sys_ts2; > +}; > + Thanks, Richard
Re: [PATCH net] net: dsa: microchip: initialize mutex before use
On Tue, Oct 30, 2018 at 04:45:49PM -0700, tristram...@microchip.com wrote: > From: Tristram Ha > > Initialize mutex before use. Avoid kernel complaint when > CONFIG_DEBUG_LOCK_ALLOC is enabled. > > Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477") > Signed-off-by: Tristram Ha > --- > drivers/net/dsa/microchip/ksz_common.c | 11 ++- > 1 file changed, 6 insertions(+), 5 deletions(-) > > diff --git a/drivers/net/dsa/microchip/ksz_common.c > b/drivers/net/dsa/microchip/ksz_common.c > index 54e0ca6..f6f0662 100644 > --- a/drivers/net/dsa/microchip/ksz_common.c > +++ b/drivers/net/dsa/microchip/ksz_common.c > @@ -1117,11 +1117,6 @@ static int ksz_switch_init(struct ksz_device *dev) > { > int i; > > - mutex_init(&dev->reg_mutex); > - mutex_init(&dev->stats_mutex); > - mutex_init(&dev->alu_mutex); > - mutex_init(&dev->vlan_mutex); > - > dev->ds->ops = &ksz_switch_ops; > > for (i = 0; i < ARRAY_SIZE(ksz_switch_chips); i++) { > @@ -1206,6 +1201,12 @@ int ksz_switch_register(struct ksz_device *dev) > if (dev->pdata) > dev->chip_id = dev->pdata->chip_id; > > + /* mutex is used in next function call. */ > + mutex_init(&dev->reg_mutex); > + mutex_init(&dev->stats_mutex); > + mutex_init(&dev->alu_mutex); > + mutex_init(&dev->vlan_mutex); Hi Tristram The comment is no longer relevant now that all mutexes are initialised. Andrew
[no subject]
This is Mr Ubaithullah Masood from Banco Santander Bank S A Hong Kong. I got your contact during my private search on net..Would you be interested in a business transaction to act as the beneficiary to claim 9.8M USD funds of my deceased client who died intestate ( Without a Will)and my bank wants to confiscate the funds if the funds are not claimed soon. Do get back for more details as this deal is safe and all documentation will be done legally and we will share 50% each. Thanks.
Re: Fw: [Bug 201423] New: eth0: hw csum failure
> On 10/16/2018 06:00 AM, Eric Dumazet wrote: > > On Mon, Oct 15, 2018 at 11:30 PM Andre Tomt wrote: > >> > >> On 15.10.2018 17:41, Eric Dumazet wrote: > >>> On Mon, Oct 15, 2018 at 8:15 AM Stephen Hemminger > Something is changed between 4.17.12 and 4.18, after bisecting the > problem I > got the following first bad commit: > > commit 88078d98d1bb085d72af8437707279e203524fa5 > Author: Eric Dumazet > Date: Wed Apr 18 11:43:15 2018 -0700 > > net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends > > After working on IP defragmentation lately, I found that some large > packets defeat CHECKSUM_COMPLETE optimization because of NIC adding > zero paddings on the last (small) fragment. > > While removing the padding with pskb_trim_rcsum(), we set > skb->ip_summed > to CHECKSUM_NONE, forcing a full csum validation, even if all prior > fragments had CHECKSUM_COMPLETE set. > > We can instead compute the checksum of the part we are trimming, > usually smaller than the part we keep. > > Signed-off-by: Eric Dumazet > Signed-off-by: David S. Miller > > >>> > >>> Thanks for bisecting ! > >>> > >>> This commit is known to expose some NIC/driver bugs. > >>> > >>> Look at commit 12b03558cef6d655d0d394f5e98a6fd07c1f6c0f > >>> ("net: sungem: fix rx checksum support") for one driver needing a fix. > >>> > >>> I assume SKY2_HW_NEW_LE is not set on your NIC ? > >>> > >> > >> I've seen similar on several systems with mlx4 cards when using 4.18.x - > >> that is hw csum failure followed by some backtrace. > >> > >> Only seems to happen on systems dealing with quite a bit of UDP. > >> > > > > Strange, because mlx4 on IPv6+UDP should not use CHECKSUM_COMPLETE, > > but CHECKSUM_UNNECESSARY > > > > I would be nice to track this a bit further, maybe by providing the > > full packet content. > > > >> Example from 4.18.10: > >>> [635607.740574] p0xe0: hw csum failure > >>> [635607.740598] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.18.0-1 #1 > >>> [635607.740599] Hardware name: Supermicro Super Server/X10SRL-F, BIOS > >>> 2.0b 05/02/2017 > >>> [635607.740599] Call Trace: > >>> [635607.740602] > >>> [635607.740611] dump_stack+0x5c/0x7b > >>> [635607.740617] __skb_gro_checksum_complete+0x9a/0xa0 > >>> [635607.740621] udp6_gro_receive+0x211/0x290 > >>> [635607.740624] ipv6_gro_receive+0x1a8/0x390 > >>> [635607.740627] dev_gro_receive+0x33e/0x550 > >>> [635607.740628] napi_gro_frags+0xa2/0x210 > >>> [635607.740635] mlx4_en_process_rx_cq+0xa01/0xb40 [mlx4_en] > >>> [635607.740648] ? mlx4_cq_completion+0x23/0x70 [mlx4_core] > >>> [635607.740654] ? mlx4_eq_int+0x373/0xc80 [mlx4_core] > >>> [635607.740657] mlx4_en_poll_rx_cq+0x55/0xf0 [mlx4_en] > >>> [635607.740658] net_rx_action+0xe0/0x2e0 > >>> [635607.740662] __do_softirq+0xd8/0x2e5 > >>> [635607.740666] irq_exit+0xb4/0xc0 > >>> [635607.740667] do_IRQ+0x85/0xd0 > >>> [635607.740670] common_interrupt+0xf/0xf > >>> [635607.740671] > >>> [635607.740675] RIP: 0010:cpuidle_enter_state+0xb4/0x2a0 > >>> [635607.740675] Code: 31 ff e8 df a6 ba ff 45 84 f6 74 17 9c 58 0f 1f 44 > >>> 00 00 f6 c4 02 0f 85 d8 01 00 00 31 ff e8 13 81 bf ff fb 66 0f 1f 44 00 > >>> 00 <4c> 29 fb 48 ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7 > >>> [635607.740701] RSP: 0018:a5c206353ea8 EFLAGS: 0246 ORIG_RAX: > >>> ffd9 > >>> [635607.740703] RAX: 8d72ffd20f00 RBX: 00024214f597c5b0 RCX: > >>> 001f > >>> [635607.740703] RDX: 00024214f597c5b0 RSI: 00020780 RDI: > >>> > >>> [635607.740704] RBP: 0004 R08: 002542bfbefa99fa R09: > >>> > >>> [635607.740705] R10: a5c206353e88 R11: 00c5 R12: > >>> af0aaf78 > >>> [635607.740706] R13: 8d72ffd297d8 R14: R15: > >>> 00024214f58c2ed5 > >>> [635607.740709] ? cpuidle_enter_state+0x91/0x2a0 > >>> [635607.740712] do_idle+0x1d0/0x240 > >>> [635607.740715] cpu_startup_entry+0x5f/0x70 > >>> [635607.740719] start_secondary+0x185/0x1a0 > >>> [635607.740722] secondary_startup_64+0xa5/0xb0 > >>> [635607.740731] p0xe0: hw csum failure > >>> [635607.740745] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.18.0-1 #1 > >>> [635607.740746] Hardware name: Supermicro Super Server/X10SRL-F, BIOS > >>> 2.0b 05/02/2017 > >>> [635607.740746] Call Trace: > >>> [635607.740747] > >>> [635607.740750] dump_stack+0x5c/0x7b > >>> [635607.740755] __skb_checksum_complete+0xb8/0xd0 > >>> [635607.740760] __udp6_lib_rcv+0xa6b/0xa70 > >>> [635607.740767] ? nft_do_chain_inet+0x7a/0xd0 [nf_tables] > >>> [635607.740770] ? nft_do_chain_inet+0x7a/0xd0 [nf_tables] > >>> [635607.740774] ip6_input_finish+0xc0/0x460 > >>> [635607.740776] ip6_input+0x2b/0x90 > >>> [635607.740778] ? ip6_rcv_finish+0x110/0x110 > >>> [635607.740780] ipv6_rcv+0x2cd/0x4b0 > >>> [635607
[PATCH net] net: dsa: microchip: initialize mutex before use
From: Tristram Ha Initialize mutex before use. Avoid kernel complaint when CONFIG_DEBUG_LOCK_ALLOC is enabled. Fixes: b987e98e50ab90e5 ("dsa: add DSA switch driver for Microchip KSZ9477") Signed-off-by: Tristram Ha --- drivers/net/dsa/microchip/ksz_common.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c index 54e0ca6..f6f0662 100644 --- a/drivers/net/dsa/microchip/ksz_common.c +++ b/drivers/net/dsa/microchip/ksz_common.c @@ -1117,11 +1117,6 @@ static int ksz_switch_init(struct ksz_device *dev) { int i; - mutex_init(&dev->reg_mutex); - mutex_init(&dev->stats_mutex); - mutex_init(&dev->alu_mutex); - mutex_init(&dev->vlan_mutex); - dev->ds->ops = &ksz_switch_ops; for (i = 0; i < ARRAY_SIZE(ksz_switch_chips); i++) { @@ -1206,6 +1201,12 @@ int ksz_switch_register(struct ksz_device *dev) if (dev->pdata) dev->chip_id = dev->pdata->chip_id; + /* mutex is used in next function call. */ + mutex_init(&dev->reg_mutex); + mutex_init(&dev->stats_mutex); + mutex_init(&dev->alu_mutex); + mutex_init(&dev->vlan_mutex); + if (ksz_switch_detect(dev)) return -EINVAL; -- 1.9.1
Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable
From: David Ahern Date: Tue, 30 Oct 2018 16:06:46 -0600 > or make the table per namespace. This will increase namespace create/destroy cost, so I'd rather not for something like this.
[RFC PATCH] net: vlan ipsec-xfrm offload
This is an RFC post - not working code. I welcome your comments. If the real device offers IPsec offload, why shouldn't the VLAN device also offer the offload? This essentially becomes a passthru interface to the real device by swapping out the xfrm_state's netdevice reference before calling to the real_dev. Signed-off-by: Shannon Nelson --- net/8021q/vlan_dev.c | 98 1 file changed, 98 insertions(+) diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c index ff720f1..a46e056 100644 --- a/net/8021q/vlan_dev.c +++ b/net/8021q/vlan_dev.c @@ -32,6 +32,7 @@ #include #include #include +#include #include "vlan.h" #include "vlanproc.h" @@ -546,6 +547,7 @@ static struct device_type vlan_type = { }; static const struct net_device_ops vlan_netdev_ops; +static const struct xfrmdev_ops vlan_xfrmdev_ops; static int vlan_dev_init(struct net_device *dev) { @@ -553,6 +555,7 @@ static int vlan_dev_init(struct net_device *dev) netif_carrier_off(dev); + /* copy netdev features into list of user selectable features */ /* IFF_BROADCAST|IFF_MULTICAST; ??? */ dev->flags = real_dev->flags & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI | IFF_MASTER | IFF_SLAVE); @@ -564,6 +567,12 @@ static int vlan_dev_init(struct net_device *dev) NETIF_F_FRAGLIST | NETIF_F_GSO_SOFTWARE | NETIF_F_HIGHDMA | NETIF_F_SCTP_CRC | NETIF_F_ALL_FCOE; +#ifdef CONFIG_XFRM_OFFLOAD + dev->hw_features |= real_dev->hw_features & (NETIF_F_HW_ESP | +NETIF_F_HW_ESP_TX_CSUM | +NETIF_F_GSO_ESP); + dev->xfrmdev_ops = &vlan_xfrmdev_ops; +#endif dev->features |= dev->hw_features | NETIF_F_LLTX; dev->gso_max_size = real_dev->gso_max_size; @@ -767,6 +776,95 @@ static int vlan_dev_get_iflink(const struct net_device *dev) return real_dev->ifindex; } +static int vlan_xdo_state_add(struct xfrm_state *xs) +{ + struct net_device *dev = xs->xso.dev; + struct net_device *real_dev = vlan_dev_priv(dev)->real_dev; + int ret; + + /* swap out the vlan dev and put a hold on the real device */ + xs->xso.dev = real_dev; + dev_hold(real_dev); + + ret = real_dev->xfrmdev_ops->xdo_dev_state_add(xs); + + /* if it wasn't successful, release the real_dev */ + if (ret) + dev_put(real_dev); + + /* put dev back before we leave */ + xs->xso.dev = dev; + + return ret; +} + +static void vlan_xdo_state_delete(struct xfrm_state *xs) +{ + struct net_device *dev = xs->xso.dev; + struct net_device *real_dev = vlan_dev_priv(dev)->real_dev; + + xs->xso.dev = real_dev; + real_dev->xfrmdev_ops->xdo_dev_state_delete(xs); + xs->xso.dev = dev; +} + +static void vlan_xdo_state_free(struct xfrm_state *xs) +{ + struct net_device *dev = xs->xso.dev; + struct net_device *real_dev = vlan_dev_priv(dev)->real_dev; + + /* check for optional interface */ + if (real_dev->xfrmdev_ops->xdo_dev_state_free) { + xs->xso.dev = real_dev; + real_dev->xfrmdev_ops->xdo_dev_state_free(xs); + xs->xso.dev = dev; + } + + /* let go of the real device, we're done with it */ + dev_put(real_dev); +} + +static bool vlan_xdo_offload_ok(struct sk_buff *skb, struct xfrm_state *xs) +{ + struct net_device *dev = xs->xso.dev; + struct net_device *real_dev = vlan_dev_priv(dev)->real_dev; + bool ret = true; + + /* check for optional interface */ + if (likely(real_dev->xfrmdev_ops->xdo_dev_offload_ok)) { + xs->xso.dev = real_dev; + ret = real_dev->xfrmdev_ops->xdo_dev_offload_ok(skb, xs); + xs->xso.dev = dev; + } + + if (!ret) + netdev_err(dev, "%s: real_dev %s NAKd the offload\n", + __func__, real_dev->name); + + return ret; +} + +static void vlan_xdo_advance_esn(struct xfrm_state *xs) +{ + struct net_device *dev = xs->xso.dev; + struct net_device *real_dev = vlan_dev_priv(dev)->real_dev; + + /* check for optional interface */ + if (real_dev->xfrmdev_ops->xdo_dev_state_advance_esn) { + xs->xso.dev = real_dev; + real_dev->xfrmdev_ops->xdo_dev_state_advance_esn(xs); + xs->xso.dev = dev; + } +} + +static const struct xfrmdev_ops vlan_xfrmdev_ops = { + .xdo_dev_state_add = vlan_xdo_state_add, + .xdo_dev_state_delete = vlan_xdo_state_delete, + .xdo_dev_state_free = vlan_xdo_state_free, + .xdo_dev_offload_ok = vlan_xdo_offload_ok, + .xdo_dev_state_advance_esn = vlan_xdo_advance_esn, +}; + static const struct ethtool_ops vlan_ethtool_ops = { .get_link
Re: [PATCH] bpf: tcp_bpf_recvmsg should return EAGAIN when nonblocking and no data
On 10/29/2018 08:31 PM, John Fastabend wrote: > We return 0 in the case of a nonblocking socket that has no data > available. However, this is incorrect and may confuse applications. > After this patch we do the correct thing and return the error > EAGAIN. > > Quoting return codes from recvmsg manpage, > > EAGAIN or EWOULDBLOCK > The socket is marked nonblocking and the receive operation would > block, or a receive timeout had been set and the timeout expired > before data was received. > > Signed-off-by: John Fastabend Applied to bpf, thanks!
Re: [PATCH bpf] tools/bpf: add unlimited rlimit for flow_dissector_load
On 10/29/2018 10:56 PM, Yonghong Song wrote: > On our test machine, bpf selftest test_flow_dissector.sh failed > with the following error: > # ./test_flow_dissector.sh > bpffs not mounted. Mounting... > libbpf: failed to create map (name: 'jmp_table'): Operation not permitted > libbpf: failed to load object 'bpf_flow.o' > ./flow_dissector_load: bpf_prog_load bpf_flow.o > selftests: test_flow_dissector [FAILED] > > Let us increase the rlimit to remove the above map > creation failure. > > Signed-off-by: Yonghong Song Applied to bpf, thanks!
Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable
On 10/30/18 12:31 PM, David Miller wrote: > From: Jeff Barnhill <0xeff...@gmail.com> > Date: Tue, 30 Oct 2018 07:10:58 -0400 > >> I originally started implementing it the way you suggested; however, >> it seemed to complicate management of that structure because it isn't >> currently using rcu. Also, assuming that can be worked out, where >> would I get the net from? Would I need to store a copy in ifcaddr6, >> or is there some way to access it during ipv6_chk_acast_addr()? It >> seems that if I don't add a copy of net, but instead access it through >> aca_rt(?), then freeing the ifcaddr6 memory becomes problematic >> (detaching it from idev, while read_rcu may still be accessing it). >> On Mon, Oct 29, 2018 at 11:32 PM David Miller wrote: > > I don't think converting the structure over to RCU, especially because > all of the read paths (everything leading to ipv6_chk_acast_dev()) are > taking RCU locks already. > > And I cannot understand how having _two_ structures to manage a piece > of information can be less complicated than just one. > > You can add a backpointer to the 'idev' in ifacaddr6 to get at the > network namespace. You don't even need to do additional reference > counting because the idev->ac_list is always purged before an idev > is destroyed. > or make the table per namespace.
[PATCH iproute2-next] ip rule: Add ipproto and port range to filter list
From: David Ahern Allow ip rule dumps and flushes to filter based on ipproto, sport and dport. Example: $ ip ru ls ipproto udp 99: from all to 8.8.8.8 ipproto udp dport 53 lookup 1001 $ ip ru ls dport 53 99: from all to 8.8.8.8 ipproto udp dport 53 lookup 1001 Signed-off-by: David Ahern --- ip/iprule.c | 66 + 1 file changed, 66 insertions(+) diff --git a/ip/iprule.c b/ip/iprule.c index a85a43904e6e..0713694d4a49 100644 --- a/ip/iprule.c +++ b/ip/iprule.c @@ -78,6 +78,9 @@ static struct inet_prefix dst; int protocol; int protocolmask; + struct fib_rule_port_range sport; + struct fib_rule_port_range dport; + __u8 ipproto; } filter; static inline int frh_get_table(struct fib_rule_hdr *frh, struct rtattr **tb) @@ -174,6 +177,39 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len) return false; } + if (filter.ipproto) { + __u8 ipproto = 0; + + if (tb[FRA_IP_PROTO]) + ipproto = rta_getattr_u8(tb[FRA_IP_PROTO]); + if (filter.ipproto != ipproto) + return false; + } + + if (filter.sport.start) { + const struct fib_rule_port_range *r; + + if (!tb[FRA_SPORT_RANGE]) + return false; + + r = RTA_DATA(tb[FRA_SPORT_RANGE]); + if (r->start != filter.sport.start || + r->end != filter.sport.end) + return false; + } + + if (filter.dport.start) { + const struct fib_rule_port_range *r; + + if (!tb[FRA_DPORT_RANGE]) + return false; + + r = RTA_DATA(tb[FRA_DPORT_RANGE]); + if (r->start != filter.dport.start || + r->end != filter.dport.end) + return false; + } + table = frh_get_table(frh, tb); if (filter.tb > 0 && filter.tb ^ table) return false; @@ -607,6 +643,36 @@ static int iprule_list_flush_or_save(int argc, char **argv, int action) filter.protocolmask = 0; } filter.protocol = prot; + } else if (strcmp(*argv, "ipproto") == 0) { + int ipproto; + + NEXT_ARG(); + ipproto = inet_proto_a2n(*argv); + if (ipproto < 0) + invarg("Invalid \"ipproto\" value\n", *argv); + filter.ipproto = ipproto; + } else if (strcmp(*argv, "sport") == 0) { + struct fib_rule_port_range r; + int ret = 0; + + NEXT_ARG(); + ret = sscanf(*argv, "%hu-%hu", &r.start, &r.end); + if (ret == 1) + r.end = r.start; + else if (ret != 2) + invarg("invalid port range\n", *argv); + filter.sport = r; + } else if (strcmp(*argv, "dport") == 0) { + struct fib_rule_port_range r; + int ret = 0; + + NEXT_ARG(); + ret = sscanf(*argv, "%hu-%hu", &r.start, &r.end); + if (ret == 1) + r.end = r.start; + else if (ret != 2) + invarg("invalid dport range\n", *argv); + filter.dport = r; } else{ if (matches(*argv, "dst") == 0 || matches(*argv, "to") == 0) { -- 2.11.0
[PATCH iproute2] ip rule: Require at least one argument for add
From: David Ahern 'ip rule add' with no additional arguments just adds another rule for the main table - which exists by default. Require at least 1 argument similar to delete. Signed-off-by: David Ahern --- ip/iprule.c | 5 + 1 file changed, 5 insertions(+) diff --git a/ip/iprule.c b/ip/iprule.c index d89d808d8909..b465a80785b1 100644 --- a/ip/iprule.c +++ b/ip/iprule.c @@ -691,6 +691,11 @@ static int iprule_modify(int cmd, int argc, char **argv) }; if (cmd == RTM_NEWRULE) { + if (argc == 0) { + fprintf(stderr, + "\"ip rule add\" requires arguments.\n"); + return -1; + } req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL; req.frh.action = FR_ACT_TO_TBL; } -- 2.11.0
[PATCH iproute2] ip rule: Honor filter arguments on flush
From: David Ahern 'ip ru flush' currently removes all rules with priority > 0 regardless of any other command line arguments passed in. Update flush_rule to call filter_nlmsg to determine if the rule should be flushed or not. This enables rule flushing such as 'ip ru flush table 1001' and 'ip ru flush pref 99'. Signed-off-by: David Ahern --- ip/iprule.c | 5 + 1 file changed, 5 insertions(+) diff --git a/ip/iprule.c b/ip/iprule.c index b465a80785b1..a85a43904e6e 100644 --- a/ip/iprule.c +++ b/ip/iprule.c @@ -461,6 +461,7 @@ static int flush_rule(struct nlmsghdr *n, void *arg) struct fib_rule_hdr *frh = NLMSG_DATA(n); int len = n->nlmsg_len; struct rtattr *tb[FRA_MAX+1]; + int host_len = -1; len -= NLMSG_LENGTH(sizeof(*frh)); if (len < 0) @@ -468,6 +469,10 @@ static int flush_rule(struct nlmsghdr *n, void *arg) parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len); + host_len = af_bit_len(frh->family); + if (!filter_nlmsg(n, tb, host_len)) + return 0; + if (tb[FRA_PROTOCOL]) { __u8 protocol = rta_getattr_u8(tb[FRA_PROTOCOL]); -- 2.11.0
RE: [PATCH net] net: ethernet: cadence: fix socket buffer corruption problem
> Could you check on your side that applying this on top of your patch, your > scenario is still working? > > diff --git a/drivers/net/ethernet/cadence/macb_main.c > b/drivers/net/ethernet/cadence/macb_main.c > index 1d86b4d5645a..e1347d6d1b50 100644 > --- a/drivers/net/ethernet/cadence/macb_main.c > +++ b/drivers/net/ethernet/cadence/macb_main.c > @@ -1702,12 +1702,8 @@ static int macb_pad_and_fcs(struct sk_buff **skb, > struct net_device *ndev) > *skb = nskb; > } > > - if (padlen) { > - if (padlen >= ETH_FCS_LEN) > - skb_put_zero(*skb, padlen - ETH_FCS_LEN); > - else > - skb_trim(*skb, ETH_FCS_LEN - padlen); > - } > + if (padlen > ETH_FCS_LEN) > + skb_put_zero(*skb, padlen - ETH_FCS_LEN); I think it is okay but I need to check all paths are covered. > It was reported in [1] that UDP checksum is offloaded to hardware no matter > the application previously computed it. > > The code should be executed only for packets that has checksum computed > by > applications ((*skb)->ip_summed != CHECKSUM_PARTIAL). The idea was to > not > recompute checksum for packets with checksum already computed. To do > so, > while hardware checksum is enabled (NETIF_F_HW_CSUM), TX_NOCRC bit > should > be set on buffer descriptor. But to do so, packets must have a minimum size > of 64 and FCS to be computed. > > The NETIF_F_HW_CSUM check was placed there because the issue > described in > [1] is reproducible because hardware checksum is enabled and overrides the > checksum provided by applications. > > [1] https://www.spinics.net/lists/netdev/msg505065.html I understand the issue now. It is weird that the transmit descriptor does not have direct control over turning on checksum generation or not, but it wastes 3 bits returning the error codes of such generation. What can the driver do with such information? In my opinion then hardware transmit checksumming cannot be supported In Linux. > > NETIF_F_SG is not enabled in the MAC I used, so enabling > NETIF_IF_HW_CSUM > > is rather pointless. With the padding code the transmit throughput cannot > get > > higher than 100Mbps in a gigabit connection. > > > > I would recommend to add this option to disable manual padding in one of > those > > macb_config structures. > > In this way the user would have to know from the beginning what kind of > packets are used. > The kernel already does a good job of calculating checksum. Using hardware to do that does not improve performance much. Alternative is to use ethtool to disable hardware tx checksum so that software can intentionally send wrong checksums.
Re: [PATCH v4.14-stable] sch_netem: restore skb->dev after dequeuing from the rbtree
Greg, On Thu, Oct 18, 2018 at 03:43:48PM -0700, David Miller wrote: > From: Christoph Paasch > Date: Thu, 18 Oct 2018 13:38:40 -0700 > > > Upstream commit bffa72cf7f9d ("net: sk_buff rbnode reorg") got > > backported as commit 6b921536f170 ("net: sk_buff rbnode reorg") into the > > v4.14.x-tree. > > > > However, the backport does not include the changes in sch_netem.c > > > > We need these, as otherwise the skb->dev pointer is not set when > > dequeueing from the netem rbtree, resulting in a panic: > ... > > Fixes: 6b921536f170 ("net: sk_buff rbnode reorg") > > Cc: Stephen Hemminger > > Cc: Eric Dumazet > > Cc: Soheil Hassas Yeganeh > > Cc: Wei Wang > > Cc: Willem de Bruijn > > Signed-off-by: Christoph Paasch > > --- > > > > Notes: > > This patch should only make it into v4.14-stable as that's the only > > branch where > > the offending commit has been backported to. > > Greg, please queue up. Are you planing to queue this one ? Looks to me it was a miss on the backport. It seams that the backport was touching different files, and missed the change on net/sched/sch_netem.c. So, to me, even if this patch may not follow the strictly the rules of stable, as it is not a patch in upstream, seams to be a needed change, even if it is specific to stable linux-4.14.y. > -- All the best, Eduardo Valentin
Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS
On Tue, Oct 30, 2018 at 11:59 AM Cong Wang wrote: > I wonder how compiler recognizes it as "never fail" when marked with > __must_check. __must_check means that you can not ignore the return value of a function. Here we do use the return value. Also prior code was not checking skb->length so really my patch does add explicit check if in the future skb->len gets wrong after a bug is added in this driver. (A NULL deref will trap the bug, instead of potentially reading garbage from skb->data)
Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS
On Tue, Oct 30, 2018 at 11:42 AM Eric Dumazet wrote: > > On Tue, Oct 30, 2018 at 11:09 AM Cong Wang wrote: > > > At least skb_header_pointer() is marked as __must_check, I don't see > > you check its return value here. > > This can not fail here. > > skb->length must be above 14+4 at this point. Never say it is wrong, just saying what compiler thinks. > > My compiler seems to be okay with that. I wonder how compiler recognizes it as "never fail" when marked with __must_check.
Re: [Patch net] net: make pskb_trim_rcsum_slow() robust
On Mon, Oct 29, 2018 at 8:08 PM Eric Dumazet wrote: > > > > On 10/29/2018 07:41 PM, Cong Wang wrote: > > On Mon, Oct 29, 2018 at 7:25 PM Eric Dumazet wrote: > >> > >> > >> > >> On 10/29/2018 07:21 PM, Cong Wang wrote: > >>> On Mon, Oct 29, 2018 at 7:14 PM Eric Dumazet > >>> wrote: > > Would not it be simpler to set ip_summed to CHECKSUM_NONE (no need to > save old_csum) ? > >>> > >>> For !CHECKSUM_COMPLETE, ip_summed should be untouched, right? > >>> > >>> If you mean only setting to CHECKSUM_NONE for CHECKSUM_COMPLETE case, > >>> the end result may not be simpler. > >> > >> I meant to reinstate what was there before my patch in this error case > >> > >>if (skb->ip_summed == CHECKSUM_COMPLETE) > >>skb->ip_summed = CHECKSUM_NONE; > >> > >> That would only be run in error (quite unlikely) path, instead of saving > >> old_csum in all cases. > > > > I know your point, however, I am not sure that is a desired behavior. > > > > On failure, I think the whole skb should be restored to its previous state > > before entering this function, changing it to CHECKSUM_NONE on failure > > is inconsistent with success case. > > > > Before my patch, we were changing skb->ip_summed to CHECKSUM_NONE, > so why suddenly we need to be consistent ? That is because setting it to CHECKSUM_NONE _was_ how the success case works and nothing _was_ needed for failure case. You changed how we handle checksum for success case, it is why we need to change for the failure case too. > > In any case, ip_check_defrag() should really drop this skb, as for other > allocation > failures (like skb_share_check()), if really we want consistency. I have the same feeling, just not brave enough to change the logic of ip_check_defrag() where pskb_may_pull() failure is treated in a same way.
Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS
From: Cong Wang Date: Tue, 30 Oct 2018 11:09:21 -0700 > On Tue, Oct 30, 2018 at 12:57 AM Eric Dumazet wrote: >> -static __be32 mlx5e_get_fcs(struct sk_buff *skb) >> +static u32 mlx5e_get_fcs(const struct sk_buff *skb) >> { >> - int last_frag_sz, bytes_in_prev, nr_frags; >> - u8 *fcs_p1, *fcs_p2; >> - skb_frag_t *last_frag; >> - __be32 fcs_bytes; >> + const void *fcs_bytes; >> + u32 _fcs_bytes; >> >> - if (!skb_is_nonlinear(skb)) >> - return *(__be32 *)(skb->data + skb->len - ETH_FCS_LEN); >> + fcs_bytes = skb_header_pointer(skb, skb->len - ETH_FCS_LEN, >> + ETH_FCS_LEN, &_fcs_bytes); > > At least skb_header_pointer() is marked as __must_check, I don't see > you check its return value here. In this case we reasonably know it is guaranteed to succeed though. We know skb->len is non-zero and we are asking for the skb->len - 1 byte or something like that.
Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS
On Tue, Oct 30, 2018 at 11:09 AM Cong Wang wrote: > At least skb_header_pointer() is marked as __must_check, I don't see > you check its return value here. This can not fail here. skb->length must be above 14+4 at this point. My compiler seems to be okay with that.
Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable
From: Jeff Barnhill <0xeff...@gmail.com> Date: Tue, 30 Oct 2018 07:10:58 -0400 > I originally started implementing it the way you suggested; however, > it seemed to complicate management of that structure because it isn't > currently using rcu. Also, assuming that can be worked out, where > would I get the net from? Would I need to store a copy in ifcaddr6, > or is there some way to access it during ipv6_chk_acast_addr()? It > seems that if I don't add a copy of net, but instead access it through > aca_rt(?), then freeing the ifcaddr6 memory becomes problematic > (detaching it from idev, while read_rcu may still be accessing it). > On Mon, Oct 29, 2018 at 11:32 PM David Miller wrote: I don't think converting the structure over to RCU, especially because all of the read paths (everything leading to ipv6_chk_acast_dev()) are taking RCU locks already. And I cannot understand how having _two_ structures to manage a piece of information can be less complicated than just one. You can add a backpointer to the 'idev' in ifacaddr6 to get at the network namespace. You don't even need to do additional reference counting because the idev->ac_list is always purged before an idev is destroyed.
Re: [PATCH net] net/mlx4_en: add a missing include
From: Eric Dumazet Date: Tue, 30 Oct 2018 00:18:12 -0700 > Abdul Haleem reported a build error on ppc : > > drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: `struct > iphdr` declared inside parameter list [enabled by default] >struct iphdr *iph) > ^ > drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is > only this definition or declaration, which is probably not what you want > [enabled by default] > drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function > get_fixed_ipv4_csum: > drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing > pointer to incomplete type > __u8 ipproto = iph->protocol; > ^ > > Fixes: 55469bc6b577 ("drivers: net: remove inclusion when > not needed") > Signed-off-by: Eric Dumazet > Reported-by: Abdul Haleem Applied, thanks Eric.
Re: [PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS
On Tue, Oct 30, 2018 at 12:57 AM Eric Dumazet wrote: > -static __be32 mlx5e_get_fcs(struct sk_buff *skb) > +static u32 mlx5e_get_fcs(const struct sk_buff *skb) > { > - int last_frag_sz, bytes_in_prev, nr_frags; > - u8 *fcs_p1, *fcs_p2; > - skb_frag_t *last_frag; > - __be32 fcs_bytes; > + const void *fcs_bytes; > + u32 _fcs_bytes; > > - if (!skb_is_nonlinear(skb)) > - return *(__be32 *)(skb->data + skb->len - ETH_FCS_LEN); > + fcs_bytes = skb_header_pointer(skb, skb->len - ETH_FCS_LEN, > + ETH_FCS_LEN, &_fcs_bytes); At least skb_header_pointer() is marked as __must_check, I don't see you check its return value here.
Re: Latest net-next kernel 4.19.0+
On Tue, Oct 30, 2018 at 10:50 AM Eric Dumazet wrote: > > > > On 10/30/2018 10:32 AM, Cong Wang wrote: > > > Unlike Pawel's case, we don't use vlan at all, maybe this is why we see > > it much less frequently than Pawel. > > > > Also, it is probably not specific to mlx5, as there is another report which > > is probably a non-mlx5 driver. > > Not sure if you provided a stack trace ? I said it is the same with Pawel's. Here it is anyway: [ 3731.075989] eth0: hw csum failure [ 3731.079316] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 4.14.74.x86_64 #1 [ 3731.086703] Hardware name: Wiwynn F4WW/Y 300-0284/F4WW MAIN BOARD, BIOS F4WWP02 10/19/2018 [ 3731.094961] Call Trace: [ 3731.097408] [ 3731.099432] dump_stack+0x46/0x59 [ 3731.102751] __skb_checksum_complete+0xb8/0xd0 [ 3731.107194] tcp_v4_rcv+0x116/0xa30 [ 3731.110688] ip_local_deliver_finish+0x5d/0x1f0 [ 3731.115218] ip_local_deliver+0x6b/0xe0 [ 3731.119056] ? ip_rcv_finish+0x400/0x400 [ 3731.122973] ip_rcv+0x287/0x360 [ 3731.126112] ? inet_del_offload+0x40/0x40 [ 3731.130124] __netif_receive_skb_core+0x404/0xc10 [ 3731.134831] ? netif_receive_skb_internal+0x34/0xd0 [ 3731.139709] netif_receive_skb_internal+0x34/0xd0 [ 3731.144415] napi_gro_receive+0xb8/0xe0 [ 3731.148271] mlx5e_handle_rx_cqe_mpwrq+0x4e3/0x7f0 [mlx5_core] [ 3731.154099] ? enqueue_entity+0x103/0x7f0 [ 3731.158114] mlx5e_poll_rx_cq+0xba/0x850 [mlx5_core] [ 3731.163080] mlx5e_napi_poll+0x91/0x290 [mlx5_core] [ 3731.167955] net_rx_action+0x14a/0x3e0 [ 3731.171707] ? credit_entropy_bits+0x23d/0x260 [ 3731.176153] __do_softirq+0xe2/0x2c3 [ 3731.179734] irq_exit+0xbc/0xd0 [ 3731.182878] do_IRQ+0x89/0xd0 [ 3731.185851] common_interrupt+0x7a/0x7a [ 3731.189690] [ 3731.191799] RIP: 0010:cpuidle_enter_state+0xa6/0x2d0 [ 3731.196761] RSP: 0018:bb950c6f7eb0 EFLAGS: 0246 ORIG_RAX: ff60 [ 3731.204328] RAX: 9fe25fbe14c0 RBX: 0364b57553af RCX: 001f [ 3731.211459] RDX: 20c49ba5e353f7cf RSI: 68294248f469 RDI: [ 3731.218583] RBP: db7d003c3300 R08: c3be R09: 8612 [ 3731.225709] R10: bb950c6f7e98 R11: c3be R12: 0003 [ 3731.232841] R13: 912c9d18 R14: R15: 0364b396207a [ 3731.239968] do_idle+0x166/0x1a0 [ 3731.243199] cpu_startup_entry+0x6f/0x80 [ 3731.247128] start_secondary+0x19c/0x1f0 [ 3731.251052] secondary_startup_64+0xa5/0xb0 > > Have you tried IPv6 frags maybe ? > We have no IPv6 traffic. I asked people to try to generate IPv4 fragment traffic to see if it would be more reproducible, no progress yet.
Re: Latest net-next kernel 4.19.0+
On 10/30/2018 10:32 AM, Cong Wang wrote: > Unlike Pawel's case, we don't use vlan at all, maybe this is why we see > it much less frequently than Pawel. > > Also, it is probably not specific to mlx5, as there is another report which > is probably a non-mlx5 driver. Not sure if you provided a stack trace ? Have you tried IPv6 frags maybe ?
Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed
On Tue, 30 Oct 2018 10:34:45 -0600 David Ahern wrote: > A more flexible approach is to use format strings to allow users to > customize the output order and whitespace as well. So for ss and your > column list (winging it here): > > netid = %N > state = %S > recv Q = %Qr > send Q = %Qs > local address = %Al > lport port = %Pl > remote address = %Ar > remote port= %Pr > process data = %p > ... > > then a format string could be: "%S %Qr %Qs %Al:%Pl %Ar:%Pr %p\n" I like the idea indeed, but I see two issues with ss: - the current column abstraction is rather lightweight, things are already buffered in the defined column order so we don't have to jump back and forth in the buffer while rendering. Doing that needs some extra care to avoid a performance hit, but it's probably doable, I can put that on my to-do list - how would you model automatic spacing in a format string? Should we support width specifiers? Disable automatic spacing if a format string is given? It might even make sense to allow partial automatic spacing with a special character in the format string, that is: "%S.%Qr.%Qs %Al:%Pl %Ar:%Pr %p\n" would mean "align everything to the right, distribute remaining whitespace between %S, %Qr and %Qs". But it looks rather complicated at a glance. -- Stefano
Re: Latest net-next kernel 4.19.0+
On Tue, Oct 30, 2018 at 7:16 AM Eric Dumazet wrote: > > > > On 10/30/2018 01:09 AM, Paweł Staszewski wrote: > > > > > > W dniu 30.10.2018 o 08:29, Eric Dumazet pisze: > >> > >> On 10/29/2018 11:09 PM, Dimitris Michailidis wrote: > >> > >>> Indeed this is a bug. I would expect it to produce frequent errors > >>> though as many odd-length > >>> packets would trigger it. Do you have RXFCS? Regardless, how > >>> frequently do you see the problem? > >>> > >> Old kernels (before 88078d98d1bb) were simply resetting ip_summed to > >> CHECKSUM_NONE > >> > >> And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the > >> bug you fixed. > >> > >> So we now need to also fix mlx5. > >> > >> And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned > >> earlier, > >> plus __get_unaligned_cpu32() as you hinted. > >> > >> > >> > >> > > > > No RXFCS Same with Pawel, RXFCS is disabled by default. > > > > And this trace is rly frequently like once per 3/4 seconds > > like below: > > [28965.776864] vlan1490: hw csum failure > > Might be vlan related. Unlike Pawel's case, we don't use vlan at all, maybe this is why we see it much less frequently than Pawel. Also, it is probably not specific to mlx5, as there is another report which is probably a non-mlx5 driver. Thanks.
[RFC PATCH v3 10/10] selftests: add functionals test for UDP GRO
Extends the existing udp programs to allow checking for proper GRO aggregation/GSO size, and run the tests via a shell script, using a veth pair with XDP program attached to trigger the GRO code path. rfc v2 -> rfc v3: - add missing test program options documentation - fix sporatic test failures (receiver faster than sender) Signed-off-by: Paolo Abeni --- tools/testing/selftests/net/Makefile | 2 +- tools/testing/selftests/net/udpgro.sh | 147 ++ tools/testing/selftests/net/udpgro_bench.sh | 8 +- tools/testing/selftests/net/udpgso_bench.sh | 2 +- tools/testing/selftests/net/udpgso_bench_rx.c | 123 +-- tools/testing/selftests/net/udpgso_bench_tx.c | 22 ++- 6 files changed, 281 insertions(+), 23 deletions(-) create mode 100755 tools/testing/selftests/net/udpgro.sh diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index ac999354af54..a8a0d256aafb 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -7,7 +7,7 @@ CFLAGS += -I../../../../usr/include/ TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh rtnetlink.sh TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh -TEST_PROGS += udpgro_bench.sh +TEST_PROGS += udpgro_bench.sh udpgro.sh TEST_PROGS_EXTENDED := in_netns.sh TEST_GEN_FILES = socket TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy diff --git a/tools/testing/selftests/net/udpgro.sh b/tools/testing/selftests/net/udpgro.sh new file mode 100755 index ..3f12b72a3568 --- /dev/null +++ b/tools/testing/selftests/net/udpgro.sh @@ -0,0 +1,147 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Run a series of udpgro functional tests. + +readonly PEER_NS="ns-peer-$(mktemp -u XX)" + +cleanup() { + local -r jobs="$(jobs -p)" + local -r ns="$(ip netns list|grep $PEER_NS)" + + [ -n "${jobs}" ] && kill -1 ${jobs} 2>/dev/null + [ -n "$ns" ] && ip netns del $ns 2>/dev/null +} +trap cleanup EXIT + +cfg_veth() { + ip netns add "${PEER_NS}" + ip -netns "${PEER_NS}" link set lo up + ip link add type veth + ip link set dev veth0 up + ip addr add dev veth0 192.168.1.2/24 + ip addr add dev veth0 2001:db8::2/64 nodad + + ip link set dev veth1 netns "${PEER_NS}" + ip -netns "${PEER_NS}" addr add dev veth1 192.168.1.1/24 + ip -netns "${PEER_NS}" addr add dev veth1 2001:db8::1/64 nodad + ip -netns "${PEER_NS}" link set dev veth1 up +} + +run_one() { + # use 'rx' as separator between sender args and receiver args + local -r all="$@" + local -r tx_args=${all%rx*} + local -r rx_args=${all#*rx} + + cfg_veth + + ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} && \ + echo "ok" || \ + echo "failed" & + + # Hack: let bg programs complete the startup + sleep 0.1 + ./udpgso_bench_tx ${tx_args} + wait $(jobs -p) +} + +run_test() { + local -r args=$@ + + printf " %-40s" "$1" + ./in_netns.sh $0 __subprocess $2 rx -G -r -x veth1 $3 +} + +run_one_nat() { + # use 'rx' as separator between sender args and receiver args + local addr1 addr2 pid family="" ipt_cmd=ip6tables + local -r all="$@" + local -r tx_args=${all%rx*} + local -r rx_args=${all#*rx} + + if [[ ${tx_args} = *-4* ]]; then + ipt_cmd=iptables + family=-4 + addr1=192.168.1.1 + addr2=192.168.1.3/24 + else + addr1=2001:db8::1 + addr2="2001:db8::3/64 nodad" + fi + + cfg_veth + ip -netns "${PEER_NS}" addr add dev veth1 ${addr2} + + # fool the GRO engine changing the destination address ... + ip netns exec "${PEER_NS}" $ipt_cmd -t nat -I PREROUTING -d ${addr1} -j DNAT --to-destination ${addr2%/*} + + # ... so that GRO will match the UDP_GRO enabled socket, but packets + # will land on the 'plain' one + ip netns exec "${PEER_NS}" ./udpgso_bench_rx -G ${family} -x veth1 -b ${addr1} -n 0 & + pid=$! + ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${family} -b ${addr2%/*} ${rx_args} && \ + echo "ok" || \ + echo "failed"& + + sleep 0.1 + ./udpgso_bench_tx ${tx_args} + kill -INT $pid + wait $(jobs -p) +} + +run_nat_test() { + local -r args=$@ + + printf " %-40s" "$1" + ./in_netns.sh $0 __subprocess_nat $2 rx -r $3 +} + +run_all() { + local -r core_args="-l 4" + local -r ipv4_args="${core_args} -4 -D 192.168.1.1" + local -r ipv6_args="${core_args} -6 -D 2001:db8::1" + + echo "ipv4" + run_test "no GRO" "${ipv4_args} -M 10 -s 1400" "-4 -n 10 -l 1400" + + # explicitly check we are not receiving UDP_SEGMENT cms
[RFC PATCH v3 08/10] selftests: conditionally enable XDP support in udpgso_bench_rx
XDP support will be used by a later patch to test the GRO path in a net namespace, leveraging the veth XDP implementation. To avoid breaking existing setup, XDP support is conditionally enabled and build only if llc is locally available. rfc v2 -> rfc v3: - move 'x' option handling here Signed-off-by: Paolo Abeni --- tools/testing/selftests/net/Makefile | 69 +++ tools/testing/selftests/net/udpgso_bench_rx.c | 41 ++- tools/testing/selftests/net/xdp_dummy.c | 13 3 files changed, 121 insertions(+), 2 deletions(-) create mode 100644 tools/testing/selftests/net/xdp_dummy.c diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 256d82d5fa87..176459b7c4d6 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -16,8 +16,77 @@ TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls KSFT_KHDR_INSTALL := 1 + +# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: +# make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang +LLC ?= llc +CLANG ?= clang +LLVM_OBJCOPY ?= llvm-objcopy +BTF_PAHOLE ?= pahole +HAS_LLC := $(shell which $(LLC) 2>/dev/null) + +# conditional enable testes requiring llc +ifneq (, $(HAS_LLC)) +TEST_GEN_FILES += xdp_dummy.o +endif + include ../lib.mk +ifneq (, $(HAS_LLC)) + +# Detect that we're cross compiling and use the cross compiler +ifdef CROSS_COMPILE +CLANG_ARCH_ARGS = -target $(ARCH) +endif + +PROBE := $(shell $(LLC) -march=bpf -mcpu=probe -filetype=null /dev/null 2>&1) + +# Let newer LLVM versions transparently probe the kernel for availability +# of full BPF instruction set. +ifeq ($(PROBE),) + CPU ?= probe +else + CPU ?= generic +endif + +SRC_PATH := $(abspath ../../../..) +LIB_PATH := $(SRC_PATH)/tools/lib +XDP_CFLAGS := -D SUPPORT_XDP=1 -I$(LIB_PATH) +LIBBPF = $(LIB_PATH)/bpf/libbpf.a +BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris) +BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF) +BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 'usage.*llvm') +CLANG_SYS_INCLUDES := $(shell $(CLANG) -v -E - &1 \ +| sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }') +CLANG_FLAGS = -I. -I$(SRC_PATH)/include -I../bpf/ \ + $(CLANG_SYS_INCLUDES) -Wno-compare-distinct-pointer-types + +ifneq ($(and $(BTF_LLC_PROBE),$(BTF_PAHOLE_PROBE),$(BTF_OBJCOPY_PROBE)),) + CLANG_CFLAGS += -g + LLC_FLAGS += -mattr=dwarfris + DWARF2BTF = y +endif + +$(LIBBPF): FORCE +# Fix up variables inherited from Kbuild that tools/ build system won't like + $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(SRC_PATH) O= $(nodir $@) + +$(OUTPUT)/udpgso_bench_rx: $(OUTPUT)/udpgso_bench_rx.c $(LIBBPF) + $(CC) -o $@ $(XDP_CFLAGS) $(CFLAGS) $(LOADLIBES) $(LDLIBS) $^ -lelf + +FORCE: + +# bpf program[s] generation +$(OUTPUT)/%.o: %.c + $(CLANG) $(CLANG_FLAGS) \ +-O2 -target bpf -emit-llvm -c $< -o - | \ + $(LLC) -march=bpf -mcpu=$(CPU) $(LLC_FLAGS) -filetype=obj -o $@ +ifeq ($(DWARF2BTF),y) + $(BTF_PAHOLE) -J $@ +endif + +endif + $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread $(OUTPUT)/tcp_inq: LDFLAGS += -lpthread diff --git a/tools/testing/selftests/net/udpgso_bench_rx.c b/tools/testing/selftests/net/udpgso_bench_rx.c index 8f48d7fb32cf..5dcb719abe04 100644 --- a/tools/testing/selftests/net/udpgso_bench_rx.c +++ b/tools/testing/selftests/net/udpgso_bench_rx.c @@ -31,6 +31,10 @@ #include #include +#ifdef SUPPORT_XDP +#include "bpf/libbpf.h" +#endif + #ifndef UDP_GRO #define UDP_GRO104 #endif @@ -40,6 +44,9 @@ static bool cfg_tcp; static bool cfg_verify; static bool cfg_read_all; static bool cfg_gro_segment; +#ifdef SUPPORT_XDP +static int cfg_xdp_iface; +#endif static bool interrupted; static unsigned long packets, bytes; @@ -202,14 +209,14 @@ static void do_flush_udp(int fd) static void usage(const char *filepath) { - error(1, 0, "Usage: %s [-Grtv] [-p port]", filepath); + error(1, 0, "Usage: %s [-Grtv] [-p port] [-x device]", filepath); } static void parse_opts(int argc, char **argv) { int c; - while ((c = getopt(argc, argv, "Gp:rtv")) != -1) { + while ((c = getopt(argc, argv, "Gp:rtvx:")) != -1) { switch (c) { case 'G': cfg_gro_segment = true; @@ -227,6 +234,13 @@ static void parse_opts(int argc, char **argv) cfg_verify = true; cfg_read_all = true; break; +#ifdef SUPPORT_XDP + case 'x': + cfg_xdp_iface = if_nametoindex(optarg); + if (!cfg_xdp_iface) +
[RFC PATCH v3 07/10] selftests: add GRO support to udp bench rx program
And fix a couple of buglets (port option processing, clean termination on SIGINT). This is preparatory work for GRO tests. rfc v2 -> rfc v3: - use ETH_MAX_MTU Signed-off-by: Paolo Abeni --- tools/testing/selftests/net/udpgso_bench_rx.c | 37 +++ 1 file changed, 30 insertions(+), 7 deletions(-) diff --git a/tools/testing/selftests/net/udpgso_bench_rx.c b/tools/testing/selftests/net/udpgso_bench_rx.c index 727cf67a3f75..8f48d7fb32cf 100644 --- a/tools/testing/selftests/net/udpgso_bench_rx.c +++ b/tools/testing/selftests/net/udpgso_bench_rx.c @@ -31,9 +31,15 @@ #include #include +#ifndef UDP_GRO +#define UDP_GRO104 +#endif + static int cfg_port = 8000; static bool cfg_tcp; static bool cfg_verify; +static bool cfg_read_all; +static bool cfg_gro_segment; static bool interrupted; static unsigned long packets, bytes; @@ -63,6 +69,8 @@ static void do_poll(int fd) do { ret = poll(&pfd, 1, 10); + if (interrupted) + break; if (ret == -1) error(1, errno, "poll"); if (ret == 0) @@ -70,7 +78,7 @@ static void do_poll(int fd) if (pfd.revents != POLLIN) error(1, errno, "poll: 0x%x expected 0x%x\n", pfd.revents, POLLIN); - } while (!ret && !interrupted); + } while (!ret); } static int do_socket(bool do_tcp) @@ -102,6 +110,8 @@ static int do_socket(bool do_tcp) error(1, errno, "listen"); do_poll(accept_fd); + if (interrupted) + exit(0); fd = accept(accept_fd, NULL, NULL); if (fd == -1) @@ -167,10 +177,10 @@ static void do_verify_udp(const char *data, int len) /* Flush all outstanding datagrams. Verify first few bytes of each. */ static void do_flush_udp(int fd) { - static char rbuf[ETH_DATA_LEN]; + static char rbuf[ETH_MAX_MTU]; int ret, len, budget = 256; - len = cfg_verify ? sizeof(rbuf) : 0; + len = cfg_read_all ? sizeof(rbuf) : 0; while (budget--) { /* MSG_TRUNC will make return value full datagram length */ ret = recv(fd, rbuf, len, MSG_TRUNC | MSG_DONTWAIT); @@ -178,7 +188,7 @@ static void do_flush_udp(int fd) return; if (ret == -1) error(1, errno, "recv"); - if (len) { + if (len && cfg_verify) { if (ret == 0) error(1, errno, "recv: 0 byte datagram\n"); @@ -192,23 +202,30 @@ static void do_flush_udp(int fd) static void usage(const char *filepath) { - error(1, 0, "Usage: %s [-tv] [-p port]", filepath); + error(1, 0, "Usage: %s [-Grtv] [-p port]", filepath); } static void parse_opts(int argc, char **argv) { int c; - while ((c = getopt(argc, argv, "ptv")) != -1) { + while ((c = getopt(argc, argv, "Gp:rtv")) != -1) { switch (c) { + case 'G': + cfg_gro_segment = true; + break; case 'p': - cfg_port = htons(strtoul(optarg, NULL, 0)); + cfg_port = strtoul(optarg, NULL, 0); + break; + case 'r': + cfg_read_all = true; break; case 't': cfg_tcp = true; break; case 'v': cfg_verify = true; + cfg_read_all = true; break; } } @@ -227,6 +244,12 @@ static void do_recv(void) fd = do_socket(cfg_tcp); + if (cfg_gro_segment && !cfg_tcp) { + int val = 1; + if (setsockopt(fd, IPPROTO_UDP, UDP_GRO, &val, sizeof(val))) + error(1, errno, "setsockopt UDP_GRO"); + } + treport = gettimeofday_ms() + 1000; do { do_poll(fd); -- 2.17.2
[RFC PATCH v3 02/10] udp: implement GRO for plain UDP sockets.
This is the RX counterpart of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT"). When UDP_GRO is enabled, such socket is also eligible for GRO in the rx path: UDP segments directed to such socket are assembled into a larger GSO_UDP_L4 packet. The core UDP GRO support is enabled with setsockopt(UDP_GRO). Initial benchmark numbers: Before: udp rx: 1079 MB/s 769065 calls/s After: udp rx: 1466 MB/s24877 calls/s This change introduces a side effect in respect to UDP tunnels: after a UDP tunnel creation, now the kernel performs a lookup per ingress UDP packet, while before such lookup happened only if the ingress packet carried a valid internal header csum. rfc v2 -> rfc v3: - fixed typos in macro name and comments - really enforce UDP_GRO_CNT_MAX, instead of UDP_GRO_CNT_MAX + 1 - acquire socket lock in UDP_GRO setsockopt rfc v1 -> rfc v2: - use a new option to enable UDP GRO - use static keys to protect the UDP GRO socket lookup Signed-off-by: Paolo Abeni -- Note: I opted for acquiring the socket lock only for the newly introduced setsockopt instead for every value, despite the previous conversation on this topic, to avoid introducing somewhat larger and unrelated changes. --- include/linux/udp.h | 3 +- include/uapi/linux/udp.h | 1 + net/ipv4/udp.c | 8 +++ net/ipv4/udp_offload.c | 109 +++ net/ipv6/udp_offload.c | 6 +-- 5 files changed, 99 insertions(+), 28 deletions(-) diff --git a/include/linux/udp.h b/include/linux/udp.h index a4dafff407fb..f613b329852e 100644 --- a/include/linux/udp.h +++ b/include/linux/udp.h @@ -50,11 +50,12 @@ struct udp_sock { __u8 encap_type;/* Is this an Encapsulation socket? */ unsigned charno_check6_tx:1,/* Send zero UDP6 checksums on TX? */ no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */ -encap_enabled:1; /* This socket enabled encap +encap_enabled:1, /* This socket enabled encap * processing; UDP tunnels and * different encapsulation layer set * this */ +gro_enabled:1; /* Can accept GRO packets */ /* * Following member retains the information to create a UDP header * when the socket is uncorked. diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h index 09502de447f5..30baccb6c9c4 100644 --- a/include/uapi/linux/udp.h +++ b/include/uapi/linux/udp.h @@ -33,6 +33,7 @@ struct udphdr { #define UDP_NO_CHECK6_TX 101 /* Disable sending checksum for UDP6X */ #define UDP_NO_CHECK6_RX 102 /* Disable accpeting checksum for UDP6 */ #define UDP_SEGMENT103 /* Set GSO segmentation size */ +#define UDP_GRO104 /* This socket can receive UDP GRO packets */ /* UDP encapsulation types */ #define UDP_ENCAP_ESPINUDP_NON_IKE 1 /* draft-ietf-ipsec-nat-t-ike-00/01 */ diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index c51721fb293a..4d4f4d044c28 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -2474,6 +2474,14 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname, up->gso_size = val; break; + case UDP_GRO: + lock_sock(sk); + if (valbool) + udp_tunnel_encap_enable(sk->sk_socket); + up->gro_enabled = valbool; + release_sock(sk); + break; + /* * UDP-Lite's partial checksum coverage (RFC 3828). */ diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c index 802f2bc00d69..0646d61f4fa8 100644 --- a/net/ipv4/udp_offload.c +++ b/net/ipv4/udp_offload.c @@ -343,6 +343,54 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff *skb, return segs; } +#define UDP_GRO_CNT_MAX 64 +static struct sk_buff *udp_gro_receive_segment(struct list_head *head, + struct sk_buff *skb) +{ + struct udphdr *uh = udp_hdr(skb); + struct sk_buff *pp = NULL; + struct udphdr *uh2; + struct sk_buff *p; + + /* requires non zero csum, for symmetry with GSO */ + if (!uh->check) { + NAPI_GRO_CB(skb)->flush = 1; + return NULL; + } + + /* pull encapsulating udp header */ + skb_gro_pull(skb, sizeof(struct udphdr)); + skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr)); + + list_for_each_entry(p, head, list) { + if (!NAPI_GRO_CB(p)->same_flow) + continue; + + uh2 = udp_hdr(p); + + /* Match ports only, as csum is always non zero */ + if ((*(u32 *)&uh->source != *(u32 *)&uh2->source)) { + NAPI_GRO_CB(p)->same_flow = 0; +
[RFC PATCH v3 09/10] selftests: add some benchmark for UDP GRO
Run on top of veth pair, using a dummy XDP program to enable the GRO. Signed-off-by: Paolo Abeni --- tools/testing/selftests/net/Makefile| 1 + tools/testing/selftests/net/udpgro_bench.sh | 92 + 2 files changed, 93 insertions(+) create mode 100755 tools/testing/selftests/net/udpgro_bench.sh diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 176459b7c4d6..ac999354af54 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -7,6 +7,7 @@ CFLAGS += -I../../../../usr/include/ TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh rtnetlink.sh TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh +TEST_PROGS += udpgro_bench.sh TEST_PROGS_EXTENDED := in_netns.sh TEST_GEN_FILES = socket TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy diff --git a/tools/testing/selftests/net/udpgro_bench.sh b/tools/testing/selftests/net/udpgro_bench.sh new file mode 100755 index ..03d37e5e7424 --- /dev/null +++ b/tools/testing/selftests/net/udpgro_bench.sh @@ -0,0 +1,92 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Run a series of udpgro benchmarks + +readonly PEER_NS="ns-peer-$(mktemp -u XX)" + +cleanup() { + local -r jobs="$(jobs -p)" + local -r ns="$(ip netns list|grep $PEER_NS)" + + [ -n "${jobs}" ] && kill -INT ${jobs} 2>/dev/null + [ -n "$ns" ] && ip netns del $ns 2>/dev/null +} +trap cleanup EXIT + +run_one() { + # use 'rx' as separator between sender args and receiver args + local -r all="$@" + local -r tx_args=${all%rx*} + local -r rx_args=${all#*rx} + + ip netns add "${PEER_NS}" + ip -netns "${PEER_NS}" link set lo up + ip link add type veth + ip link set dev veth0 up + ip addr add dev veth0 192.168.1.2/24 + ip addr add dev veth0 2001:db8::2/64 nodad + + ip link set dev veth1 netns "${PEER_NS}" + ip -netns "${PEER_NS}" addr add dev veth1 192.168.1.1/24 + ip -netns "${PEER_NS}" addr add dev veth1 2001:db8::1/64 nodad + ip -netns "${PEER_NS}" link set dev veth1 up + + ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} -r -x veth1 & + ip netns exec "${PEER_NS}" ./udpgso_bench_rx -t ${rx_args} -r & + + # Hack: let bg programs complete the startup + sleep 0.1 + ./udpgso_bench_tx ${tx_args} +} + +run_in_netns() { + local -r args=$@ + + ./in_netns.sh $0 __subprocess ${args} +} + +run_udp() { + local -r args=$@ + + echo "udp gso - over veth touching data" + run_in_netns ${args} -S rx + + echo "udp gso and gro - over veth touching data" + run_in_netns ${args} -S rx -G +} + +run_tcp() { + local -r args=$@ + + echo "tcp - over veth touching data" + run_in_netns ${args} -t rx +} + +run_all() { + local -r core_args="-l 4" + local -r ipv4_args="${core_args} -4 -D 192.168.1.1" + local -r ipv6_args="${core_args} -6 -D 2001:db8::1" + + echo "ipv4" + run_tcp "${ipv4_args}" + run_udp "${ipv4_args}" + + echo "ipv6" + run_tcp "${ipv4_args}" + run_udp "${ipv6_args}" +} + +if [ ! -f xdp_dummy.o ]; then + echo "Skipping GRO benchmarks - missing LLC" + exit 0 +fi + +if [[ $# -eq 0 ]]; then + run_all +elif [[ $1 == "__subprocess" ]]; then + shift + run_one $@ +else + run_in_netns $@ +fi -- 2.17.2
[RFC PATCH v3 01/10] udp: implement complete book-keeping for encap_needed
The *encap_needed static keys are enabled by UDP tunnels and several UDP encapsulations type, but they are never turned off. This can cause unneeded overall performance degradation for systems where such features are used transiently. This patch introduces complete book-keeping for such keys, decreasing the usage at socket destruction time, if needed, and avoiding that the same socket could increase the key usage multiple times. rfc v2 - rfc v3: - use udp_tunnel_encap_enable() in setsockopt() Signed-off-by: Paolo Abeni --- include/linux/udp.h | 7 ++- include/net/udp_tunnel.h | 6 ++ net/ipv4/udp.c | 17 +++-- net/ipv6/udp.c | 14 +- 4 files changed, 32 insertions(+), 12 deletions(-) diff --git a/include/linux/udp.h b/include/linux/udp.h index 320d49d85484..a4dafff407fb 100644 --- a/include/linux/udp.h +++ b/include/linux/udp.h @@ -49,7 +49,12 @@ struct udp_sock { unsigned int corkflag; /* Cork is required */ __u8 encap_type;/* Is this an Encapsulation socket? */ unsigned charno_check6_tx:1,/* Send zero UDP6 checksums on TX? */ -no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */ +no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */ +encap_enabled:1; /* This socket enabled encap + * processing; UDP tunnels and + * different encapsulation layer set + * this + */ /* * Following member retains the information to create a UDP header * when the socket is uncorked. diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h index fe680ab6b15a..3fbe56430e3b 100644 --- a/include/net/udp_tunnel.h +++ b/include/net/udp_tunnel.h @@ -165,6 +165,12 @@ static inline int udp_tunnel_handle_offloads(struct sk_buff *skb, bool udp_csum) static inline void udp_tunnel_encap_enable(struct socket *sock) { + struct udp_sock *up = udp_sk(sock->sk); + + if (up->encap_enabled) + return; + + up->encap_enabled = 1; #if IS_ENABLED(CONFIG_IPV6) if (sock->sk->sk_family == PF_INET6) ipv6_stub->udpv6_encap_enable(); diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index ca3ed931f2a9..c51721fb293a 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -115,6 +115,7 @@ #include "udp_impl.h" #include #include +#include struct udp_table udp_table __read_mostly; EXPORT_SYMBOL(udp_table); @@ -2398,11 +2399,15 @@ void udp_destroy_sock(struct sock *sk) bool slow = lock_sock_fast(sk); udp_flush_pending_frames(sk); unlock_sock_fast(sk, slow); - if (static_branch_unlikely(&udp_encap_needed_key) && up->encap_type) { - void (*encap_destroy)(struct sock *sk); - encap_destroy = READ_ONCE(up->encap_destroy); - if (encap_destroy) - encap_destroy(sk); + if (static_branch_unlikely(&udp_encap_needed_key)) { + if (up->encap_type) { + void (*encap_destroy)(struct sock *sk); + encap_destroy = READ_ONCE(up->encap_destroy); + if (encap_destroy) + encap_destroy(sk); + } + if (up->encap_enabled) + static_branch_disable(&udp_encap_needed_key); } } @@ -2447,7 +2452,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname, /* FALLTHROUGH */ case UDP_ENCAP_L2TPINUDP: up->encap_type = val; - udp_encap_enable(); + udp_tunnel_encap_enable(sk->sk_socket); break; default: err = -ENOPROTOOPT; diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index d2d97d07ef27..fc0ce6c59ebb 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1458,11 +1458,15 @@ void udpv6_destroy_sock(struct sock *sk) udp_v6_flush_pending_frames(sk); release_sock(sk); - if (static_branch_unlikely(&udpv6_encap_needed_key) && up->encap_type) { - void (*encap_destroy)(struct sock *sk); - encap_destroy = READ_ONCE(up->encap_destroy); - if (encap_destroy) - encap_destroy(sk); + if (static_branch_unlikely(&udpv6_encap_needed_key)) { + if (up->encap_type) { + void (*encap_destroy)(struct sock *sk); + encap_destroy = READ_ONCE(up->encap_destroy); + if (encap_destroy) + encap_destroy(sk); + } + if (up->encap_enabled) + static_branch_disable(&udpv6_encap_needed_key); }
[RFC PATCH v3 04/10] ip: factor out protocol delivery helper
So that we can re-use it at the UDP lavel in a later patch Signed-off-by: Paolo Abeni --- net/ipv4/ip_input.c | 73 ++--- 1 file changed, 36 insertions(+), 37 deletions(-) diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c index 35a786c0aaa0..72250b4e466d 100644 --- a/net/ipv4/ip_input.c +++ b/net/ipv4/ip_input.c @@ -188,51 +188,50 @@ bool ip_call_ra_chain(struct sk_buff *skb) return false; } -static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_buff *skb) +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int protocol) { - __skb_pull(skb, skb_network_header_len(skb)); - - rcu_read_lock(); - { - int protocol = ip_hdr(skb)->protocol; - const struct net_protocol *ipprot; - int raw; + const struct net_protocol *ipprot; + int raw, ret; - resubmit: - raw = raw_local_deliver(skb, protocol); +resubmit: + raw = raw_local_deliver(skb, protocol); - ipprot = rcu_dereference(inet_protos[protocol]); - if (ipprot) { - int ret; - - if (!ipprot->no_policy) { - if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) { - kfree_skb(skb); - goto out; - } - nf_reset(skb); + ipprot = rcu_dereference(inet_protos[protocol]); + if (ipprot) { + if (!ipprot->no_policy) { + if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) { + kfree_skb(skb); + return; } - ret = ipprot->handler(skb); - if (ret < 0) { - protocol = -ret; - goto resubmit; + nf_reset(skb); + } + ret = ipprot->handler(skb); + if (ret < 0) { + protocol = -ret; + goto resubmit; + } + __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS); + } else { + if (!raw) { + if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) { + __IP_INC_STATS(net, IPSTATS_MIB_INUNKNOWNPROTOS); + icmp_send(skb, ICMP_DEST_UNREACH, + ICMP_PROT_UNREACH, 0); } - __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS); + kfree_skb(skb); } else { - if (!raw) { - if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) { - __IP_INC_STATS(net, IPSTATS_MIB_INUNKNOWNPROTOS); - icmp_send(skb, ICMP_DEST_UNREACH, - ICMP_PROT_UNREACH, 0); - } - kfree_skb(skb); - } else { - __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS); - consume_skb(skb); - } + __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS); + consume_skb(skb); } } - out: +} + +static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_buff *skb) +{ + __skb_pull(skb, skb_network_header_len(skb)); + + rcu_read_lock(); + ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol); rcu_read_unlock(); return 0; -- 2.17.2
[RFC PATCH v3 06/10] udp: cope with UDP GRO packet misdirection
In some scenarios, the GRO engine can assemble an UDP GRO packet that ultimately lands on a non GRO-enabled socket. This patch tries to address the issue explicitly checking for the UDP socket features before enqueuing the packet, and eventually segmenting the unexpected GRO packet, as needed. We must also cope with re-insertion requests: after segmentation the UDP code calls the helper introduced by the previous patches, as needed. Segmentation is performed by a common helper, which takes care of updating socket and protocol stats is case of failure. rfc v2 -> rfc v3 - moved udp_rcv_segment() into net/udp.h, account errors to socket and ns, always return NULL or segs list Signed-off-by: Paolo Abeni --- include/linux/udp.h | 6 ++ include/net/udp.h | 51 ++--- net/ipv4/udp.c | 25 +- net/ipv6/udp.c | 27 +++- 4 files changed, 99 insertions(+), 10 deletions(-) diff --git a/include/linux/udp.h b/include/linux/udp.h index e23d5024f42f..0a9c54e76305 100644 --- a/include/linux/udp.h +++ b/include/linux/udp.h @@ -132,6 +132,12 @@ static inline void udp_cmsg_recv(struct msghdr *msg, struct sock *sk, } } +static inline bool udp_unexpected_gso(struct sock *sk, struct sk_buff *skb) +{ + return !udp_sk(sk)->gro_enabled && skb_is_gso(skb) && + skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4; +} + #define udp_portaddr_for_each_entry(__sk, list) \ hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node) diff --git a/include/net/udp.h b/include/net/udp.h index 9e82cb391dea..f94aed316a04 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -406,17 +406,24 @@ static inline int copy_linear_skb(struct sk_buff *skb, int len, int off, } while(0) #if IS_ENABLED(CONFIG_IPV6) -#define __UDPX_INC_STATS(sk, field)\ -do { \ - if ((sk)->sk_family == AF_INET) \ - __UDP_INC_STATS(sock_net(sk), field, 0);\ - else\ - __UDP6_INC_STATS(sock_net(sk), field, 0); \ -} while (0) +#define __UDPX_MIB(sk, ipv4) \ +({ \ + ipv4 ? (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \ +sock_net(sk)->mib.udp_statistics) :\ + (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_stats_in6 : \ +sock_net(sk)->mib.udp_stats_in6); \ +}) #else -#define __UDPX_INC_STATS(sk, field) __UDP_INC_STATS(sock_net(sk), field, 0) +#define __UDPX_MIB(sk, ipv4) \ +({ \ + IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \ +sock_net(sk)->mib.udp_statistics; \ +}) #endif +#define __UDPX_INC_STATS(sk, field) \ + __SNMP_INC_STATS(__UDPX_MIB(sk, (sk)->sk_family == AF_INET, field) + #ifdef CONFIG_PROC_FS struct udp_seq_afinfo { sa_family_t family; @@ -450,4 +457,32 @@ DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key); void udpv6_encap_enable(void); #endif +static inline struct sk_buff *udp_rcv_segment(struct sock *sk, + struct sk_buff *skb) +{ + bool ipv4 = skb->protocol == htons(ETH_P_IP); + int segs_nr = skb_shinfo(skb)->gso_segs; + struct sk_buff *segs; + + /* the GSO CB lays after the UDP one, no need to save and restore any +* CB fragment +*/ + segs = __skb_gso_segment(skb, NETIF_F_SG, false); + if (unlikely(IS_ERR(segs))) { + kfree_skb(skb); + goto drop; + } + + if (unlikely(!segs)) + goto drop; + + consume_skb(skb); + return segs; + +drop: + atomic_add(segs_nr, &sk->sk_drops); + SNMP_ADD_STATS(__UDPX_MIB(sk, ipv4), UDP_MIB_INERRORS, segs_nr); + return NULL; +} + #endif /* _UDP_H */ diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index b345f71b1cbb..b45033f63673 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1909,7 +1909,7 @@ EXPORT_SYMBOL(udp_encap_enable); * Note that in the success and error cases, the skb is assumed to * have either been requeued or freed. */ -static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) +static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb) { struct udp_sock *up = udp_sk(sk); int is_udplite = IS_UDPLITE(sk); @@ -2012,6 +2012,29 @@ static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) return -1; } +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, i
[RFC PATCH v3 05/10] ipv6: factor out protocol delivery helper
So that we can re-use it at the UDP lavel in the next patch Signed-off-by: Paolo Abeni --- net/ipv6/ip6_input.c | 28 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c index 96577e742afd..3065226bdc57 100644 --- a/net/ipv6/ip6_input.c +++ b/net/ipv6/ip6_input.c @@ -319,28 +319,26 @@ void ipv6_list_rcv(struct list_head *head, struct packet_type *pt, /* * Deliver the packet to the host */ - - -static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb) +void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int nexthdr, + bool have_final) { const struct inet6_protocol *ipprot; struct inet6_dev *idev; unsigned int nhoff; - int nexthdr; bool raw; - bool have_final = false; /* * Parse extension headers */ - rcu_read_lock(); resubmit: idev = ip6_dst_idev(skb_dst(skb)); - if (!pskb_pull(skb, skb_transport_offset(skb))) - goto discard; nhoff = IP6CB(skb)->nhoff; - nexthdr = skb_network_header(skb)[nhoff]; + if (!have_final) { + if (!pskb_pull(skb, skb_transport_offset(skb))) + goto discard; + nexthdr = skb_network_header(skb)[nhoff]; + } resubmit_final: raw = raw6_local_deliver(skb, nexthdr); @@ -411,13 +409,19 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk consume_skb(skb); } } - rcu_read_unlock(); - return 0; + return; discard: __IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS); - rcu_read_unlock(); kfree_skb(skb); +} + +static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *skb) +{ + rcu_read_lock(); + ip6_protocol_deliver_rcu(net, skb, 0, false); + rcu_read_unlock(); + return 0; } -- 2.17.2
[RFC PATCH v3 03/10] udp: add support for UDP_GRO cmsg
When UDP GRO is enabled, the UDP_GRO cmsg will carry the ingress datagram size. User-space can use such info to compute the original packets layout. Signed-off-by: Paolo Abeni --- Note: I avoided setting a bit in cmsg_flag for UDP_GRO, as that attempt produced some uglyfication, expecially on the ipv6 side with no measurable performances benefits. --- include/linux/udp.h | 11 +++ net/ipv4/udp.c | 4 net/ipv6/udp.c | 3 +++ 3 files changed, 18 insertions(+) diff --git a/include/linux/udp.h b/include/linux/udp.h index f613b329852e..e23d5024f42f 100644 --- a/include/linux/udp.h +++ b/include/linux/udp.h @@ -121,6 +121,17 @@ static inline bool udp_get_no_check6_rx(struct sock *sk) return udp_sk(sk)->no_check6_rx; } +static inline void udp_cmsg_recv(struct msghdr *msg, struct sock *sk, +struct sk_buff *skb) +{ + int gso_size; + + if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) { + gso_size = skb_shinfo(skb)->gso_size; + put_cmsg(msg, SOL_UDP, UDP_GRO, sizeof(gso_size), &gso_size); + } +} + #define udp_portaddr_for_each_entry(__sk, list) \ hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 4d4f4d044c28..b345f71b1cbb 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1714,6 +1714,10 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock, memset(sin->sin_zero, 0, sizeof(sin->sin_zero)); *addr_len = sizeof(*sin); } + + if (udp_sk(sk)->gro_enabled) + udp_cmsg_recv(msg, sk, skb); + if (inet->cmsg_flags) ip_cmsg_recv_offset(msg, sk, skb, sizeof(struct udphdr), off); diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index fc0ce6c59ebb..8e76e719305c 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -421,6 +421,9 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, *addr_len = sizeof(*sin6); } + if (udp_sk(sk)->gro_enabled) + udp_cmsg_recv(msg, sk, skb); + if (np->rxopt.all) ip6_datagram_recv_common_ctl(sk, msg, skb); -- 2.17.2
[RFC PATCH v3 00/10] udp: implement GRO support
This series implements GRO support for UDP sockets, as the RX counterpart of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT"). The core functionality is implemented by the second patch, introducing a new sockopt to enable UDP_GRO, while patch 3 implements support for passing the segment size to the user space via a new cmsg. UDP GRO performs a socket lookup for each ingress packets and aggregate datagram directed to UDP GRO enabled sockets with constant l4 tuple. UDP GRO packets can land on non GRO-enabled sockets, e.g. due to iptables NAT rules, and that could potentially confuse existing applications. The solution adopted here is to de-segment the GRO packet before enqueuing as needed. Since we must cope with packet reinsertion after de-segmentation, the relevant code is factored-out in ipv4 and ipv6 specific helpers and exposed to UDP usage. While the current code can probably be improved, this safeguard ,implemented in the patches 4-7, allows future enachements to enable UDP GSO offload on more virtual devices eventually even on forwarded packets. The last 4 for patches implement some performance and functional self-tests, re-using the existing udpgso infrastructure. The problematic scenario described above is explicitly tested. This revision of the series try to address the feedback provided by Willem, Steffen and Subash fixing several bugs all along rfc v2 - rfc v3: - cope better with exceptional conditions - test cases cleanup rfc v1 - rfc v2: - use a new option to enable UDP GRO - use static keys to protect the UDP GRO socket lookup - cope with UDP GRO misdirection - add self-tests Paolo Abeni (10): udp: implement complete book-keeping for encap_needed udp: implement GRO for plain UDP sockets. udp: add support for UDP_GRO cmsg ip: factor out protocol delivery helper ipv6: factor out protocol delivery helper udp: cope with UDP GRO packet misdirection selftests: add GRO support to udp bench rx program selftests: conditionally enable XDP support in udpgso_bench_rx selftests: add some benchmark for UDP GRO selftests: add functionals test for UDP GRO include/linux/udp.h | 25 ++- include/net/udp.h | 51 - include/net/udp_tunnel.h | 6 + include/uapi/linux/udp.h | 1 + net/ipv4/ip_input.c | 73 --- net/ipv4/udp.c| 54 - net/ipv4/udp_offload.c| 109 -- net/ipv6/ip6_input.c | 28 +-- net/ipv6/udp.c| 44 +++- net/ipv6/udp_offload.c| 6 +- tools/testing/selftests/net/Makefile | 70 +++ tools/testing/selftests/net/udpgro.sh | 147 + tools/testing/selftests/net/udpgro_bench.sh | 94 + tools/testing/selftests/net/udpgso_bench.sh | 2 +- tools/testing/selftests/net/udpgso_bench_rx.c | 193 -- tools/testing/selftests/net/udpgso_bench_tx.c | 22 +- tools/testing/selftests/net/xdp_dummy.c | 13 ++ 17 files changed, 816 insertions(+), 122 deletions(-) create mode 100755 tools/testing/selftests/net/udpgro.sh create mode 100755 tools/testing/selftests/net/udpgro_bench.sh create mode 100644 tools/testing/selftests/net/xdp_dummy.c -- 2.17.2
Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed
On 10/30/18 10:38 AM, Stephen Hemminger wrote: > On Tue, 30 Oct 2018 10:34:45 -0600 > David Ahern wrote: > >> On 10/30/18 9:05 AM, Stefano Brivio wrote: >>> Now that we have an abstraction for columns, it's relatively easy to >>> selectively display only some of them, and Yoann has a use case for it. >>> >>> Patch 1/3 fixes a rendering issue that shows up only when display of >>> arbitrary columns is disabled. Patch 2/3 implements the relevant option, >>> and patch 3/3 makes the output more readable when some columns are >>> disabled. >>> >>> >> >> I like the intent, and I have prototyped something similar for 'ip'. >> >> A more flexible approach is to use format strings to allow users to >> customize the output order and whitespace as well. So for ss and your >> column list (winging it here): >> >> netid = %N >> state = %S >> recv Q = %Qr >> send Q = %Qs >> local address = %Al >> lport port = %Pl >> remote address = %Ar >> remote port= %Pr >> process data = %p >> ... >> >> then a format string could be: "%S %Qr %Qs %Al:%Pl %Ar:%Pr %p\n" >> >> or for csv output: "%S,%Qr,%Qs,%Al,%Pl,%Ar,%Pr,%p\n" >> >> I have not had time to look into an implementation for ip. Conceptually >> - and scanning the kernel's vsprintf code - it does not look that >> difficult, just time consuming on the frontend with the initial setup. > > The problem with custom formats is that you lose all ability for Gcc > to check format strings. > Sure, trade-offs. A custom print string is powerful. While selecting columns is an improvement, column ordering is also important - even handling other output formats (csv).
Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed
On Tue, 30 Oct 2018 10:34:45 -0600 David Ahern wrote: > On 10/30/18 9:05 AM, Stefano Brivio wrote: > > Now that we have an abstraction for columns, it's relatively easy to > > selectively display only some of them, and Yoann has a use case for it. > > > > Patch 1/3 fixes a rendering issue that shows up only when display of > > arbitrary columns is disabled. Patch 2/3 implements the relevant option, > > and patch 3/3 makes the output more readable when some columns are > > disabled. > > > > > > I like the intent, and I have prototyped something similar for 'ip'. > > A more flexible approach is to use format strings to allow users to > customize the output order and whitespace as well. So for ss and your > column list (winging it here): > > netid = %N > state = %S > recv Q = %Qr > send Q = %Qs > local address = %Al > lport port = %Pl > remote address = %Ar > remote port= %Pr > process data = %p > ... > > then a format string could be: "%S %Qr %Qs %Al:%Pl %Ar:%Pr %p\n" > > or for csv output: "%S,%Qr,%Qs,%Al,%Pl,%Ar,%Pr,%p\n" > > I have not had time to look into an implementation for ip. Conceptually > - and scanning the kernel's vsprintf code - it does not look that > difficult, just time consuming on the frontend with the initial setup. The problem with custom formats is that you lose all ability for Gcc to check format strings.
Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed
On 10/30/18 9:05 AM, Stefano Brivio wrote: > Now that we have an abstraction for columns, it's relatively easy to > selectively display only some of them, and Yoann has a use case for it. > > Patch 1/3 fixes a rendering issue that shows up only when display of > arbitrary columns is disabled. Patch 2/3 implements the relevant option, > and patch 3/3 makes the output more readable when some columns are > disabled. > > I like the intent, and I have prototyped something similar for 'ip'. A more flexible approach is to use format strings to allow users to customize the output order and whitespace as well. So for ss and your column list (winging it here): netid = %N state = %S recv Q = %Qr send Q = %Qs local address = %Al lport port = %Pl remote address = %Ar remote port= %Pr process data = %p ... then a format string could be: "%S %Qr %Qs %Al:%Pl %Ar:%Pr %p\n" or for csv output: "%S,%Qr,%Qs,%Al,%Pl,%Ar,%Pr,%p\n" I have not had time to look into an implementation for ip. Conceptually - and scanning the kernel's vsprintf code - it does not look that difficult, just time consuming on the frontend with the initial setup.
[RFC bpf-next] libbpf: increase rlimit before trying to create BPF maps
The limit for memory locked in the kernel by a process is usually set to 64 bytes by default. This can be an issue when creating large BPF maps. A workaround is to raise this limit for the current process before trying to create a new BPF map. Changing the hard limit requires the CAP_SYS_RESOURCE and can usually only be done by root user (but then only root can create BPF maps). As far as I know there is not API to get the current amount of memory locked for a user, therefore we cannot raise the limit only when required. One solution, used by bcc, is to try to create the map, and on getting a EPERM error, raising the limit to infinity before giving another try. Another approach, used in iproute, is to raise the limit in all cases, before trying to create the map. Here we do the same as in iproute2: the rlimit is raised to infinity before trying to load the map. I send this patch as a RFC to see if people would prefer the bcc approach instead, or the rlimit change to be in bpftool rather than in libbpf. Signed-off-by: Quentin Monnet --- tools/lib/bpf/bpf.c | 5 + 1 file changed, 5 insertions(+) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 03f9bcc4ef50..456a5a7b112c 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -26,6 +26,8 @@ #include #include #include +#include +#include #include "bpf.h" #include "libbpf.h" #include @@ -68,8 +70,11 @@ static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr) { __u32 name_len = create_attr->name ? strlen(create_attr->name) : 0; + struct rlimit rinf = { RLIM_INFINITY, RLIM_INFINITY }; union bpf_attr attr; + setrlimit(RLIMIT_MEMLOCK, &rinf); + memset(&attr, '\0', sizeof(attr)); attr.map_type = create_attr->map_type; -- 2.7.4
Re: [BUG] MVPP2 driver exploding in presence of a tap interface
On 30/10/18 15:10, Thomas Petazzoni wrote: > Hello, > > On Tue, 30 Oct 2018 14:55:01 +, Marc Zyngier wrote: > >>> I.e, isn't the firmware fix papering over a bug that should be fixed in >>> Linux mvpp2 driver anyway ? >> >> Absolutely. Leaving this unpatched in the kernel, with a 100% chance of >> memory corruption is just mad. >> >> I'm pretty sure there should be a way to sanely reset the interface >> before it starts repainting the memory. > > I agree here. Do you still have an image of that old firmware version, > so that we can try to reproduce, and see if we can come up with a way > to reset the BM on boot up that would avoid this issue ? Yup. I still have both the original build tree as well as the sdcard, so you should be able to trigger on demand. I'll email you the stuff separately, unless you want another delivery method. Thanks, M. -- Jazz is not dead. It just smells funny...
Re: [BUG] MVPP2 driver exploding in presence of a tap interface
Hello, On Tue, 30 Oct 2018 14:55:01 +, Marc Zyngier wrote: > > I.e, isn't the firmware fix papering over a bug that should be fixed in > > Linux mvpp2 driver anyway ? > > Absolutely. Leaving this unpatched in the kernel, with a 100% chance of > memory corruption is just mad. > > I'm pretty sure there should be a way to sanely reset the interface > before it starts repainting the memory. I agree here. Do you still have an image of that old firmware version, so that we can try to reproduce, and see if we can come up with a way to reset the BM on boot up that would avoid this issue ? Thanks, Thomas -- Thomas Petazzoni, CTO, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
[PATCH iproute2 net-next 1/3] ss: Discard empty descriptor at the end of buffer, if any, before rendering
This will allow us to disable display of any given column. Signed-off-by: Stefano Brivio --- misc/ss.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index c8970438ce73..c3f61ef66258 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -1245,8 +1245,15 @@ static void render(void) token = (struct buf_token *)buffer.head->data; - /* Ensure end alignment of last token, it wasn't necessarily flushed */ - buffer.tail->end += buffer.cur->len % 2; + if (!buffer.cur->len) { + /* Last token was flushed, a new empty descriptor was appended: +* discard it +*/ + buffer.tail->end -= sizeof(buffer.cur->len); + } else { + /* Last token wasn't flushed: ensure end alignment */ + buffer.tail->end += buffer.cur->len % 2; + } render_calc_width(); -- 2.19.1
[PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed
Now that we have an abstraction for columns, it's relatively easy to selectively display only some of them, and Yoann has a use case for it. Patch 1/3 fixes a rendering issue that shows up only when display of arbitrary columns is disabled. Patch 2/3 implements the relevant option, and patch 3/3 makes the output more readable when some columns are disabled. Stefano Brivio (3): ss: Discard empty descriptor at the end of buffer, if any, before rendering ss: Introduce option to display selected columns only ss: Beautify output when arbitrary columns are hidden man/man8/ss.8 | 5 +++ misc/ss.c | 85 +++ 2 files changed, 77 insertions(+), 13 deletions(-) -- 2.19.1
[PATCH iproute2 net-next 3/3] ss: Beautify output when arbitrary columns are hidden
Define a secondary alignment for columns in case the next column is hidden, this avoids awkward outputs if e.g. the local address is shown, but not the local port. Omit embedded delimiter in socket specifiers if the port or service field is hidden. Signed-off-by: Stefano Brivio --- misc/ss.c | 30 -- 1 file changed, 20 insertions(+), 10 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index 91be3c6db151..d489233681e9 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -131,7 +131,8 @@ enum col_align { }; struct column { - const enum col_align align; + enum col_align align; + const enum col_align align_without_next; const char *optname; const char *header; const char *ldelim; @@ -141,15 +142,15 @@ struct column { }; static struct column columns[] = { - { ALIGN_LEFT, "netid","Netid","", 0, 0, 0 }, - { ALIGN_LEFT, "state","State"," ", 0, 0, 0 }, - { ALIGN_LEFT, "recvq","Recv-Q", " ", 0, 0, 0 }, - { ALIGN_LEFT, "sendq","Send-Q", " ", 0, 0, 0 }, - { ALIGN_RIGHT, "local","Local Address:", " ", 0, 0, 0 }, - { ALIGN_LEFT, "lport","Port", "", 0, 0, 0 }, - { ALIGN_RIGHT, "peer", "Peer Address:"," ", 0, 0, 0 }, - { ALIGN_LEFT, "pport","Port", "", 0, 0, 0 }, - { ALIGN_LEFT, "ext", "", "", 0, 0, 0 }, + { ALIGN_LEFT, ALIGN_LEFT, "netid", "Netid", "", 0, 0, 0 }, + { ALIGN_LEFT, ALIGN_LEFT, "state", "State", " ", 0, 0, 0 }, + { ALIGN_LEFT, ALIGN_LEFT, "recvq", "Recv-Q", " ", 0, 0, 0 }, + { ALIGN_LEFT, ALIGN_LEFT, "sendq", "Send-Q", " ", 0, 0, 0 }, + { ALIGN_RIGHT, ALIGN_LEFT, "local", "Local Address:", " ", 0, 0, 0 }, + { ALIGN_LEFT, ALIGN_LEFT, "lport", "Port", "", 0, 0, 0 }, + { ALIGN_RIGHT, ALIGN_LEFT, "peer", "Peer Address:", " ", 0, 0, 0 }, + { ALIGN_LEFT, ALIGN_LEFT, "pport", "Port", "", 0, 0, 0 }, + { ALIGN_LEFT, ALIGN_LEFT, "ext", "", "", 0, 0, 0 }, }; static struct column *current_field = columns; @@ -1374,6 +1375,9 @@ static void sock_details_print(struct sockstat *s) static void sock_addr_print(const char *addr, char *delim, const char *port, const char *ifname) { + if ((current_field + 1)->disabled) + delim = ""; + if (ifname) out("%s" "%%" "%s%s", addr, ifname, delim); else @@ -5006,6 +5010,12 @@ int main(int argc, char *argv[]) } p = p1 + 1; } while (p1); + + for (f = columns; field_is_valid(f + 1); f++) { + if ((f + 1)->disabled) + f->align = f->align_without_next; + } + break; } case 'h': -- 2.19.1
[PATCH iproute2 net-next 2/3] ss: Introduce option to display selected columns only
The new option --columns (short: -c) allows to select columns to be displayed. Note that this doesn't affect the order in which columns are displayed. Reported-by: Yoann P. Signed-off-by: Stefano Brivio --- man/man8/ss.8 | 5 + misc/ss.c | 62 ++- 2 files changed, 57 insertions(+), 10 deletions(-) diff --git a/man/man8/ss.8 b/man/man8/ss.8 index 7a6572b17364..c987dec6bcd7 100644 --- a/man/man8/ss.8 +++ b/man/man8/ss.8 @@ -24,6 +24,11 @@ Output version information. .B \-H, \-\-no-header Suppress header line. .TP +.B \-c COLS, \-\-columns=COLS +Only display selected columns, separated by commas. The following column names +are understood: netid, state, local, lport, peer, pport, ext. This does not +define the order of columns. +.TP .B \-n, \-\-numeric Do not try to resolve service names. .TP diff --git a/misc/ss.c b/misc/ss.c index c3f61ef66258..91be3c6db151 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -132,6 +132,7 @@ enum col_align { struct column { const enum col_align align; + const char *optname; const char *header; const char *ldelim; int disabled; @@ -140,15 +141,15 @@ struct column { }; static struct column columns[] = { - { ALIGN_LEFT, "Netid","", 0, 0, 0 }, - { ALIGN_LEFT, "State"," ",0, 0, 0 }, - { ALIGN_LEFT, "Recv-Q", " ",0, 0, 0 }, - { ALIGN_LEFT, "Send-Q", " ",0, 0, 0 }, - { ALIGN_RIGHT, "Local Address:", " ",0, 0, 0 }, - { ALIGN_LEFT, "Port", "", 0, 0, 0 }, - { ALIGN_RIGHT, "Peer Address:"," ",0, 0, 0 }, - { ALIGN_LEFT, "Port", "", 0, 0, 0 }, - { ALIGN_LEFT, "", "", 0, 0, 0 }, + { ALIGN_LEFT, "netid","Netid","", 0, 0, 0 }, + { ALIGN_LEFT, "state","State"," ", 0, 0, 0 }, + { ALIGN_LEFT, "recvq","Recv-Q", " ", 0, 0, 0 }, + { ALIGN_LEFT, "sendq","Send-Q", " ", 0, 0, 0 }, + { ALIGN_RIGHT, "local","Local Address:", " ", 0, 0, 0 }, + { ALIGN_LEFT, "lport","Port", "", 0, 0, 0 }, + { ALIGN_RIGHT, "peer", "Peer Address:"," ", 0, 0, 0 }, + { ALIGN_LEFT, "pport","Port", "", 0, 0, 0 }, + { ALIGN_LEFT, "ext", "", "", 0, 0, 0 }, }; static struct column *current_field = columns; @@ -1073,6 +1074,11 @@ static int field_is_last(struct column *f) return f - columns == COL_MAX - 1; } +static int field_is_valid(struct column *f) +{ + return f >= columns && f - columns < COL_MAX; +} + static void field_next(void) { field_flush(current_field); @@ -4666,6 +4672,8 @@ static void _usage(FILE *dest) "\n" " -K, --kill forcibly close sockets, display what was closed\n" " -H, --no-header Suppress header line\n" +" -c, --columns=COLS display only COLS columns\n" +" COLS := {netid|state|local|lport|peer|pport|ext}[,COLS]\n" "\n" " -A, --query=QUERY, --socket=QUERY\n" " QUERY := {all|inet|tcp|udp|raw|unix|unix_dgram|unix_stream|unix_seqpacket|packet|netlink|vsock_stream|vsock_dgram|tipc}[,QUERY]\n" @@ -4785,6 +4793,7 @@ static const struct option long_opts[] = { { "tipcinfo", 0, 0, OPT_TIPCINFO}, { "kill", 0, 0, 'K' }, { "no-header", 0, 0, 'H' }, + { "columns", 1, 0, 'c' }, { 0 } }; @@ -4800,7 +4809,7 @@ int main(int argc, char *argv[]) int state_filter = 0; while ((ch = getopt_long(argc, argv, -"dhaletuwxnro460spbEf:miA:D:F:vVzZN:KHS", +"dhaletuwxnro460spbEf:miA:D:F:vVzZN:KHc:S", long_opts, NULL)) != EOF) { switch (ch) { case 'n': @@ -4966,6 +4975,39 @@ int main(int argc, char *argv[]) case 'H': show_header = 0; break; + case 'c': + { + struct column *f; + char *p, *p1; + + if (!optarg) { + fprintf(stderr, "ss: No columns given.\n"); + usage(); + } + + for (f = columns; field_is_valid(f); f++) + f->disabled = 1; + + p = optarg; + do { + p1 = strchr(p, ','); + if (p1) + *p1 = 0; + for (f = columns; field_is_valid(f); f++) { + if (!strcmp(f->optname, p)) { +
Re: [BUG] MVPP2 driver exploding in presence of a tap interface
On 30/10/18 13:00, Thomas Petazzoni wrote: > Hello Marcin, > > Thanks for the feedback. > > On Tue, 30 Oct 2018 13:37:37 +0100, Marcin Wojtas wrote: > >> You use _really_ archaic firmware, the bug you see is 99% caused by a >> bug already fixed long time ago (cleanup all PP2 BM pools correctly >> during exit boot services). Please grab the latest release: >> https://github.com/MarvellEmbeddedProcessors/edk2-open-platform/wiki/files/flash-image-18.09.4.bin >> and let know if you observe any further issues with vanilla kernel. > > Even if this was a bug in the UEFI firmware, shouldn't the kernel be > independent from that, by doing a proper reset/reinit of the HW ? > > I.e, isn't the firmware fix papering over a bug that should be fixed in > Linux mvpp2 driver anyway ? Absolutely. Leaving this unpatched in the kernel, with a 100% chance of memory corruption is just mad. I'm pretty sure there should be a way to sanely reset the interface before it starts repainting the memory. And if there is none, we must find a way to tell the user that the machine is a death trap. Really. M. PS: updating the FW to the version provided by Marcin indeed makes things much more reliable. Thanks for that. -- Jazz is not dead. It just smells funny...
Re: [PATCH v1 0/5] can: add SAE J1939 protocol
On 10/08/2018 11:48 AM, Oleksij Rempel wrote: > This series adds SAE J1939 support to the current kernel v4.19-rc6. > > This stack has long history, starting back in 27 Apr 2011, if not > earlier: > https://lists.openwall.net/netdev/2011/04/27/45 > > After major rework and testing it is a time to send it mainline. I've removed some trailing newlines and added the stack to linux-can-next/j1939. Marc -- Pengutronix e.K. | Marc Kleine-Budde | Industrial Linux Solutions| Phone: +49-231-2826-924 | Vertretung West/Dortmund | Fax: +49-5121-206917- | Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de | signature.asc Description: OpenPGP digital signature
Re: Latest net-next kernel 4.19.0+
On 10/30/2018 01:09 AM, Paweł Staszewski wrote: > > > W dniu 30.10.2018 o 08:29, Eric Dumazet pisze: >> >> On 10/29/2018 11:09 PM, Dimitris Michailidis wrote: >> >>> Indeed this is a bug. I would expect it to produce frequent errors >>> though as many odd-length >>> packets would trigger it. Do you have RXFCS? Regardless, how >>> frequently do you see the problem? >>> >> Old kernels (before 88078d98d1bb) were simply resetting ip_summed to >> CHECKSUM_NONE >> >> And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the >> bug you fixed. >> >> So we now need to also fix mlx5. >> >> And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned >> earlier, >> plus __get_unaligned_cpu32() as you hinted. >> >> >> >> > > No RXFCS > > And this trace is rly frequently like once per 3/4 seconds > like below: > [28965.776864] vlan1490: hw csum failure Might be vlan related. Can you first check this : diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index 94224c22ecc310a87b6715051e335446f29bec03..6f4bfebf0d9a3ae7567062abb3ea6532b3aaf3d6 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -789,13 +789,8 @@ static inline void mlx5e_handle_csum(struct net_device *netdev, skb->ip_summed = CHECKSUM_COMPLETE; skb->csum = csum_unfold((__force __sum16)cqe->check_sum); if (network_depth > ETH_HLEN) - /* CQE csum is calculated from the IP header and does -* not cover VLAN headers (if present). This will add -* the checksum manually. -*/ - skb->csum = csum_partial(skb->data + ETH_HLEN, -network_depth - ETH_HLEN, -skb->csum); + /* Temporary debugging */ + skb->ip_summed = CHECKSUM_NONE; if (unlikely(netdev->features & NETIF_F_RXFCS)) skb->csum = csum_add(skb->csum, (__force __wsum)mlx5e_get_fcs(skb));
[Patch V5 net 03/11] net: hns3: bugfix for reporting unknown vector0 interrupt repeatly problem
The current driver supports handling two vector0 interrupts, reset and mailbox. When the hardware reports an interrupt of another type of interrupt source, if the driver does not process the interrupt, but enables the interrupt, the hardware will repeatedly report the unknown interrupt. Therefore, the driver enables the vector0 interrupt after clearing the known type of interrupt source. Other conditions are not enabled. Fixes: cd8c5c269b1d ("net: hns3: Fix for hclge_reset running repeatly problem") Signed-off-by: Huazhong Tan --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 5234b53..2a63147 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -2236,7 +2236,7 @@ static irqreturn_t hclge_misc_irq_handle(int irq, void *data) } /* clear the source of interrupt if it is not cause by reset */ - if (event_cause != HCLGE_VECTOR0_EVENT_RST) { + if (event_cause == HCLGE_VECTOR0_EVENT_MBX) { hclge_clear_event_cause(hdev, event_cause, clearval); hclge_enable_vector(&hdev->misc_vector, true); } -- 2.7.4
[Patch V5 net 01/11] net: hns3: add error handler for hns3_nic_init_vector_data()
When hns3_nic_init_vector_data() fails to map ring to vector, it should cancel the netif_napi_add() that has been successfully done and then exits. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Huazhong Tan --- V5: Fixes comments from Joe Perches --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 32f3aca8..0b4323b 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -2821,7 +2821,7 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv *priv) struct hnae3_handle *h = priv->ae_handle; struct hns3_enet_tqp_vector *tqp_vector; int ret = 0; - u16 i; + int i; hns3_nic_set_cpumask(priv); @@ -2868,13 +2868,19 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv *priv) hns3_free_vector_ring_chain(tqp_vector, &vector_ring_chain); if (ret) - return ret; + goto map_ring_fail; netif_napi_add(priv->netdev, &tqp_vector->napi, hns3_nic_common_poll, NAPI_POLL_WEIGHT); } return 0; + +map_ring_fail: + while (i--) + netif_napi_del(&priv->tqp_vector[i].napi); + + return ret; } static int hns3_nic_alloc_vector_data(struct hns3_nic_priv *priv) -- 2.7.4
[Patch V5 net 08/11] net: hns3: fix incorrect return value/type of some functions
There are some functions that, when they fail to send the command, need to return the corresponding error value to its caller. Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Fixes: 681ec3999b3d ("net: hns3: fix for vlan table lost problem when resetting") Signed-off-by: Huazhong Tan --- V2: Fixes the compilation error reported by kbuild test robot --- drivers/net/ethernet/hisilicon/hns3/hnae3.h| 6 +- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 80 +++--- drivers/net/ethernet/hisilicon/hns3/hns3_enet.h| 2 +- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 34 - .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 2 +- .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 14 ++-- 6 files changed, 85 insertions(+), 53 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index e82e4ca..055b406 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -316,8 +316,8 @@ struct hnae3_ae_ops { int (*set_loopback)(struct hnae3_handle *handle, enum hnae3_loop loop_mode, bool en); - void (*set_promisc_mode)(struct hnae3_handle *handle, bool en_uc_pmc, -bool en_mc_pmc); + int (*set_promisc_mode)(struct hnae3_handle *handle, bool en_uc_pmc, + bool en_mc_pmc); int (*set_mtu)(struct hnae3_handle *handle, int new_mtu); void (*get_pauseparam)(struct hnae3_handle *handle, @@ -391,7 +391,7 @@ struct hnae3_ae_ops { int vector_num, struct hnae3_ring_chain_node *vr_chain); - void (*reset_queue)(struct hnae3_handle *handle, u16 queue_id); + int (*reset_queue)(struct hnae3_handle *handle, u16 queue_id); u32 (*get_fw_version)(struct hnae3_handle *handle); void (*get_mdix_mode)(struct hnae3_handle *handle, u8 *tp_mdix_ctrl, u8 *tp_mdix); diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index bf71c23..3f96aa3 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -509,16 +509,18 @@ static void hns3_nic_set_rx_mode(struct net_device *netdev) h->netdev_flags = new_flags; } -void hns3_update_promisc_mode(struct net_device *netdev, u8 promisc_flags) +int hns3_update_promisc_mode(struct net_device *netdev, u8 promisc_flags) { struct hns3_nic_priv *priv = netdev_priv(netdev); struct hnae3_handle *h = priv->ae_handle; if (h->ae_algo->ops->set_promisc_mode) { - h->ae_algo->ops->set_promisc_mode(h, - promisc_flags & HNAE3_UPE, - promisc_flags & HNAE3_MPE); + return h->ae_algo->ops->set_promisc_mode(h, + promisc_flags & HNAE3_UPE, + promisc_flags & HNAE3_MPE); } + + return 0; } void hns3_enable_vlan_filter(struct net_device *netdev, bool enable) @@ -1494,18 +1496,22 @@ static int hns3_vlan_rx_kill_vid(struct net_device *netdev, return ret; } -static void hns3_restore_vlan(struct net_device *netdev) +static int hns3_restore_vlan(struct net_device *netdev) { struct hns3_nic_priv *priv = netdev_priv(netdev); + int ret = 0; u16 vid; - int ret; for_each_set_bit(vid, priv->active_vlans, VLAN_N_VID) { ret = hns3_vlan_rx_add_vid(netdev, htons(ETH_P_8021Q), vid); - if (ret) - netdev_warn(netdev, "Restore vlan: %d filter, ret:%d\n", - vid, ret); + if (ret) { + netdev_err(netdev, "Restore vlan: %d filter, ret:%d\n", + vid, ret); + return ret; + } } + + return ret; } static int hns3_ndo_set_vf_vlan(struct net_device *netdev, int vf, u16 vlan, @@ -3257,11 +3263,12 @@ int hns3_uninit_all_ring(struct hns3_nic_priv *priv) } /* Set mac addr if it is configured. or leave it to the AE driver */ -static void hns3_init_mac_addr(struct net_device *netdev, bool init) +static int hns3_init_mac_addr(struct net_device *netdev, bool init) { struct hns3_nic_priv *priv = netdev_priv(netdev); struct hnae3_handle *h = priv->ae_handle; u8 mac_addr_temp[ETH_ALEN]; + int ret = 0; if (h->ae_algo->ops->get_mac_addr && init) { h->ae_algo->ops->get_mac_addr(h, mac_addr_temp); @@ -3276,8 +3283,9 @@ static void hns3_init_mac_addr(struct net_device *netdev, bool init) } if (h->
[Patch V5 net 10/11] net: hns3: bugfix for rtnl_lock's range in the hclge_reset()
Since hclge_reset_wait() is used to wait for the hardware to complete the reset, it is not necessary to hold the rtnl_lock during hclge_reset_wait(). So this patch releases the lock for the duration of hclge_reset_wait(). Fixes: 6d4fab39533f ("net: hns3: Reset net device with rtnl_lock") Signed-off-by: Huazhong Tan --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index f3212c9..ffdd960 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -2470,14 +2470,17 @@ static void hclge_reset(struct hclge_dev *hdev) handle = &hdev->vport[0].nic; rtnl_lock(); hclge_notify_client(hdev, HNAE3_DOWN_CLIENT); + rtnl_unlock(); if (!hclge_reset_wait(hdev)) { + rtnl_lock(); hclge_notify_client(hdev, HNAE3_UNINIT_CLIENT); hclge_reset_ae_dev(hdev->ae_dev); hclge_notify_client(hdev, HNAE3_INIT_CLIENT); hclge_clear_reset_cause(hdev); } else { + rtnl_lock(); /* schedule again to check pending resets later */ set_bit(hdev->reset_type, &hdev->reset_pending); hclge_reset_task_schedule(hdev); -- 2.7.4
[Patch V5 net 11/11] net: hns3: bugfix for rtnl_lock's range in the hclgevf_reset()
Since hclgevf_reset_wait() is used to wait for the hardware to complete the reset, it is not necessary to hold the rtnl_lock during hclgevf_reset_wait(). So this patch releases the lock for the duration of hclgevf_reset_wait(). Fixes: 6988eb2a9b77 ("net: hns3: Add support to reset the enet/ring mgmt layer") Signed-off-by: Huazhong Tan --- drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c index b224f6a..085edb9 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -1170,6 +1170,8 @@ static int hclgevf_reset(struct hclgevf_dev *hdev) /* bring down the nic to stop any ongoing TX/RX */ hclgevf_notify_client(hdev, HNAE3_DOWN_CLIENT); + rtnl_unlock(); + /* check if VF could successfully fetch the hardware reset completion * status from the hardware */ @@ -1181,12 +1183,15 @@ static int hclgevf_reset(struct hclgevf_dev *hdev) ret); dev_warn(&hdev->pdev->dev, "VF reset failed, disabling VF!\n"); + rtnl_lock(); hclgevf_notify_client(hdev, HNAE3_UNINIT_CLIENT); rtnl_unlock(); return ret; } + rtnl_lock(); + /* now, re-initialize the nic client and ae device*/ ret = hclgevf_reset_stack(hdev); if (ret) -- 2.7.4
[Patch V5 net 02/11] net: hns3: bugfix for buffer not free problem during resetting
When hns3_get_ring_config()/hns3_queue_to_ring()/ hns3_get_vector_ring_chain() failed during resetting, the allocated memory has not been freed before these three functions return. So this patch adds error handler in these functions to fix it. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Huazhong Tan --- V5: Fixes comments from Sergei Shtylyov add error handler for hns3_get_vector_ring_chain() --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 24 +--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 0b4323b..b767ff9 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -2727,7 +2727,7 @@ static int hns3_get_vector_ring_chain(struct hns3_enet_tqp_vector *tqp_vector, chain = devm_kzalloc(&pdev->dev, sizeof(*chain), GFP_KERNEL); if (!chain) - return -ENOMEM; + goto err_free_chain; cur_chain->next = chain; chain->tqp_index = tx_ring->tqp->tqp_index; @@ -2757,7 +2757,7 @@ static int hns3_get_vector_ring_chain(struct hns3_enet_tqp_vector *tqp_vector, while (rx_ring) { chain = devm_kzalloc(&pdev->dev, sizeof(*chain), GFP_KERNEL); if (!chain) - return -ENOMEM; + goto err_free_chain; cur_chain->next = chain; chain->tqp_index = rx_ring->tqp->tqp_index; @@ -2772,6 +2772,16 @@ static int hns3_get_vector_ring_chain(struct hns3_enet_tqp_vector *tqp_vector, } return 0; + +err_free_chain: + cur_chain = head->next; + while (cur_chain) { + chain = cur_chain->next; + devm_kfree(&pdev->dev, chain); + cur_chain = chain; + } + + return -ENOMEM; } static void hns3_free_vector_ring_chain(struct hns3_enet_tqp_vector *tqp_vector, @@ -3037,8 +3047,10 @@ static int hns3_queue_to_ring(struct hnae3_queue *tqp, return ret; ret = hns3_ring_get_cfg(tqp, priv, HNAE3_RING_TYPE_RX); - if (ret) + if (ret) { + devm_kfree(priv->dev, priv->ring_data[tqp->tqp_index].ring); return ret; + } return 0; } @@ -3065,6 +3077,12 @@ static int hns3_get_ring_config(struct hns3_nic_priv *priv) return 0; err: + while (i--) { + devm_kfree(priv->dev, priv->ring_data[i].ring); + devm_kfree(priv->dev, + priv->ring_data[i + h->kinfo.num_tqps].ring); + } + devm_kfree(&pdev->dev, priv->ring_data); return ret; } -- 2.7.4
[Patch V5 net 05/11] net: hns3: remove unnecessary queue reset in the hns3_uninit_all_ring()
It is not necessary to reset the queue in the hns3_uninit_all_ring(), since the queue is stopped in the down operation, and will be reset in the up operation. And the judgment of the HCLGE_STATE_RST_HANDLING flag in the hclge_reset_tqp() is not correct, because we need to reset tqp during pf reset, otherwise it may cause queue not being reset to working state problem. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Huazhong Tan --- V4: Fixes comments from Sergei Shtylyov V3: Fixes comments from Sergei Shtylyov --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 3 --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 3 --- 2 files changed, 6 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index b767ff9..bf71c23 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -3250,9 +3250,6 @@ int hns3_uninit_all_ring(struct hns3_nic_priv *priv) int i; for (i = 0; i < h->kinfo.num_tqps; i++) { - if (h->ae_algo->ops->reset_queue) - h->ae_algo->ops->reset_queue(h, i); - hns3_fini_ring(priv->ring_data[i].ring); hns3_fini_ring(priv->ring_data[i + h->kinfo.num_tqps].ring); } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index 2a63147..4dd0506 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -6116,9 +6116,6 @@ void hclge_reset_tqp(struct hnae3_handle *handle, u16 queue_id) u16 queue_gid; int ret; - if (test_bit(HCLGE_STATE_RST_HANDLING, &hdev->state)) - return; - queue_gid = hclge_covert_handle_qid_global(handle, queue_id); ret = hclge_tqp_enable(hdev, queue_id, 0, false); -- 2.7.4
[Patch V5 net 07/11] net: hns3: bugfix for hclge_mdio_write and hclge_mdio_read
When there is a PHY, the driver needs to complete some operations through MDIO during reset reinitialization, so HCLGE_STATE_CMD_DISABLE is more suitable than HCLGE_STATE_RST_HANDLING to prevent the MDIO operation from being sent during the hardware reset. Fixes: b50ae26c57cb ("net: hns3: never send command queue message to IMP when reset) Signed-off-by: Huazhong Tan --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c index 24b1f2a..0301863 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c @@ -52,7 +52,7 @@ static int hclge_mdio_write(struct mii_bus *bus, int phyid, int regnum, struct hclge_desc desc; int ret; - if (test_bit(HCLGE_STATE_RST_HANDLING, &hdev->state)) + if (test_bit(HCLGE_STATE_CMD_DISABLE, &hdev->state)) return 0; hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_MDIO_CONFIG, false); @@ -90,7 +90,7 @@ static int hclge_mdio_read(struct mii_bus *bus, int phyid, int regnum) struct hclge_desc desc; int ret; - if (test_bit(HCLGE_STATE_RST_HANDLING, &hdev->state)) + if (test_bit(HCLGE_STATE_CMD_DISABLE, &hdev->state)) return 0; hclge_cmd_setup_basic_desc(&desc, HCLGE_OPC_MDIO_CONFIG, true); -- 2.7.4
[Patch V5 net 04/11] net: hns3: bugfix for the initialization of command queue's spin lock
The spin lock of the command queue only need to be initialized once when the driver initializes the command queue. It is not necessary to initialize the spin lock when resetting. At the same time, the modification of the queue member should be performed after acquiring the lock. Fixes: 3efb960f056d ("net: hns3: Refactor the initialization of command queue") Signed-off-by: Huazhong Tan --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c index ac13cb2..68026a5 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c @@ -304,6 +304,10 @@ int hclge_cmd_queue_init(struct hclge_dev *hdev) { int ret; + /* Setup the lock for command queue */ + spin_lock_init(&hdev->hw.cmq.csq.lock); + spin_lock_init(&hdev->hw.cmq.crq.lock); + /* Setup the queue entries for use cmd queue */ hdev->hw.cmq.csq.desc_num = HCLGE_NIC_CMQ_DESC_NUM; hdev->hw.cmq.crq.desc_num = HCLGE_NIC_CMQ_DESC_NUM; @@ -337,18 +341,20 @@ int hclge_cmd_init(struct hclge_dev *hdev) u32 version; int ret; + spin_lock_bh(&hdev->hw.cmq.csq.lock); + spin_lock_bh(&hdev->hw.cmq.crq.lock); + hdev->hw.cmq.csq.next_to_clean = 0; hdev->hw.cmq.csq.next_to_use = 0; hdev->hw.cmq.crq.next_to_clean = 0; hdev->hw.cmq.crq.next_to_use = 0; - /* Setup the lock for command queue */ - spin_lock_init(&hdev->hw.cmq.csq.lock); - spin_lock_init(&hdev->hw.cmq.crq.lock); - hclge_cmd_init_regs(&hdev->hw); clear_bit(HCLGE_STATE_CMD_DISABLE, &hdev->state); + spin_unlock_bh(&hdev->hw.cmq.crq.lock); + spin_unlock_bh(&hdev->hw.cmq.csq.lock); + ret = hclge_cmd_query_firmware_version(&hdev->hw, &version); if (ret) { dev_err(&hdev->pdev->dev, -- 2.7.4
[Patch V5 net 09/11] net: hns3: bugfix for handling mailbox while the command queue reinitialized
In a multi-core machine, the mailbox service and reset service will be executed at the same time. The reset service will re-initialize the command queue, before that, the mailbox handler can only get some invalid messages. The HCLGE_STATE_CMD_DISABLE flag means that the command queue is not available and needs to be reinitialized. Therefore, when the mailbox handler recognizes this flag, it should not process the command. Fixes: dde1a86e93ca ("net: hns3: Add mailbox support to PF driver") Signed-off-by: Huazhong Tan --- V3: Fixes comments from Sergei Shtylyov --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c index 04462a3..f890022 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c @@ -400,6 +400,12 @@ void hclge_mbx_handler(struct hclge_dev *hdev) /* handle all the mailbox requests in the queue */ while (!hclge_cmd_crq_empty(&hdev->hw)) { + if (test_bit(HCLGE_STATE_CMD_DISABLE, &hdev->state)) { + dev_warn(&hdev->pdev->dev, +"command queue needs re-initializing\n"); + return; + } + desc = &crq->desc[crq->next_to_use]; req = (struct hclge_mbx_vf_to_pf_cmd *)desc->data; -- 2.7.4
[Patch V5 net 06/11] net: hns3: bugfix for is_valid_csq_clean_head()
The HEAD pointer of the hardware command queue maybe equal to the command queue's next_to_use in the driver, so that does not belong to the invalid HEAD pointer, since the hardware may not process the command in time, causing the HEAD pointer to be too late to update. The variables' name in this function is unreadable, so give them a more readable one. Fixes: 3ff504908f95 ("net: hns3: fix a dead loop in hclge_cmd_csq_clean") Signed-off-by: Huazhong Tan --- drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c index 68026a5..690f62e 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c @@ -24,15 +24,15 @@ static int hclge_ring_space(struct hclge_cmq_ring *ring) return ring->desc_num - used - 1; } -static int is_valid_csq_clean_head(struct hclge_cmq_ring *ring, int h) +static int is_valid_csq_clean_head(struct hclge_cmq_ring *ring, int head) { - int u = ring->next_to_use; - int c = ring->next_to_clean; + int ntu = ring->next_to_use; + int ntc = ring->next_to_clean; - if (unlikely(h >= ring->desc_num)) - return 0; + if (ntu > ntc) + return head >= ntc && head <= ntu; - return u > c ? (h > c && h <= u) : (h > c || h <= u); + return head >= ntc || head <= ntu; } static int hclge_alloc_cmd_desc(struct hclge_cmq_ring *ring) -- 2.7.4
[Patch V5 net 00/11] Bugfix for the HNS3 driver
This patch series include bugfix for the HNS3 ethernet controller driver. Change log: V4->V5: Fixes comments from Joe Perches & Sergei Shtylyov V3->V4: Fixes comments from Sergei Shtylyov V2->V3: Fixes comments from Sergei Shtylyov V1->V2: Fixes the compilation break reported by kbuild test robot http://patchwork.ozlabs.org/patch/989818/ Huazhong Tan (11): net: hns3: add error handler for hns3_nic_init_vector_data() net: hns3: bugfix for buffer not free problem during resetting net: hns3: bugfix for reporting unknown vector0 interrupt repeatly problem net: hns3: bugfix for the initialization of command queue's spin lock net: hns3: remove unnecessary queue reset in the hns3_uninit_all_ring() net: hns3: bugfix for is_valid_csq_clean_head() net: hns3: bugfix for hclge_mdio_write and hclge_mdio_read net: hns3: fix incorrect return value/type of some functions net: hns3: bugfix for handling mailbox while the command queue reinitialized net: hns3: bugfix for rtnl_lock's range in the hclge_reset() net: hns3: bugfix for rtnl_lock's range in the hclgevf_reset() drivers/net/ethernet/hisilicon/hns3/hnae3.h| 6 +- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 117 +++-- drivers/net/ethernet/hisilicon/hns3/hns3_enet.h| 2 +- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c | 26 +++-- .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 42 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h| 2 +- .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mbx.c | 6 ++ .../ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c| 4 +- .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c | 19 ++-- 9 files changed, 147 insertions(+), 77 deletions(-) -- 2.7.4
Re: [PATCH net] net: ethernet: cadence: fix socket buffer corruption problem
On 29.10.2018 23:40, Tristram Ha - C24268 wrote: >> Could you, please, tell me if the above variable was false in your case? >> >> bool cloned = skb_cloned(*skb) || skb_header_cloned(*skb); >> >> If yes, then, the proper fix would be as follows: >> >> diff --git a/drivers/net/ethernet/cadence/macb_main.c >> b/drivers/net/ethernet/cadence/macb_main.c >> index 8f5bf9166c11..492a8e1a34cd 100644 >> --- a/drivers/net/ethernet/cadence/macb_main.c >> +++ b/drivers/net/ethernet/cadence/macb_main.c >> @@ -1690,7 +1690,7 @@ static int macb_pad_and_fcs(struct sk_buff **skb, >> struct net_device *ndev) >> padlen += ETH_FCS_LEN; >> } >> >> - if (!cloned && headroom + tailroom >= padlen) { >> + if (!cloned && headroom + tailroom >= ETH_FCS_LEN) { >> (*skb)->data = memmove((*skb)->head, (*skb)->data, >> (*skb)->len); >> skb_set_tail_pointer(*skb, (*skb)->len); >> } else { >> >> Could you please check if it works in your case (and without your patch)? >> > > Actually doing that reveals another bug: > > if (padlen) { > if (padlen >= ETH_FCS_LEN) > skb_put_zero(*skb, padlen - ETH_FCS_LEN); > else > skb_trim(*skb, ETH_FCS_LEN - padlen); > } > > My fix calls skb_put_zero with zero length. Your change calls skb_trim which > actually sets the socket buffer length to 1! > > When this problem happens headroom is 2, tailroom is 1, skb->len is 61, and > padlen is 3. > Ok, I see now. Looking again, your fix is good. But, as you said, there is the skb_trim() in this function that is wrong from the beginning (my bad). I propose to remove it since, with your fix is not even reached anymore. Could you check on your side that applying this on top of your patch, your scenario is still working? diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index 1d86b4d5645a..e1347d6d1b50 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -1702,12 +1702,8 @@ static int macb_pad_and_fcs(struct sk_buff **skb, struct net_device *ndev) *skb = nskb; } - if (padlen) { - if (padlen >= ETH_FCS_LEN) - skb_put_zero(*skb, padlen - ETH_FCS_LEN); - else - skb_trim(*skb, ETH_FCS_LEN - padlen); - } + if (padlen > ETH_FCS_LEN) + skb_put_zero(*skb, padlen - ETH_FCS_LEN); > DSA driver is being used. That is why the length is already padded to 60 > bytes > and 1-byte tail tag is added. Ok, I see, I didn't test with such configurations. > > BTW, I am not sure while this macb_pad_and_fcs function was added. Is it to > workaround some hardware bugs? The code is executed only when > NETIF_IF_HW_CSUM is used. But if hardware tx checksumming is enabled why > not also use the hardware to calculate CRC? It was reported in [1] that UDP checksum is offloaded to hardware no matter the application previously computed it. The code should be executed only for packets that has checksum computed by applications ((*skb)->ip_summed != CHECKSUM_PARTIAL). The idea was to not recompute checksum for packets with checksum already computed. To do so, while hardware checksum is enabled (NETIF_F_HW_CSUM), TX_NOCRC bit should be set on buffer descriptor. But to do so, packets must have a minimum size of 64 and FCS to be computed. The NETIF_F_HW_CSUM check was placed there because the issue described in [1] is reproducible because hardware checksum is enabled and overrides the checksum provided by applications. [1] https://www.spinics.net/lists/netdev/msg505065.html > > NETIF_F_SG is not enabled in the MAC I used, so enabling NETIF_IF_HW_CSUM > is rather pointless. With the padding code the transmit throughput cannot get > higher than 100Mbps in a gigabit connection. > > I would recommend to add this option to disable manual padding in one of those > macb_config structures. In this way the user would have to know from the beginning what kind of packets are used. Thank you, Claudiu Beznea >
Re: [BUG] MVPP2 driver exploding in presence of a tap interface
Hello Marcin, Thanks for the feedback. On Tue, 30 Oct 2018 13:37:37 +0100, Marcin Wojtas wrote: > You use _really_ archaic firmware, the bug you see is 99% caused by a > bug already fixed long time ago (cleanup all PP2 BM pools correctly > during exit boot services). Please grab the latest release: > https://github.com/MarvellEmbeddedProcessors/edk2-open-platform/wiki/files/flash-image-18.09.4.bin > and let know if you observe any further issues with vanilla kernel. Even if this was a bug in the UEFI firmware, shouldn't the kernel be independent from that, by doing a proper reset/reinit of the HW ? I.e, isn't the firmware fix papering over a bug that should be fixed in Linux mvpp2 driver anyway ? Best regards, Thomas -- Thomas Petazzoni, CTO, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
Re: [BUG] MVPP2 driver exploding in presence of a tap interface
Marcin, On 30/10/18 12:37, Marcin Wojtas wrote: > [Resend in UTF-8] > > Hi Marc, > > You use _really_ archaic firmware, the bug you see is 99% caused by a Please let me fix this for you: s/_really_ archaic/released/ > bug already fixed long time ago (cleanup all PP2 BM pools correctly > during exit boot services). How long ago? Why didn't you say so when I reported the bug to you and Antoine back in January? Also, why isn't that "clean-up" taken care of by the Linux driver? Exiting boot services itself doesn't seem to cause the issue, and it is setting the interface up that causes it. > Please grab the latest release: > https://github.com/MarvellEmbeddedProcessors/edk2-open-platform/wiki/files/flash-image-18.09.4.bin > and let know if you observe any further issues with vanilla kernel. What does this image contain? Thanks, M. -- Jazz is not dead. It just smells funny...
Re: [BUG] MVPP2 driver exploding in presence of a tap interface
[Resend in UTF-8] Hi Marc, You use _really_ archaic firmware, the bug you see is 99% caused by a bug already fixed long time ago (cleanup all PP2 BM pools correctly during exit boot services). Please grab the latest release: https://github.com/MarvellEmbeddedProcessors/edk2-open-platform/wiki/files/flash-image-18.09.4.bin and let know if you observe any further issues with vanilla kernel. Best regards, Marcin wt., 30 paź 2018 o 13:16 Marc Zyngier napisał(a): > > Antoine, > > On 30/10/18 10:50, Antoine Tenart wrote: > > Marc, > > > > On Mon, Oct 29, 2018 at 03:05:53PM +, Marc Zyngier wrote: > >> > >> This is a follow-up on the conversation Thomas and I had last week at > >> ELC, with me ranting at the sorry state of the MVPP2 driver. > > > >> Triggering this is dead simple: > >> - Add a macvtap to one of the MVPP2 interfaces > >> - Bring it online > >> - Watch the kernel exploding and memory being corrupted > >> > >> You don't even need anything listening on the tap interface, just its > >> simple existence triggers it. I use a similar setup on a large variety > >> of machines, and this box is the only one that catches fire. Removing > >> the macvtap interface makes it (more) reliable. > >> > >> Given that I cannot reproduce this issue on any other ARM (32 or 64bit) > >> platform, including other Marvell stuff, I can only conclude that the > >> MVPP2 driver is responsible for this. > >> > >> Example crash and .config below (4.19 vanilla, as linux/master dies in > >> new and wonderful ways on this box). I'm looking forward to testing any > >> idea you may have. > > > > I used a 4.19 vanilla kernel, with both your configuration and mine, > > on 2 different Macchiatobins, but was unable to trigger the issue: > > > > # ip link set eth0 up > > # ip link add link eth0 name macvtap0 type macvtap > > # ip link set macvtap0 up> > > I can even configure the eth0/macvtap0 interfaces, and use them > > generating or receiving tcp/udp/icmp traffic. > > > > (I also made other tests using macvtap and tap interfaces). > > > > How much memory do you have on the board? What version of ATF are you > > using? Version of U-Boot? > > 4GB of RAM. As for the version numbers, see below. I don't use u-boot, > but UEFI (EDK-II v2.60). The problem can be reproduced on two different > machines, with the same configuration (and firmwares dating from a > similar era): > > Starting CP-0 IOROM 1.07 > Booting from SD 0 (0x29) > Found valid image at boot postion 0x002 > lNOTICE: Starting binary extension > NOTICE: Gathering DRAM information > mv_ddr: mv_ddr-armada-17.06.1-g47f4c8b (Jun 2 2017 - 17:07:23) > mv_ddr: completed successfully > NOTICE: Booting Trusted Firmware > NOTICE: BL1: v1.3(release):armada-17.06.2:297d68f > NOTICE: BL1: Built : 17:07:27, Jun 2 2017 > NOTICE: BL1: Booting BL2 > lNOTICE: BL2: v1.3(release):armada-17.06.2:297d68f > NOTICE: BL2: Built : 17:07:28, Jun 2 2017 > NOTICE: BL1: Booting BL31 > lNOTICE: BL31: v1.3(release):armada-17.06.2:297d68f > NOTICE: BL31: Built : 17:07:30, Jun 2 2017 > lUEFI firmware (version MARVELL_EFI built at 17:12:21 on Jun 2 2017) > > Armada 8040 MachiatoBin Platform Init > > Comphy0-0: PCIE0 5 Gbps > Comphy0-1: PCIE0 5 Gbps > Comphy0-2: PCIE0 5 Gbps > Comphy0-3: PCIE0 5 Gbps > Comphy0-4: SFI 10.31 Gbps > Comphy0-5: SATA1 5 Gbps > > Comphy1-0: SGMII11.25 Gbps > Comphy1-1: SATA2 5 Gbps > Comphy1-2: USB3_HOST05 Gbps > Comphy1-3: SATA3 5 Gbps > Comphy1-4: SFI 10.31 Gbps > Comphy1-5: SGMII23.125 Gbps > > UTMI PHY 0 initialized to USB Host0 > UTMI PHY 1 initialized to USB Host1 > UTMI PHY 0 initialized to USB Host0 > RTC: Initialize controller 1 > Skip I2c chip 0 > Succesfully installed protocol interfaces > ramdisk:blckio install. Status=Success > > With the latest mainline, and after fixing that other irq affinity > bug (see patch posted yesterday), I only need to bring the interface > up, without doing anything else: > > # ip link set eth0 up > [ 155.507877] mvpp2 f200.ethernet eth0: PHY [f212a600.mdio-mii:00] > driver [mv88x3310] > [ 155.526732] mvpp2 f200.ethernet eth0: configuring for phy/10gbase-kr > link mode > [ 157.592581] mvpp2 f200.ethernet eth0: Link is Up - 1Gbps/Full - flow > control rx/tx > [ 158.339396] BUG: Bad page state in process swapper/0 pfn:e6804 > [ 158.345345] page:7e00039a0100 count:0 mapcount:0 > mapping:8000e7bf3b00 index:0x8000e6804c00 > [ 158.354696] flags: 0xfffc200(slab) > [ 158.358815] raw: 0fffc200 7e00039cff80 00040004 > 8000e7bf3b00 > [ 158.366594] raw: 8000e6804c00 801f > > [ 158.374371] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set > [ 158.380840] bad because of flags: 0x200(slab) > [ 158.385216] Modules linked in: > [ 158.388288] CPU: 0 PID: 0 Comm: swapper/0 Not tainted > 4.19.0-09420-g34ae82ac683
Re: [BUG] MVPP2 driver exploding in presence of a tap interface
Antoine, On 30/10/18 10:50, Antoine Tenart wrote: > Marc, > > On Mon, Oct 29, 2018 at 03:05:53PM +, Marc Zyngier wrote: >> >> This is a follow-up on the conversation Thomas and I had last week at >> ELC, with me ranting at the sorry state of the MVPP2 driver. > >> Triggering this is dead simple: >> - Add a macvtap to one of the MVPP2 interfaces >> - Bring it online >> - Watch the kernel exploding and memory being corrupted >> >> You don't even need anything listening on the tap interface, just its >> simple existence triggers it. I use a similar setup on a large variety >> of machines, and this box is the only one that catches fire. Removing >> the macvtap interface makes it (more) reliable. >> >> Given that I cannot reproduce this issue on any other ARM (32 or 64bit) >> platform, including other Marvell stuff, I can only conclude that the >> MVPP2 driver is responsible for this. >> >> Example crash and .config below (4.19 vanilla, as linux/master dies in >> new and wonderful ways on this box). I'm looking forward to testing any >> idea you may have. > > I used a 4.19 vanilla kernel, with both your configuration and mine, > on 2 different Macchiatobins, but was unable to trigger the issue: > > # ip link set eth0 up > # ip link add link eth0 name macvtap0 type macvtap > # ip link set macvtap0 up> > I can even configure the eth0/macvtap0 interfaces, and use them > generating or receiving tcp/udp/icmp traffic. > > (I also made other tests using macvtap and tap interfaces). > > How much memory do you have on the board? What version of ATF are you > using? Version of U-Boot? 4GB of RAM. As for the version numbers, see below. I don't use u-boot, but UEFI (EDK-II v2.60). The problem can be reproduced on two different machines, with the same configuration (and firmwares dating from a similar era): Starting CP-0 IOROM 1.07 Booting from SD 0 (0x29) Found valid image at boot postion 0x002 lNOTICE: Starting binary extension NOTICE: Gathering DRAM information mv_ddr: mv_ddr-armada-17.06.1-g47f4c8b (Jun 2 2017 - 17:07:23) mv_ddr: completed successfully NOTICE: Booting Trusted Firmware NOTICE: BL1: v1.3(release):armada-17.06.2:297d68f NOTICE: BL1: Built : 17:07:27, Jun 2 2017 NOTICE: BL1: Booting BL2 lNOTICE: BL2: v1.3(release):armada-17.06.2:297d68f NOTICE: BL2: Built : 17:07:28, Jun 2 2017 NOTICE: BL1: Booting BL31 lNOTICE: BL31: v1.3(release):armada-17.06.2:297d68f NOTICE: BL31: Built : 17:07:30, Jun 2 2017 lUEFI firmware (version MARVELL_EFI built at 17:12:21 on Jun 2 2017) Armada 8040 MachiatoBin Platform Init Comphy0-0: PCIE0 5 Gbps Comphy0-1: PCIE0 5 Gbps Comphy0-2: PCIE0 5 Gbps Comphy0-3: PCIE0 5 Gbps Comphy0-4: SFI 10.31 Gbps Comphy0-5: SATA1 5 Gbps Comphy1-0: SGMII11.25 Gbps Comphy1-1: SATA2 5 Gbps Comphy1-2: USB3_HOST05 Gbps Comphy1-3: SATA3 5 Gbps Comphy1-4: SFI 10.31 Gbps Comphy1-5: SGMII23.125 Gbps UTMI PHY 0 initialized to USB Host0 UTMI PHY 1 initialized to USB Host1 UTMI PHY 0 initialized to USB Host0 RTC: Initialize controller 1 Skip I2c chip 0 Succesfully installed protocol interfaces ramdisk:blckio install. Status=Success With the latest mainline, and after fixing that other irq affinity bug (see patch posted yesterday), I only need to bring the interface up, without doing anything else: # ip link set eth0 up [ 155.507877] mvpp2 f200.ethernet eth0: PHY [f212a600.mdio-mii:00] driver [mv88x3310] [ 155.526732] mvpp2 f200.ethernet eth0: configuring for phy/10gbase-kr link mode [ 157.592581] mvpp2 f200.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx [ 158.339396] BUG: Bad page state in process swapper/0 pfn:e6804 [ 158.345345] page:7e00039a0100 count:0 mapcount:0 mapping:8000e7bf3b00 index:0x8000e6804c00 [ 158.354696] flags: 0xfffc200(slab) [ 158.358815] raw: 0fffc200 7e00039cff80 00040004 8000e7bf3b00 [ 158.366594] raw: 8000e6804c00 801f [ 158.374371] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set [ 158.380840] bad because of flags: 0x200(slab) [ 158.385216] Modules linked in: [ 158.388288] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0-09420-g34ae82ac683c #278 [ 158.396148] Hardware name: Marvell 8040 MACCHIATOBin (DT) [ 158.401567] Call trace: [ 158.404031] dump_backtrace+0x0/0x148 [ 158.407708] show_stack+0x14/0x20 [ 158.411036] dump_stack+0x90/0xb4 [ 158.414365] bad_page+0x104/0x130 [ 158.417692] free_pages_check_bad+0x9c/0xa8 [ 158.421892] __free_pages_ok+0x1b0/0x450 [ 158.425829] page_frag_free+0x8c/0xa8 [ 158.429505] skb_free_head+0x18/0x30 [ 158.433093] skb_release_data+0x130/0x160 [ 158.437117] skb_release_all+0x24/0x30 [ 158.440881] consume_skb+0x2c/0x58 [ 158.444296] arp_process.constprop.4+0x200/0x6f0 [ 158.448931] arp_rcv+0xf4/0x128 [ 158.45208
Re: [PATCH] xfrm: Fix error return code in xfrm_output_one()
On Sat, Oct 27, 2018 at 06:12:06AM +, Wei Yongjun wrote: > xfrm_output_one() does not return a error code when there is > no dst_entry attached to the skb, it is still possible crash > with a NULL pointer dereference in xfrm_output_resume(). Fix > it by return error code -EHOSTUNREACH. > > Fixes: 9e1437937807 ("xfrm: Fix NULL pointer dereference when skb_dst_force > clears the dst_entry.") > Signed-off-by: Wei Yongjun Applied, thanks a lot!
Re: [PATCH net v5] net/ipv6: Add anycast addresses to a global hashtable
I originally started implementing it the way you suggested; however, it seemed to complicate management of that structure because it isn't currently using rcu. Also, assuming that can be worked out, where would I get the net from? Would I need to store a copy in ifcaddr6, or is there some way to access it during ipv6_chk_acast_addr()? It seems that if I don't add a copy of net, but instead access it through aca_rt(?), then freeing the ifcaddr6 memory becomes problematic (detaching it from idev, while read_rcu may still be accessing it). On Mon, Oct 29, 2018 at 11:32 PM David Miller wrote: > > From: Jeff Barnhill <0xeff...@gmail.com> > Date: Sun, 28 Oct 2018 01:51:59 + > > > +struct ipv6_ac_addrlist { > > + struct in6_addr acal_addr; > > + possible_net_t acal_pnet; > > + refcount_t acal_users; > > + struct hlist_node acal_lst; /* inet6_acaddr_lst */ > > + struct rcu_head rcu; > > +}; > > Please just add the hlist to ifcaddr6 instead of duplicating so much > information and reference counters here. > > This seems to waste a lot of memory unnecessary and add lots of > unnecessary object allocate/setup/destroy logic.
Re: Fw: [Bug 201423] New: eth0: hw csum failure
On 30.10.2018 11:58, Andre Tomt wrote: On 27.10.2018 23:41, Andre Tomt wrote: On 26.10.2018 13:45, Andre Tomt wrote: On 25.10.2018 19:38, Eric Dumazet wrote: On 10/24/2018 12:41 PM, Andre Tomt wrote: It eventually showed up again with mlx4, on 4.18.16 + fix and also on 4.19. I still do not have a useful packet capture. It is running a torrent client serving up various linux distributions. Have you also applied this fix ? https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=db4f1be3ca9b0ef7330763d07bf4ace83ad6f913 No. I've applied it now to 4.19 and will report back if anything shows up. Just hit it on the simpler server; no VRF, no tunnels, no nat/conntrack. Only a basic stateless nftables ruleset and a vlan netdev (unlikely to be the one triggering this I guess; it has only v4 traffic). I'm currently testing 4.19 with the recomended commit added, plus these to sort out some GRO issues (on a hunch, unsure if related): https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=a8305bff685252e80b7c60f4f5e7dd2e63e38218 https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=992cba7e276d438ac8b0a8c17b147b37c8c286f7 https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=ece23711dd956cd5053c9cb03e9fe0668f9c8894 and I *think* it is behaving better now? it's not conclusive as it could take a while to trip in this environment but some of the test servers have not shown anything bad in almost 24h. Sorry, s/some of the/none of the
Re: Fw: [Bug 201423] New: eth0: hw csum failure
On 27.10.2018 23:41, Andre Tomt wrote: On 26.10.2018 13:45, Andre Tomt wrote: On 25.10.2018 19:38, Eric Dumazet wrote: On 10/24/2018 12:41 PM, Andre Tomt wrote: It eventually showed up again with mlx4, on 4.18.16 + fix and also on 4.19. I still do not have a useful packet capture. It is running a torrent client serving up various linux distributions. Have you also applied this fix ? https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=db4f1be3ca9b0ef7330763d07bf4ace83ad6f913 No. I've applied it now to 4.19 and will report back if anything shows up. Just hit it on the simpler server; no VRF, no tunnels, no nat/conntrack. Only a basic stateless nftables ruleset and a vlan netdev (unlikely to be the one triggering this I guess; it has only v4 traffic). I'm currently testing 4.19 with the recomended commit added, plus these to sort out some GRO issues (on a hunch, unsure if related): https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=a8305bff685252e80b7c60f4f5e7dd2e63e38218 https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=992cba7e276d438ac8b0a8c17b147b37c8c286f7 https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=ece23711dd956cd5053c9cb03e9fe0668f9c8894 and I *think* it is behaving better now? it's not conclusive as it could take a while to trip in this environment but some of the test servers have not shown anything bad in almost 24h.
Re: [BUG] MVPP2 driver exploding in presence of a tap interface
Marc, On Mon, Oct 29, 2018 at 03:05:53PM +, Marc Zyngier wrote: > > This is a follow-up on the conversation Thomas and I had last week at > ELC, with me ranting at the sorry state of the MVPP2 driver. > Triggering this is dead simple: > - Add a macvtap to one of the MVPP2 interfaces > - Bring it online > - Watch the kernel exploding and memory being corrupted > > You don't even need anything listening on the tap interface, just its > simple existence triggers it. I use a similar setup on a large variety > of machines, and this box is the only one that catches fire. Removing > the macvtap interface makes it (more) reliable. > > Given that I cannot reproduce this issue on any other ARM (32 or 64bit) > platform, including other Marvell stuff, I can only conclude that the > MVPP2 driver is responsible for this. > > Example crash and .config below (4.19 vanilla, as linux/master dies in > new and wonderful ways on this box). I'm looking forward to testing any > idea you may have. I used a 4.19 vanilla kernel, with both your configuration and mine, on 2 different Macchiatobins, but was unable to trigger the issue: # ip link set eth0 up # ip link add link eth0 name macvtap0 type macvtap # ip link set macvtap0 up I can even configure the eth0/macvtap0 interfaces, and use them generating or receiving tcp/udp/icmp traffic. (I also made other tests using macvtap and tap interfaces). How much memory do you have on the board? What version of ATF are you using? Version of U-Boot? Antoine -- Antoine Ténart, Bootlin Embedded Linux and Kernel engineering https://bootlin.com
Re: [Patch V4 net 01/11] net: hns3: add error handler for hns3_nic_init_vector_data()
On 2018/10/30 17:11, Sergei Shtylyov wrote: On 10/29/2018 4:54 PM, Huazhong Tan wrote: When hns3_nic_init_vector_data() fails to map ring to vector, it should cancel the netif_napi_add() that has been successfully done and then exits. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Huazhong Tan --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 32f3aca8..d9066c5 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -2821,7 +2821,7 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv *priv) struct hnae3_handle *h = priv->ae_handle; struct hns3_enet_tqp_vector *tqp_vector; int ret = 0; - u16 i; + int i, j; hns3_nic_set_cpumask(priv); @@ -2868,13 +2868,19 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv *priv) hns3_free_vector_ring_chain(tqp_vector, &vector_ring_chain); if (ret) - return ret; + goto map_ring_fail; netif_napi_add(priv->netdev, &tqp_vector->napi, hns3_nic_common_poll, NAPI_POLL_WEIGHT); } return 0; + +map_ring_fail: + for (j = i - 1; j >= 0; j--) + netif_napi_del(&priv->tqp_vector[j].napi); 'j' doesn't seem needed as well. yes, it will be change to below one. + +map_ring_fail: + while(i--) + netif_napi_del(&priv->tqp_vector[i].napi); + + return ret; [...] MBR, Sergei . Thanks, Huazhong.
Re: [Patch V4 net 02/11] net: hns3: add error handler for hns3_get_ring_config/hns3_queue_to_ring
On 2018/10/30 17:09, Sergei Shtylyov wrote: Hello! On 10/29/2018 4:54 PM, Huazhong Tan wrote: When hns3_get_ring_config()/hns3_queue_to_ring() failed during resetting, the allocated memory has not been freed before hns3_get_ring_config() and hns3_queue_to_ring() return. So this patch fixes the buffer not freeing problem during resetting. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Huazhong Tan --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index d9066c5..6f0fd62 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c [...] @@ -3047,7 +3049,7 @@ static int hns3_get_ring_config(struct hns3_nic_priv *priv) { struct hnae3_handle *h = priv->ae_handle; struct pci_dev *pdev = h->pdev; - int i, ret; + int i, j, ret; priv->ring_data = devm_kzalloc(&pdev->dev, array3_size(h->kinfo.num_tqps, @@ -3065,6 +3067,12 @@ static int hns3_get_ring_config(struct hns3_nic_priv *priv) return 0; err: + for (j = i - 1; j >= 0; j--) { As is with the other patch, you don't need 'j' here. Yes, i have modified it. Thanks. + devm_kfree(priv->dev, priv->ring_data[j].ring); + devm_kfree(priv->dev, + priv->ring_data[j + h->kinfo.num_tqps].ring); + } + devm_kfree(&pdev->dev, priv->ring_data); return ret; } MBR, Sergei Greeting. Huazhong. .
Re: [Patch V4 net 01/11] net: hns3: add error handler for hns3_nic_init_vector_data()
On 10/29/2018 4:54 PM, Huazhong Tan wrote: When hns3_nic_init_vector_data() fails to map ring to vector, it should cancel the netif_napi_add() that has been successfully done and then exits. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Huazhong Tan --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 32f3aca8..d9066c5 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -2821,7 +2821,7 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv *priv) struct hnae3_handle *h = priv->ae_handle; struct hns3_enet_tqp_vector *tqp_vector; int ret = 0; - u16 i; + int i, j; hns3_nic_set_cpumask(priv); @@ -2868,13 +2868,19 @@ static int hns3_nic_init_vector_data(struct hns3_nic_priv *priv) hns3_free_vector_ring_chain(tqp_vector, &vector_ring_chain); if (ret) - return ret; + goto map_ring_fail; netif_napi_add(priv->netdev, &tqp_vector->napi, hns3_nic_common_poll, NAPI_POLL_WEIGHT); } return 0; + +map_ring_fail: + for (j = i - 1; j >= 0; j--) + netif_napi_del(&priv->tqp_vector[j].napi); 'j' doesn't seem needed as well. [...] MBR, Sergei
[bug report] PCI: Remove NULL device handling from PCI DMA API
Hello Christoph Hellwig, The patch 4167b2ad5182: "PCI: Remove NULL device handling from PCI DMA API" from Jan 10, 2018, leads to the following static checker warning: drivers/net/ethernet/amd/pcnet32.c:1921 pcnet32_probe1() warn: variable dereferenced before check 'pdev' (see line 1843) drivers/net/ethernet/amd/pcnet32.c 1839 1840 dev->base_addr = ioaddr; 1841 lp = netdev_priv(dev); 1842 /* pci_alloc_consistent returns page-aligned memory, so we do not have to check the alignment */ 1843 lp->init_block = pci_alloc_consistent(pdev, sizeof(*lp->init_block), This function is called with a NULL "pdev" when we're probing from pcnet32_probe_vlbus(). 1844&lp->init_dma_addr); 1845 if (!lp->init_block) { 1846 if (pcnet32_debug & NETIF_MSG_PROBE) 1847 pr_err("Consistent memory allocation failed\n"); 1848 ret = -ENOMEM; 1849 goto err_free_netdev; 1850 } 1851 lp->pci_dev = pdev; 1852 1853 lp->dev = dev; 1854 regards, dan carpenter
Re: [Patch V4 net 02/11] net: hns3: add error handler for hns3_get_ring_config/hns3_queue_to_ring
Hello! On 10/29/2018 4:54 PM, Huazhong Tan wrote: When hns3_get_ring_config()/hns3_queue_to_ring() failed during resetting, the allocated memory has not been freed before hns3_get_ring_config() and hns3_queue_to_ring() return. So this patch fixes the buffer not freeing problem during resetting. Fixes: 76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Huazhong Tan --- drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index d9066c5..6f0fd62 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c [...] @@ -3047,7 +3049,7 @@ static int hns3_get_ring_config(struct hns3_nic_priv *priv) { struct hnae3_handle *h = priv->ae_handle; struct pci_dev *pdev = h->pdev; - int i, ret; + int i, j, ret; priv->ring_data = devm_kzalloc(&pdev->dev, array3_size(h->kinfo.num_tqps, @@ -3065,6 +3067,12 @@ static int hns3_get_ring_config(struct hns3_nic_priv *priv) return 0; err: + for (j = i - 1; j >= 0; j--) { As is with the other patch, you don't need 'j' here. + devm_kfree(priv->dev, priv->ring_data[j].ring); + devm_kfree(priv->dev, + priv->ring_data[j + h->kinfo.num_tqps].ring); + } + devm_kfree(&pdev->dev, priv->ring_data); return ret; } MBR, Sergei
Re: Latest net-next kernel 4.19.0+
W dniu 30.10.2018 o 08:29, Eric Dumazet pisze: On 10/29/2018 11:09 PM, Dimitris Michailidis wrote: Indeed this is a bug. I would expect it to produce frequent errors though as many odd-length packets would trigger it. Do you have RXFCS? Regardless, how frequently do you see the problem? Old kernels (before 88078d98d1bb) were simply resetting ip_summed to CHECKSUM_NONE And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug you fixed. So we now need to also fix mlx5. And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned earlier, plus __get_unaligned_cpu32() as you hinted. No RXFCS And this trace is rly frequently like once per 3/4 seconds like below: [28965.776864] vlan1490: hw csum failure [28965.776867] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1 [28965.776868] Call Trace: [28965.776870] [28965.776876] dump_stack+0x46/0x5b [28965.776879] __skb_checksum_complete+0x9a/0xa0 [28965.776882] tcp_v4_rcv+0xef/0x960 [28965.776884] ip_local_deliver_finish+0x49/0xd0 [28965.776886] ip_local_deliver+0x5e/0xe0 [28965.776888] ? ip_sublist_rcv_finish+0x50/0x50 [28965.776889] ip_rcv+0x41/0xc0 [28965.776891] __netif_receive_skb_one_core+0x4b/0x70 [28965.776893] netif_receive_skb_internal+0x2f/0xd0 [28965.776894] napi_gro_receive+0xb7/0xe0 [28965.776897] mlx5e_handle_rx_cqe+0x7a/0xd0 [28965.776899] mlx5e_poll_rx_cq+0xc6/0x930 [28965.776900] mlx5e_napi_poll+0xab/0xc90 [28965.776904] ? kmem_cache_free_bulk+0x1e4/0x280 [28965.776905] net_rx_action+0x1f1/0x320 [28965.776909] __do_softirq+0xec/0x2b7 [28965.776912] irq_exit+0x7b/0x80 [28965.776913] do_IRQ+0x45/0xc0 [28965.776915] common_interrupt+0xf/0xf [28965.776916] [28965.776918] RIP: 0010:mwait_idle+0x5f/0x1b0 [28965.776919] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [28965.776920] RSP: 0018:82203e98 EFLAGS: 0246 ORIG_RAX: ffd3 [28965.776921] RAX: RBX: RCX: [28965.776922] RDX: RSI: RDI: [28965.776922] RBP: R08: 00aa R09: 88046f81fbc0 [28965.776923] R10: R11: 0001006d5985 R12: 8220f780 [28965.776924] R13: 8220f780 R14: R15: [28965.776927] do_idle+0x1a3/0x1c0 [28965.776929] cpu_startup_entry+0x14/0x20 [28965.776932] start_kernel+0x488/0x4a8 [28965.776935] secondary_startup_64+0xa4/0xb0 [28965.981529] vlan1490: hw csum failure [28965.981531] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1 [28965.981532] Call Trace: [28965.981534] [28965.981539] dump_stack+0x46/0x5b [28965.981543] __skb_checksum_complete+0x9a/0xa0 [28965.981545] tcp_v4_rcv+0xef/0x960 [28965.981548] ip_local_deliver_finish+0x49/0xd0 [28965.981550] ip_local_deliver+0x5e/0xe0 [28965.981551] ? ip_sublist_rcv_finish+0x50/0x50 [28965.981552] ip_rcv+0x41/0xc0 [28965.981555] __netif_receive_skb_one_core+0x4b/0x70 [28965.981556] netif_receive_skb_internal+0x2f/0xd0 [28965.981558] napi_gro_receive+0xb7/0xe0 [28965.981560] mlx5e_handle_rx_cqe+0x7a/0xd0 [28965.981562] mlx5e_poll_rx_cq+0xc6/0x930 [28965.981563] mlx5e_napi_poll+0xab/0xc90 [28965.981567] ? kmem_cache_free_bulk+0x1e4/0x280 [28965.981568] net_rx_action+0x1f1/0x320 [28965.981571] __do_softirq+0xec/0x2b7 [28965.981575] irq_exit+0x7b/0x80 [28965.981576] do_IRQ+0x45/0xc0 [28965.981578] common_interrupt+0xf/0xf [28965.981579] [28965.981580] RIP: 0010:mwait_idle+0x5f/0x1b0 [28965.981582] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0 [28965.981583] RSP: 0018:82203e98 EFLAGS: 0246 ORIG_RAX: ffd3 [28965.981584] RAX: RBX: RCX: [28965.981585] RDX: RSI: RDI: [28965.981586] RBP: R08: 0383 R09: 88046f81fbc0 [28965.981586] R10: R11: 0001006d59b8 R12: 8220f780 [28965.981587] R13: 8220f780 R14: R15: [28965.981591] do_idle+0x1a3/0x1c0 [28965.981592] cpu_startup_entry+0x14/0x20 [28965.981596] start_kernel+0x488/0x4a8 [28965.981600] secondary_startup_64+0xa4/0xb0 [28966.511782] vlan1490: hw csum failure [28966.511785] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1 [28966.511785] Call Trace: [28966.511787] [28966.511793] dump_stack+0x46/0x5b [28966.511797] __skb_checksum_complete+0x9a/0xa0 [28966.511799] tcp_v4_rcv+0xef/0x960 [28966.511802] ip_local_deliver_finish+0x49/0xd0 [28966.511804] ip_local_deliver+0x5e/0xe0 [28966.511806] ? ip_sublist_rcv_finish+0x50/0x50 [
[PATCH net] net/mlx5e: fix csum adjustments caused by RXFCS
As shown by Dmitris, we need to use csum_block_add() instead of csum_add() when adding the FCS contribution to skb csum. Before 4.18 (more exactly commit 88078d98d1bb "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"), the whole skb csum was thrown away, so RXFCS changes were ignored. Then before commit d55bef5059dd ("net: fix pskb_trim_rcsum_slow() with odd trim offset") both mlx5 and pskb_trim_rcsum_slow() bugs were canceling each other. Now we fixed pskb_trim_rcsum_slow() we need to fix mlx5. Note that this patch also rewrites mlx5e_get_fcs() to : - Use skb_header_pointer() instead of reinventing it. - Use __get_unaligned_cpu32() to avoid possible non aligned accesses as Dmitris pointed out. Fixes: 902a545904c7 ("net/mlx5e: When RXFCS is set, add FCS data into checksum calculation") Reported-by: Paweł Staszewski Signed-off-by: Eric Dumazet Cc: Eran Ben Elisha Cc: Saeed Mahameed Cc: Dimitris Michailidis Cc: Cong Wang Cc: Paweł Staszewski --- .../net/ethernet/mellanox/mlx5/core/en_rx.c | 45 --- 1 file changed, 9 insertions(+), 36 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index 94224c22ecc310a87b6715051e335446f29bec03..79638dcbae78395fb723c9bf3fa877e7a42d91cd 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -713,43 +713,15 @@ static inline void mlx5e_enable_ecn(struct mlx5e_rq *rq, struct sk_buff *skb) rq->stats->ecn_mark += !!rc; } -static __be32 mlx5e_get_fcs(struct sk_buff *skb) +static u32 mlx5e_get_fcs(const struct sk_buff *skb) { - int last_frag_sz, bytes_in_prev, nr_frags; - u8 *fcs_p1, *fcs_p2; - skb_frag_t *last_frag; - __be32 fcs_bytes; + const void *fcs_bytes; + u32 _fcs_bytes; - if (!skb_is_nonlinear(skb)) - return *(__be32 *)(skb->data + skb->len - ETH_FCS_LEN); + fcs_bytes = skb_header_pointer(skb, skb->len - ETH_FCS_LEN, + ETH_FCS_LEN, &_fcs_bytes); - nr_frags = skb_shinfo(skb)->nr_frags; - last_frag = &skb_shinfo(skb)->frags[nr_frags - 1]; - last_frag_sz = skb_frag_size(last_frag); - - /* If all FCS data is in last frag */ - if (last_frag_sz >= ETH_FCS_LEN) - return *(__be32 *)(skb_frag_address(last_frag) + - last_frag_sz - ETH_FCS_LEN); - - fcs_p2 = (u8 *)skb_frag_address(last_frag); - bytes_in_prev = ETH_FCS_LEN - last_frag_sz; - - /* Find where the other part of the FCS is - Linear or another frag */ - if (nr_frags == 1) { - fcs_p1 = skb_tail_pointer(skb); - } else { - skb_frag_t *prev_frag = &skb_shinfo(skb)->frags[nr_frags - 2]; - - fcs_p1 = skb_frag_address(prev_frag) + - skb_frag_size(prev_frag); - } - fcs_p1 -= bytes_in_prev; - - memcpy(&fcs_bytes, fcs_p1, bytes_in_prev); - memcpy(((u8 *)&fcs_bytes) + bytes_in_prev, fcs_p2, last_frag_sz); - - return fcs_bytes; + return __get_unaligned_cpu32(fcs_bytes); } static u8 get_ip_proto(struct sk_buff *skb, __be16 proto) @@ -797,8 +769,9 @@ static inline void mlx5e_handle_csum(struct net_device *netdev, network_depth - ETH_HLEN, skb->csum); if (unlikely(netdev->features & NETIF_F_RXFCS)) - skb->csum = csum_add(skb->csum, -(__force __wsum)mlx5e_get_fcs(skb)); + skb->csum = csum_block_add(skb->csum, + (__force __wsum)mlx5e_get_fcs(skb), + skb->len - ETH_FCS_LEN); stats->csum_complete++; return; } -- 2.19.1.568.g152ad8e336-goog
Re: Latest net-next kernel 4.19.0+
On 10/29/2018 11:09 PM, Dimitris Michailidis wrote: > > Indeed this is a bug. I would expect it to produce frequent errors > though as many odd-length > packets would trigger it. Do you have RXFCS? Regardless, how > frequently do you see the problem? > Old kernels (before 88078d98d1bb) were simply resetting ip_summed to CHECKSUM_NONE And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug you fixed. So we now need to also fix mlx5. And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned earlier, plus __get_unaligned_cpu32() as you hinted.
[PATCH net] net/mlx4_en: add a missing include
Abdul Haleem reported a build error on ppc : drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: `struct iphdr` declared inside parameter list [enabled by default] struct iphdr *iph) ^ drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default] drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function get_fixed_ipv4_csum: drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing pointer to incomplete type __u8 ipproto = iph->protocol; ^ Fixes: 55469bc6b577 ("drivers: net: remove inclusion when not needed") Signed-off-by: Eric Dumazet Reported-by: Abdul Haleem --- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index 5a6d0919533d6e0e619927abd753c5d07ed95dac..db00bf1c23f5ad31d64652ddc8bee32e2e7534c8 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -43,6 +43,7 @@ #include #include +#include #if IS_ENABLED(CONFIG_IPV6) #include #endif -- 2.19.1.568.g152ad8e336-goog