Re: RFC: Audit Kernel Container IDs
On 2017-09-13 14:33, Carlos O'Donell wrote: > On 09/13/2017 12:13 PM, Richard Guy Briggs wrote: > > Containers are a userspace concept. The kernel knows nothing of them. > > I am looking at this RFC from a userspace perspective, particularly from > the loader's point of view and the unshare syscall and the semantics that > arise from the use of it. > > At a high level what you are doing is providing a way to group, without > hierarchy, processes and namespaces. The processes can move between > container's if they have CAP_CONTAINER_ADMIN and can open and write to > a special proc file. > > * With unshare a thread may dissociate part of its execution context and > therefore see a distinct mount namespace. When you say "process" in this > particular RFC do you exclude the fact that a thread might be in a > distinct container from the rest of the threads in the process? > > > The Linux audit system needs a way to be able to track the container > > provenance of events and actions. Audit needs the kernel's help to do > > this. > > * Why does the Linux audit system need to tracker container provenance? - ability to filter unwanted, irrelevant or unimportant messages before they fill queue so important messages don't get lost. This is a certification requirement. - ability to make security claims about containers, require tracking of actions within those containers to ensure compliance with established security policies. - ability to route messages from events to relevant audit daemon instance or host audit daemon instance or both, as required or determined by user-initiated rules > - How does it help to provide better audit messages? > > - Is it be enough to list the namespace that a process occupies? We started with that approach back more than 4 years ago and found it helped, but didn't go far enough in terms of quick and inexpensive record filtering and left some doubt about provenance of events in the case of non-user context events (incoming network packets). > * Why does it need the kernel's help? > > - Is there a race condition that is only fixable with kernel support? This was a concern, but relatively minor compared with the other benefits. > - Or is it easier with kernel help but not required? It is much easier and much less expensive. > Providing background on these questions would help clarify the > design requirements. Here are some references that should help provide some background: https://github.com/linux-audit/audit-kernel/issues/32 RFE: add namespace IDs to audit records https://github.com/linux-audit/audit-documentation/wiki/SPEC-Virtualization-Manager-Guest-Lifecycle-Events SPEC Virtualization Manager Guest Lifecycle Events https://lwn.net/Articles/699819/ Audit, namespaces, and containers https://lwn.net/Articles/723561/ Containers as kernel objects (my reply, with references: https://lkml.org/lkml/2017/8/14/15 ) https://bugzilla.redhat.com/show_bug.cgi?id=1045666 audit: add namespace IDs to log records > > Since the concept of a container is entirely a userspace concept, a > > trigger signal from the userspace container orchestration system > > initiates this. This will define a point in time and a set of resources > > associated with a particular container with an audit container ID. > > Please don't use the word 'signal', I suggest 'register' since you are > writing to a filesystem. Ok, that's a very reasonable request. 'signal' has a previous meaning. > > The trigger is a pseudo filesystem (proc, since PID tree already exists) > > write of a u64 representing the container ID to a file representing a > > process that will become the first process in a new container. > > This might place restrictions on mount namespaces required to define a > > container, or at least careful checking of namespaces in the kernel to > > verify permissions of the orchestrator so it can't change its own > > container ID. > > A bind mount of nsfs may be necessary in the container orchestrator's > > mntNS. > > > > Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo > > filesystem to have this action permitted. At that time, record the > > child container's user-supplied 64-bit container identifier along with > > What is a "child container?" Containers don't have any hierarchy. Maybe some don't, but that's not likely to last long given the abstraction and nesting of orchestration tools. This must be nestable. > I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents > your continued operation as we have today? Correct. It won't prevent processes that otherwise have permissions today from creating all the namespaces it wishes. > > the child container's first process (which may become the container's > > "init" process) process ID (referenced from the initial PID namespace), > > all namespace IDs (in the form of a nsfs device
Re: [RFC PATCH] can: m_can: Support higher speed CAN-FD bitrates
On Thursday 14 September 2017 03:28 AM, Franklin S Cooper Jr wrote: > > > On 08/18/2017 02:39 PM, Franklin S Cooper Jr wrote: >> During test transmitting using CAN-FD at high bitrates (4 Mbps) only >> resulted in errors. Scoping the signals I noticed that only a single bit >> was being transmitted and with a bit more investigation realized the actual >> MCAN IP would go back to initialization mode automatically. >> >> It appears this issue is due to the MCAN needing to use the Transmitter >> Delay Compensation Mode as defined in the MCAN User's Guide. When this >> mode is used the User's Guide indicates that the Transmitter Delay >> Compensation Offset register should be set. The document mentions that this >> register should be set to (1/dbitrate)/2*(Func Clk Freq). >> >> Additional CAN-CIA's "Bit Time Requirements for CAN FD" document indicates >> that this TDC mode is only needed for data bit rates above 2.5 Mbps. >> Therefore, only enable this mode and only set TDCO when the data bit rate >> is above 2.5 Mbps. >> >> Signed-off-by: Franklin S Cooper Jr>> --- >> I'm pretty surprised that this hasn't been implemented already since >> the primary purpose of CAN-FD is to go beyond 1 Mbps and the MCAN IP >> supports up to 10 Mbps. >> >> So it will be nice to get comments from users of this driver to understand >> if they have been able to use CAN-FD beyond 2.5 Mbps without this patch. >> If they haven't what did they do to get around it if they needed higher >> speeds. >> >> Meanwhile I plan on testing this using a more "realistic" CAN bus to insure >> everything still works at 5 Mbps which is the max speed of my CAN >> transceiver. > > ping. Anyone has any thoughts on this? I added Dong who authored the m_can driver and Wenyou who added the only in-kernel user of the driver for any help. Thanks, Sekhar >> >> drivers/net/can/m_can/m_can.c | 24 +++- >> 1 file changed, 23 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/net/can/m_can/m_can.c b/drivers/net/can/m_can/m_can.c >> index f4947a7..720e073 100644 >> --- a/drivers/net/can/m_can/m_can.c >> +++ b/drivers/net/can/m_can/m_can.c >> @@ -126,6 +126,12 @@ enum m_can_mram_cfg { >> #define DBTP_DSJW_SHIFT 0 >> #define DBTP_DSJW_MASK (0xf << DBTP_DSJW_SHIFT) >> >> +/* Transmitter Delay Compensation Register (TDCR) */ >> +#define TDCR_TDCO_SHIFT 8 >> +#define TDCR_TDCO_MASK (0x7F << TDCR_TDCO_SHIFT) >> +#define TDCR_TDCF_SHIFT 0 >> +#define TDCR_TDCF_MASK (0x7F << TDCR_TDCO_SHIFT) >> + >> /* Test Register (TEST) */ >> #define TEST_LBCK BIT(4) >> >> @@ -977,6 +983,8 @@ static int m_can_set_bittiming(struct net_device *dev) >> const struct can_bittiming *dbt = >can.data_bittiming; >> u16 brp, sjw, tseg1, tseg2; >> u32 reg_btp; >> +u32 enable_tdc = 0; >> +u32 tdco; >> >> brp = bt->brp - 1; >> sjw = bt->sjw - 1; >> @@ -991,9 +999,23 @@ static int m_can_set_bittiming(struct net_device *dev) >> sjw = dbt->sjw - 1; >> tseg1 = dbt->prop_seg + dbt->phase_seg1 - 1; >> tseg2 = dbt->phase_seg2 - 1; >> + >> +/* TDC is only needed for bitrates beyond 2.5 MBit/s >> + * Specified in the "Bit Time Requirements for CAN FD" document >> + */ >> +if (dbt->bitrate > 250) { >> +enable_tdc = DBTP_TDC; >> +/* Equation based on Bosch's M_CAN User Manual's >> + * Transmitter Delay Compensation Section >> + */ >> +tdco = priv->can.clock.freq / (dbt->bitrate * 2); >> +m_can_write(priv, M_CAN_TDCR, tdco << TDCR_TDCO_SHIFT); >> +} >> + >> reg_btp = (brp << DBTP_DBRP_SHIFT) | (sjw << DBTP_DSJW_SHIFT) | >> (tseg1 << DBTP_DTSEG1_SHIFT) | >> -(tseg2 << DBTP_DTSEG2_SHIFT); >> +(tseg2 << DBTP_DTSEG2_SHIFT) | enable_tdc; >> + >> m_can_write(priv, M_CAN_DBTP, reg_btp); >> } >> >>
[PATCH] staging: irda: Remove typedef struct
This patch remove typedef from a structure with all its ocurrences since using typedefs for structures is discouraged. Issue found using Coccinelle: @r1@ type T; @@ typedef struct { ... } T; @script:python c1@ T2; T << r1.T; @@ if T[-2:] =="_t" or T[-2:] == "_T": coccinelle.T2 = T[:-2]; else: coccinelle.T2 = T; print T, coccinelle.T2 @r2@ type r1.T; identifier c1.T2; @@ -typedef struct + T2 { ... } -T ; @r3@ type r1.T; identifier c1.T2; @@ -T +struct T2 Signed-off-by: Haneen Mohammed--- drivers/staging/irda/include/net/irda/qos.h | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/drivers/staging/irda/include/net/irda/qos.h b/drivers/staging/irda/include/net/irda/qos.h index 05a5a24..a0315b5 100644 --- a/drivers/staging/irda/include/net/irda/qos.h +++ b/drivers/staging/irda/include/net/irda/qos.h @@ -58,23 +58,23 @@ #define IR_1600 0x02 /* Quality of Service information */ -typedef struct { +struct qos_value { __u32 value; __u16 bits; /* LSB is first byte, MSB is second byte */ -} qos_value_t; +}; struct qos_info { magic_t magic; - qos_value_t baud_rate; /* IR_11520O | ... */ - qos_value_t max_turn_time; - qos_value_t data_size; - qos_value_t window_size; - qos_value_t additional_bofs; - qos_value_t min_turn_time; - qos_value_t link_disc_time; + struct qos_value baud_rate; /* IR_11520O | ... */ + struct qos_value max_turn_time; + struct qos_value data_size; + struct qos_value window_size; + struct qos_value additional_bofs; + struct qos_value min_turn_time; + struct qos_value link_disc_time; - qos_value_t power; + struct qos_value power; }; extern int sysctl_max_baud_rate; -- 2.7.4
Re: [PATCH 01/10] arch:powerpc: return -ENOMEM on failed allocation
> I think the changelog for this series of conversions > should show that you've validated the change by > inspecting the return call chain at each modified line. > > Also, it seems you've cc'd the same mailing lists for > all of the patches modified by this series. > > It would be better to individually address each patch > in the series only cc'ing the appropriate maintainers > and mailing lists. > > A cover letter would be good too. Point noted. Thanks. - Allen
Re: Regression in throughput between kvm guests over virtual bridge
On 2017年09月14日 00:59, Matthew Rosato wrote: On 09/13/2017 04:13 AM, Jason Wang wrote: On 2017年09月13日 09:16, Jason Wang wrote: On 2017年09月13日 01:56, Matthew Rosato wrote: We are seeing a regression for a subset of workloads across KVM guests over a virtual bridge between host kernel 4.12 and 4.13. Bisecting points to c67df11f "vhost_net: try batch dequing from skb array" In the regressed environment, we are running 4 kvm guests, 2 running as uperf servers and 2 running as uperf clients, all on a single host. They are connected via a virtual bridge. The uperf client profile looks like: So, 1 tcp streaming instance per client. When upgrading the host kernel from 4.12->4.13, we see about a 30% drop in throughput for this scenario. After the bisect, I further verified that reverting c67df11f on 4.13 "fixes" the throughput for this scenario. On the other hand, if we increase the load by upping the number of streaming instances to 50 (nprocs="50") or even 10, we see instead a ~10% increase in throughput when upgrading host from 4.12->4.13. So it may be the issue is specific to "light load" scenarios. I would expect some overhead for the batching, but 30% seems significant... Any thoughts on what might be happening here? Hi, thanks for the bisecting. Will try to see if I can reproduce. Various factors could have impact on stream performance. If possible, could you collect the #pkts and average packet size during the test? And if you guest version is above 4.12, could you please retry with napi_tx=true? Original runs were done with guest kernel 4.4 (from ubuntu 16.04.3 - 4.4.0-93-generic specifically). Here's a throughput report (uperf) and #pkts and average packet size (tcpstat) for one of the uperf clients: host 4.12 / guest 4.4: throughput: 29.98Gb/s #pkts=33465571 avg packet size=33755.70 host 4.13 / guest 4.4: throughput: 20.36Gb/s #pkts=21233399 avg packet size=36130.69 I test guest 4.4 on Intel machine, still can reproduce :( I ran the test again using net-next.git as guest kernel, with and without napi_tx=true. napi_tx did not seem to have any significant impact on throughput. However, the guest kernel shift from 4.4->net-next improved things. I can still see a regression between host 4.12 and 4.13, but it's more on the order of 10-15% - another sample: host 4.12 / guest net-next (without napi_tx): throughput: 28.88Gb/s #pkts=31743116 avg packet size=33779.78 host 4.13 / guest net-next (without napi_tx): throughput: 24.34Gb/s #pkts=25532724 avg packet size=35963.20 Thanks for the numbers. I originally suspect batching will lead more pkts but less size, but looks not. The less packets is also a hint that there's delay somewhere. Thanks Unfortunately, I could not reproduce it locally. I'm using net-next.git as guest. I can get ~42Gb/s on Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz for both before and after the commit. I use 1 vcpu and 1 queue, and pin vcpu and vhost threads into separate cpu on host manually (in same numa node). The environment is quite a bit different -- I'm running in an LPAR on a z13 (s390x). We've seen the issue in various configurations, the smallest thus far was a host partition w/ 40G and 20 CPUs defined (the numbers above were gathered w/ this configuration). Each guest has 4GB and 4 vcpus. No pinning / affinity configured. Unfortunately, I don't have s390x on hand. Will try to get one. Can you hit this regression constantly and what's you qemu command line Yes, the regression seems consistent. I can try tweaking some of the host and guest definitions to see if it makes a difference. Is the issue gone if you reduce VHOST_RX_BATCH to 1? And it would be also helpful to collect perf diff to see if anything interesting. (Consider 4.4 shows more obvious regression, please use 4.4). The guests are instantiated from libvirt - Here's one of the resulting qemu command lines: /usr/bin/qemu-system-s390x -name guest=mjrs34g1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-mjrs34g1/master-key.aes -machine s390-ccw-virtio-2.10,accel=kvm,usb=off,dump-guest-core=off -m 4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid 44710587-e783-4bd8-8590-55ff421431b1 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-mjrs34g1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -drive file=/dev/disk/by-id/scsi-3600507630bffc0381803,format=raw,if=none,id=drive-virtio-disk0 -device virtio-blk-ccw,scsi=off,devno=fe.0.,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:de:26:53:14:01,devno=fe.0.0001 -netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device
Re: [PATCH RFC v1 0/3] Support for tap user-space access with veth interfaces
On 2017年09月07日 08:34, sainath.gran...@intel.com wrote: From: Sainath GrandhiThis patchset adds a tap device driver for veth virtual network interface. With this implementation, tap character interface can be added only to the peer veth interface. Adding tap interface to veth is for usecases that forwards packets between host and VMs. This eliminates the need for an additional software bridge. This can be extended to create both the peer interfaces as tap interfaces. These patches are a step in that direction. Sainath Grandhi (3): net: Adding API to parse IFLA_LINKINFO attribute net: Abstracting out common routines from veth for use by vethtap vethtap: veth based tap driver drivers/net/Kconfig | 1 + drivers/net/Makefile| 2 + drivers/net/{veth.c => veth_main.c} | 80 ++--- drivers/net/vethtap.c | 216 include/linux/if_veth.h | 13 +++ include/net/rtnetlink.h | 3 + net/core/rtnetlink.c| 8 ++ 7 files changed, 308 insertions(+), 15 deletions(-) rename drivers/net/{veth.c => veth_main.c} (89%) create mode 100644 drivers/net/vethtap.c create mode 100644 include/linux/if_veth.h Interesting, plan to add vhost support for this? And we can enable zerocopy without any worries I think. Thanks
[PATCH net] tcp: update skb->skb_mstamp more carefully
From: Eric Dumazetliujian reported a problem in TCP_USER_TIMEOUT processing with a patch in tcp_probe_timer() : https://www.spinics.net/lists/netdev/msg454496.html After investigations, the root cause of the problem is that we update skb->skb_mstamp of skbs in write queue, even if the attempt to send a clone or copy of it failed. One reason being a routing problem. This patch prevents this, solving liujian issue. It also removes a potential RTT miscalculation, since __tcp_retransmit_skb() is not OR-ing TCP_SKB_CB(skb)->sacked with TCPCB_EVER_RETRANS if a failure happens, but skb->skb_mstamp has been changed. A future ACK would then lead to a very small RTT sample and min_rtt would then be lowered to this too small value. Tested: # cat user_timeout.pkt --local_ip=192.168.102.64 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 `ifconfig tun0 192.168.102.64/16; ip ro add 192.0.2.1 dev tun0` +0 < S 0:0(0) win 0 +0 > S. 0:0(0) ack 1 +.1 < . 1:1(0) ack 1 win 65530 +0 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0 +0 write(4, ..., 24) = 24 +0 > P. 1:25(24) ack 1 win 29200 +.1 < . 1:1(0) ack 25 win 65530 //change the ipaddress +1 `ifconfig tun0 192.168.0.10/16` +1 write(4, ..., 24) = 24 +1 write(4, ..., 24) = 24 +1 write(4, ..., 24) = 24 +1 write(4, ..., 24) = 24 +0 `ifconfig tun0 192.168.102.64/16` +0 < . 1:2(1) ack 25 win 65530 +0 `ifconfig tun0 192.168.0.10/16` +3 write(4, ..., 24) = -1 # ./packetdrill user_timeout.pkt Signed-off-by: Eric Dumazet Reported-by: liujian --- net/ipv4/tcp_output.c | 19 --- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 5b6690d05abb98884adfa1693f97d896dd202893..a85a8c2948e54b931f8cd956aa7938f7efd355bd 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -991,6 +991,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, struct tcp_skb_cb *tcb; struct tcp_out_options opts; unsigned int tcp_options_size, tcp_header_size; + struct sk_buff *oskb = NULL; struct tcp_md5sig_key *md5; struct tcphdr *th; int err; @@ -998,12 +999,12 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, BUG_ON(!skb || !tcp_skb_pcount(skb)); tp = tcp_sk(sk); - skb->skb_mstamp = tp->tcp_mstamp; if (clone_it) { TCP_SKB_CB(skb)->tx.in_flight = TCP_SKB_CB(skb)->end_seq - tp->snd_una; tcp_rate_skb_sent(sk, skb); + oskb = skb; if (unlikely(skb_cloned(skb))) skb = pskb_copy(skb, gfp_mask); else @@ -1011,6 +1012,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, if (unlikely(!skb)) return -ENOBUFS; } + skb->skb_mstamp = tp->tcp_mstamp; inet = inet_sk(sk); tcb = TCP_SKB_CB(skb); @@ -1122,12 +1124,14 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, err = icsk->icsk_af_ops->queue_xmit(sk, skb, >cork.fl); - if (likely(err <= 0)) - return err; - - tcp_enter_cwr(sk); + if (unlikely(err > 0)) { + tcp_enter_cwr(sk); + err = net_xmit_eval(err); + } + if (!err && oskb) + oskb->skb_mstamp = tp->tcp_mstamp; - return net_xmit_eval(err); + return err; } /* This routine just queues the buffer for sending. @@ -2869,10 +2873,11 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs) skb_headroom(skb) >= 0x)) { struct sk_buff *nskb; - skb->skb_mstamp = tp->tcp_mstamp; nskb = __pskb_copy(skb, MAX_TCP_HEADER, GFP_ATOMIC); err = nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) : -ENOBUFS; + if (!err) + skb->skb_mstamp = tp->tcp_mstamp; } else { err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC); }
[PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook
If fanout_add is preempted after running po-> fanout = match and before running __fanout_link, it will cause BUG_ON when __unregister_prot_hook call __fanout_unlink so, we need add mutex_lock(_mutex) to __unregister_prot_hook or add spin_lock(>bind_lock) before po-> fanout = match test on linux 4.1.12: ./trinity -c setsockopt -C 2 -X & BUG: failure at net/packet/af_packet.c:1414/__fanout_unlink()! Kernel panic - not syncing: BUG! CPU: 2 PID: 2271 Comm: trinity-c0 Tainted: GW O4.1.12 #1 Hardware name: Hisilicon PhosphorHi1382 FPGA (DT) Call trace: [] dump_backtrace+0x0/0xf8 [] show_stack+0x20/0x28 [] dump_stack+0xac/0xe4 [] panic+0xf8/0x268 [] __unregister_prot_hook+0xa0/0x144 [] packet_set_ring+0x280/0x5b4 [] packet_setsockopt+0x320/0x950 [] SyS_setsockopt+0xa4/0xd4 Signed-off-by: nixiaomingTested-by: wudesheng --- net/packet/af_packet.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 008a45c..0300146 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -365,10 +365,12 @@ static void __unregister_prot_hook(struct sock *sk, bool sync) po->running = 0; + mutex_lock(_mutex); if (po->fanout) __fanout_unlink(sk, po); else __dev_remove_pack(>prot_hook); + mutex_unlock(_mutex); __sock_put(sk); -- 2.11.0.1
[PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook
If fanout_add is preempted after running po-> fanout = match and before running __fanout_link, it will cause BUG_ON when __unregister_prot_hook call __fanout_unlink so, we need add mutex_lock(_mutex) to __unregister_prot_hook or add spin_lock(>bind_lock) before po-> fanout = match test on linux 4.1.42: ./trinity -c setsockopt -C 2 -X & BUG: failure at net/packet/af_packet.c:1414/__fanout_unlink()! Kernel panic - not syncing: BUG! CPU: 2 PID: 2271 Comm: trinity-c0 Tainted: GW O4.1.12 #1 Hardware name: Hisilicon PhosphorHi1382 FPGA (DT) Call trace: [] dump_backtrace+0x0/0xf8 [] show_stack+0x20/0x28 [] dump_stack+0xac/0xe4 [] panic+0xf8/0x268 [] __unregister_prot_hook+0xa0/0x144 [] packet_set_ring+0x280/0x5b4 [] packet_setsockopt+0x320/0x950 [] SyS_setsockopt+0xa4/0xd4 Signed-off-by: nixiaomingTested-by: wudesheng --- net/packet/af_packet.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 008a45c..0300146 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -365,10 +365,12 @@ static void __unregister_prot_hook(struct sock *sk, bool sync) po->running = 0; + mutex_lock(_mutex); if (po->fanout) __fanout_unlink(sk, po); else __dev_remove_pack(>prot_hook); + mutex_unlock(_mutex); __sock_put(sk); -- 2.11.0.1
RE: [Intel-wired-lan] [PATCH] igb: check memory allocation failure
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On Behalf > Of Christophe JAILLET > Sent: Monday, August 28, 2017 10:13 AM > To: Waskiewicz Jr, Peter; Kirsher, Jeffrey T > > Cc: netdev@vger.kernel.org; kernel-janit...@vger.kernel.org; intel-wired- > l...@lists.osuosl.org; linux-ker...@vger.kernel.org > Subject: Re: [Intel-wired-lan] [PATCH] igb: check memory allocation failure > > Le 28/08/2017 à 01:09, Waskiewicz Jr, Peter a écrit : > > On 8/27/17 2:42 AM, Christophe JAILLET wrote: > >> Check memory allocation failures and return -ENOMEM in such cases, as > >> already done for other memory allocations in this function. > >> > >> This avoids NULL pointers dereference. > >> > >> Signed-off-by: Christophe JAILLET > >> --- > >>drivers/net/ethernet/intel/igb/igb_main.c | 2 ++ > >>1 file changed, 2 insertions(+) > >> This seems to be fine from a "it does not break in testing" perspective, so... Tested-by: Aaron Brown > -PJ > > > Hi, > > in fact, there is no leak because the only caller of 'igb_sw_init()' > (i.e. 'igb_probe()'), already frees these resources in case of error, > see [1] > > These resources are also freed in 'igb_remove()'. > > Best reagrds, > CJ > > [1]: > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux- > next.git/tree/drivers/net/ethernet/intel/igb/igb_main.c#n2775 But is PJ's comment saying that it is not really necessary? If so I tend to lean towards the don't touch it if it's not broken perspective.
[PATCH net] udpv6: Fix the checksum computation when HW checksum does not apply
While trying an ESP transport mode encryption for UDPv6 packets of datagram size 1436 with MTU 1500, checksum error was observed in the secondary fragment. This error occurs due to the UDP payload checksum being missed out when computing the full checksum for these packets in udp6_hwcsum_outgoing(). Fixes: d39d938c8228 ("ipv6: Introduce udpv6_send_skb()") Signed-off-by: Subash Abhinov Kasiviswanathan--- net/ipv6/udp.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index e2ecfb1..40d7234 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1015,6 +1015,7 @@ static void udp6_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb, */ offset = skb_transport_offset(skb); skb->csum = skb_checksum(skb, offset, skb->len - offset, 0); + csum = skb->csum; skb->ip_summed = CHECKSUM_NONE; -- 1.9.1
Re: [RFC net-next 2/5] net/sched: Introduce Credit Based Shaper (CBS) qdisc
Hi Henrik, Henrik Austadwrites: > On Thu, Aug 31, 2017 at 06:26:22PM -0700, Vinicius Costa Gomes wrote: >> This queueing discipline implements the shaper algorithm defined by >> the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L. >> >> It's primary usage is to apply some bandwidth reservation to user >> defined traffic classes, which are mapped to different queues via the >> mqprio qdisc. >> >> Initially, it only supports offloading the traffic shaping work to >> supporting controllers. >> >> Later, when a software implementation is added, the current dependency >> on being installed "under" mqprio can be lifted. >> >> Signed-off-by: Vinicius Costa Gomes >> Signed-off-by: Jesus Sanchez-Palencia > > So, I started testing this in my VM to make sure I didn't do anything silly > and it exploded spectacularly as it used the underlying e1000 driver which > does not have a ndo_setup_tc present. > > I commented inline below, but as a tl;dr > > The cbs_init() would call into cbs_change() that would correctly detect > that ndo_setup_tc is missing and abort. However, at this stage it would try > to desroy the qdisc and in cbs_destroy() there's no such guard leading to a > > BUG: unable to handle kernel NULL pointer dereference at 0010 > > Fixing that, another NULL would be found when the code walks into > qdisc_destroy(q->qdisc). > > Apart from this, it loaded fine after some wrestling with tc :) > > Awesome! :D > >> --- >> include/linux/netdevice.h | 1 + >> net/sched/Kconfig | 11 ++ >> net/sched/Makefile| 1 + >> net/sched/sch_cbs.c | 286 >> ++ >> 4 files changed, 299 insertions(+) >> create mode 100644 net/sched/sch_cbs.c >> >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h >> index 35de8312e0b5..dd9a2ecd0c03 100644 >> --- a/include/linux/netdevice.h >> +++ b/include/linux/netdevice.h >> @@ -775,6 +775,7 @@ enum tc_setup_type { >> TC_SETUP_CLSFLOWER, >> TC_SETUP_CLSMATCHALL, >> TC_SETUP_CLSBPF, >> +TC_SETUP_CBS, >> }; >> >> /* These structures hold the attributes of xdp state that are being passed >> diff --git a/net/sched/Kconfig b/net/sched/Kconfig >> index e70ed26485a2..c03d86a7775e 100644 >> --- a/net/sched/Kconfig >> +++ b/net/sched/Kconfig >> @@ -172,6 +172,17 @@ config NET_SCH_TBF >>To compile this code as a module, choose M here: the >>module will be called sch_tbf. >> >> +config NET_SCH_CBS > > Shouldn't this depend on NET_SCH_MQPRIO as it is supposed to hook into > that? Right now, the CBS shaper only makes sense with mqprio, later it may use some other qdisc (like "taprio" mentioned in the cover letter) to hook into, so, yes, you are right, there's a dependency here, better make it clear. Will fix. > > @@ -173,6 +173,7 @@ config NET_SCH_TBF > module will be called sch_tbf. > > config NET_SCH_CBS > + depends on NET_SCH_MQPRIO > tristate "Credit Based Shaper (CBS)" > ---help--- > Say Y here if you want to use the Credit Based Shaper (CBS) packet > >> +tristate "Credit Based Shaper (CBS)" >> +---help--- >> + Say Y here if you want to use the Credit Based Shaper (CBS) packet >> + scheduling algorithm. >> + >> + See the top of for more details. >> + >> + To compile this code as a module, choose M here: the >> + module will be called sch_cbs. >> + >> config NET_SCH_GRED >> tristate "Generic Random Early Detection (GRED)" >> ---help--- >> diff --git a/net/sched/Makefile b/net/sched/Makefile >> index 7b915d226de7..80c8f92d162d 100644 >> --- a/net/sched/Makefile >> +++ b/net/sched/Makefile >> @@ -52,6 +52,7 @@ obj-$(CONFIG_NET_SCH_FQ_CODEL) += sch_fq_codel.o >> obj-$(CONFIG_NET_SCH_FQ)+= sch_fq.o >> obj-$(CONFIG_NET_SCH_HHF) += sch_hhf.o >> obj-$(CONFIG_NET_SCH_PIE) += sch_pie.o >> +obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o >> >> obj-$(CONFIG_NET_CLS_U32) += cls_u32.o >> obj-$(CONFIG_NET_CLS_ROUTE4)+= cls_route.o >> diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c >> new file mode 100644 >> index ..1c86a9e14150 >> --- /dev/null >> +++ b/net/sched/sch_cbs.c >> @@ -0,0 +1,286 @@ >> +/* >> + * net/sched/sch_cbs.c Credit Based Shaper >> + * >> + * This program is free software; you can redistribute it and/or >> + * modify it under the terms of the GNU General Public License >> + * as published by the Free Software Foundation; either version >> + * 2 of the License, or (at your option) any later version. >> + * >> + * Authors: Vininicius Costa Gomes >> + * >> + */ >> + >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> + >> +struct cbs_sched_data { >> +struct Qdisc *qdisc; /* Inner qdisc, default - pfifo
[PATCH] dt-bindings: net: renesas-ravb: Add support for R8A77995 RAVB
Add a new compatible string for the R8A77995 (R-Car D3) RAVB. Acked-by: Geert UytterhoevenSigned-off-by: Yoshihiro Shimoda --- Changes from v1: - Based on r8a77970 patch. https://marc.info/?l=linux-renesas-soc=150524655515476 - Add Acked-by (Thanks Geert-san!). Documentation/devicetree/bindings/net/renesas,ravb.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt b/Documentation/devicetree/bindings/net/renesas,ravb.txt index 2689211..c902261 100644 --- a/Documentation/devicetree/bindings/net/renesas,ravb.txt +++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt @@ -18,6 +18,7 @@ Required properties: - "renesas,etheravb-r8a7795" for the R8A7795 SoC. - "renesas,etheravb-r8a7796" for the R8A7796 SoC. - "renesas,etheravb-r8a77970" for the R8A77970 SoC. + - "renesas,etheravb-r8a77995" for the R8A77995 SoC. - "renesas,etheravb-rcar-gen3" as a fallback for the above R-Car Gen3 devices. -- 1.9.1
[PATCH] net: ipv4: fix l3slave check for index returned in IP_PKTINFO
rt_iif is only set to the actual egress device for the output path. The recent change to consider the l3slave flag when returning IP_PKTINFO works for local traffic (the correct device index is returned), but it broke the more typical use case of packets received from a remote host always returning the VRF index rather than the original ingress device. Update the fixup to consider l3slave and rt_iif actually getting set. Fixes: 1dfa76390bf05 ("net: ipv4: add check for l3slave for index returned in IP_PKTINFO") Signed-off-by: David Ahern--- net/ipv4/ip_sockglue.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c index e558e4f9597b..a599aa83fdad 100644 --- a/net/ipv4/ip_sockglue.c +++ b/net/ipv4/ip_sockglue.c @@ -1207,7 +1207,6 @@ static int do_ip_setsockopt(struct sock *sk, int level, void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb) { struct in_pktinfo *pktinfo = PKTINFO_SKB_CB(skb); - bool l3slave = ipv4_l3mdev_skb(IPCB(skb)->flags); bool prepare = (inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO) || ipv6_sk_rxinfo(sk); @@ -1221,8 +1220,13 @@ void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb) * (e.g., process binds socket to eth0 for Tx which is * redirected to loopback in the rtable/dst). */ - if (pktinfo->ipi_ifindex == LOOPBACK_IFINDEX || l3slave) + struct rtable *rt = skb_rtable(skb); + bool l3slave = ipv4_l3mdev_skb(IPCB(skb)->flags); + + if (pktinfo->ipi_ifindex == LOOPBACK_IFINDEX) pktinfo->ipi_ifindex = inet_iif(skb); + else if (l3slave && rt && rt->rt_iif) + pktinfo->ipi_ifindex = rt->rt_iif; pktinfo->ipi_spec_dst.s_addr = fib_compute_spec_dst(skb); } else { -- 2.1.4
RE: [PATCH] dt-bindings: net: renesas-ravb: Add support for R8A77995 RAVB
Hello! > From: Sergei Shtylyov > Sent: Thursday, September 14, 2017 1:02 AM > > Hello! > > On 09/13/2017 03:17 PM, Yoshihiro Shimoda wrote: > > > Add a new compatible string for the R8A77995 (R-Car D3) RAVB. > > > > Signed-off-by: Yoshihiro Shimoda> > --- > > Documentation/devicetree/bindings/net/renesas,ravb.txt | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt > b/Documentation/devicetree/bindings/net/renesas,ravb.txt > > index 1672353..ae2213f 100644 > > --- a/Documentation/devicetree/bindings/net/renesas,ravb.txt > > +++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt > > @@ -17,6 +17,7 @@ Required properties: > > > > - "renesas,etheravb-r8a7795" for the R8A7795 SoC. > > - "renesas,etheravb-r8a7796" for the R8A7796 SoC. > > + - "renesas,etheravb-r8a77995" for the R8A77995 SoC. > > That would conflict with the R8A77970 bindings patch posted by me > yesterday. Please respin atop of it. Yes, I submitted v2 patch now. > > - "renesas,etheravb-rcar-gen3" as a fallback for the above > > R-Car Gen3 devices. > > > > One more fragment is needed here, about the mandatory 'interrupt-names" > prop. My patch makes this change unnecessary, however... I understood it. Best regards, Yoshihiro Shimoda > MBR, Sergei
Re: [PATCH v2 net] sctp: potential read out of bounds in sctp_ulpevent_type_enabled()
From: Dan CarpenterDate: Thu, 14 Sep 2017 02:00:54 +0300 > This code causes a static checker warning because Smatch doesn't trust > anything that comes from skb->data. I've reviewed this code and I do > think skb->data can be controlled by the user here. > > The sctp_event_subscribe struct has 13 __u8 fields and we want to see > if ours is non-zero. sn_type can be any value in the 0-USHRT_MAX range. > We're subtracting SCTP_SN_TYPE_BASE which is 1 << 15 so we could read > either before the start of the struct or after the end. > > This is a very old bug and it's surprising that it would go undetected > for so long but my theory is that it just doesn't have a big impact so > it would be hard to notice. > > Signed-off-by: Dan Carpenter > --- > v2: Use reverse-christmas-tree local variable ordering. Applied and queued up for -stable, thanks.
Job Offer In USA
Valero Energy Oil & Gas Company USA Employment Opportunity in Texas, USA. Oil & Gas Sector ( NEW PLANT RIG ) Salary Range $7,000 USD - $11,700 Per Month Kindly send your CV/RESUME for more details regarding our Job Program, Employment offers for all Offices. CC: Forward All Documents/Certificates & Resume to: valeroenergy.recr...@gmail.com
Re: [PATCH v2] net: smsc911x: Quieten netif during suspend
On 09/13/2017 10:42 AM, Geert Uytterhoeven wrote: > If the network interface is kept running during suspend, the net core > may call net_device_ops.ndo_start_xmit() while the Ethernet device is > still suspended, which may lead to a system crash. > > E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is > driven by a PM controlled clock. If the Ethernet registers are accessed > while the clock is not running, the system will crash with an imprecise > external abort. > > As this is a race condition with a small time window, it is not so easy > to trigger at will. Using pm_test may increase your chances: > > # echo 0 > /sys/module/printk/parameters/console_suspend > # echo platform > /sys/power/pm_test > # echo mem > /sys/power/state > > To fix this, make sure the network interface is quietened during > suspend. > > Signed-off-by: Geert UytterhoevenReviewed-by: Florian Fainelli You may want to take the opportunity to suspend the PHY device (conversely resume it) if WoL is not enabled on this device. Thanks! > --- > This is v2 of the series "[PATCH 0/2] net: Fix crashes due to activity > during suspend", which degenerated into a single patch after commit > ebc8254aeae34226 ("Revert "net: phy: Correctly process PHY_HALTED in > phy_stop_machine()"") made "[PATCH 1/2] net: phy: Freeze PHY polling before > suspending devices" no longer needed. > > v2: > - Spelling s/quit/quiet/g. > > No stacktrace is provided, as the imprecise external abort is usually > reported from an innocent looking and unrelated function like > __loop_delay(), cpu_idle_poll(), or arch_timer_read_counter_long(). > --- > drivers/net/ethernet/smsc/smsc911x.c | 15 ++- > 1 file changed, 14 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/smsc/smsc911x.c > b/drivers/net/ethernet/smsc/smsc911x.c > index 0b6a39b003a4e188..012fb66eed8dd618 100644 > --- a/drivers/net/ethernet/smsc/smsc911x.c > +++ b/drivers/net/ethernet/smsc/smsc911x.c > @@ -2595,6 +2595,11 @@ static int smsc911x_suspend(struct device *dev) > struct net_device *ndev = dev_get_drvdata(dev); > struct smsc911x_data *pdata = netdev_priv(ndev); > > + if (netif_running(ndev)) { > + netif_stop_queue(ndev); > + netif_device_detach(ndev); > + } > + > /* enable wake on LAN, energy detection and the external PME >* signal. */ > smsc911x_reg_write(pdata, PMT_CTRL, > @@ -2628,7 +2633,15 @@ static int smsc911x_resume(struct device *dev) > while (!(smsc911x_reg_read(pdata, PMT_CTRL) & PMT_CTRL_READY_) && --to) > udelay(1000); > > - return (to == 0) ? -EIO : 0; > + if (to == 0) > + return -EIO; > + > + if (netif_running(ndev)) { > + netif_device_attach(ndev); > + netif_start_queue(ndev); > + } > + > + return 0; > } > > static const struct dev_pm_ops smsc911x_pm_ops = { > -- Florian
[PATCH v2 net] sctp: potential read out of bounds in sctp_ulpevent_type_enabled()
This code causes a static checker warning because Smatch doesn't trust anything that comes from skb->data. I've reviewed this code and I do think skb->data can be controlled by the user here. The sctp_event_subscribe struct has 13 __u8 fields and we want to see if ours is non-zero. sn_type can be any value in the 0-USHRT_MAX range. We're subtracting SCTP_SN_TYPE_BASE which is 1 << 15 so we could read either before the start of the struct or after the end. This is a very old bug and it's surprising that it would go undetected for so long but my theory is that it just doesn't have a big impact so it would be hard to notice. Signed-off-by: Dan Carpenter--- v2: Use reverse-christmas-tree local variable ordering. diff --git a/include/net/sctp/ulpevent.h b/include/net/sctp/ulpevent.h index 1060494ac230..b8c86ec1a8f5 100644 --- a/include/net/sctp/ulpevent.h +++ b/include/net/sctp/ulpevent.h @@ -153,8 +153,12 @@ __u16 sctp_ulpevent_get_notification_type(const struct sctp_ulpevent *event); static inline int sctp_ulpevent_type_enabled(__u16 sn_type, struct sctp_event_subscribe *mask) { + int offset = sn_type - SCTP_SN_TYPE_BASE; char *amask = (char *) mask; - return amask[sn_type - SCTP_SN_TYPE_BASE]; + + if (offset >= sizeof(struct sctp_event_subscribe)) + return 0; + return amask[offset]; } /* Given an event subscription, is this event enabled? */
Re: [PATCH] net: Convert int functions to bool
On Thu, 2017-09-14 at 01:31 +0300, Alexey Dobriyan wrote: > > Global function ipv6_rcv_saddr_equal and static functions > > ipv6_rcv_saddr_equal and ipv4_rcv_saddr_equal currently return int. > > > > bool is slightly more descriptive for these functions so change > > their return type from int to bool. > > From code generation POV "int" is better for non-inlined functions > especially on non-x86. I don't agree with you here. For instance: $ size net/ipv4/*.o* text data bss dec hex filename 9832 26 0 9858 2682 net/ipv4/inet_connection_sock.o.arm.defconfig.new 9860 26 0 9886 269e net/ipv4/inet_connection_sock.o.arm.defconfig.old
Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression
On 09/13/2017 03:44 PM, Josef Bacik wrote: > Alright thanks, this should fix it. > Still no luck with all three patches applied to fedora 4.12.8-300 RPM. Pretty sure I didn't mess up the testing but since I rarely do kernel builds it's not impossible... Thanks, Cole
Re: [PATCH] net: Convert int functions to bool
> Global function ipv6_rcv_saddr_equal and static functions > ipv6_rcv_saddr_equal and ipv4_rcv_saddr_equal currently return int. > > bool is slightly more descriptive for these functions so change > their return type from int to bool. >From code generation POV "int" is better for non-inlined functions especially on non-x86. Patch bloats even x86_64: $ ./scripts/bloat-o-meter ../vmlinux-002-do-while ../obj/vmlinux add/remove: 0/0 grow/shrink: 2/0 up/down: 3/0 (3) function old new delta inet_rcv_saddr_equal 38 40 +2 inet_csk_get_port 13001301 +1 Total: Before=6084708, After=6084711, chg +0.00%
Re: [RFC PATCH] can: m_can: Support higher speed CAN-FD bitrates
On 08/18/2017 02:39 PM, Franklin S Cooper Jr wrote: > During test transmitting using CAN-FD at high bitrates (4 Mbps) only > resulted in errors. Scoping the signals I noticed that only a single bit > was being transmitted and with a bit more investigation realized the actual > MCAN IP would go back to initialization mode automatically. > > It appears this issue is due to the MCAN needing to use the Transmitter > Delay Compensation Mode as defined in the MCAN User's Guide. When this > mode is used the User's Guide indicates that the Transmitter Delay > Compensation Offset register should be set. The document mentions that this > register should be set to (1/dbitrate)/2*(Func Clk Freq). > > Additional CAN-CIA's "Bit Time Requirements for CAN FD" document indicates > that this TDC mode is only needed for data bit rates above 2.5 Mbps. > Therefore, only enable this mode and only set TDCO when the data bit rate > is above 2.5 Mbps. > > Signed-off-by: Franklin S Cooper Jr> --- > I'm pretty surprised that this hasn't been implemented already since > the primary purpose of CAN-FD is to go beyond 1 Mbps and the MCAN IP > supports up to 10 Mbps. > > So it will be nice to get comments from users of this driver to understand > if they have been able to use CAN-FD beyond 2.5 Mbps without this patch. > If they haven't what did they do to get around it if they needed higher > speeds. > > Meanwhile I plan on testing this using a more "realistic" CAN bus to insure > everything still works at 5 Mbps which is the max speed of my CAN > transceiver. ping. Anyone has any thoughts on this? > > drivers/net/can/m_can/m_can.c | 24 +++- > 1 file changed, 23 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/can/m_can/m_can.c b/drivers/net/can/m_can/m_can.c > index f4947a7..720e073 100644 > --- a/drivers/net/can/m_can/m_can.c > +++ b/drivers/net/can/m_can/m_can.c > @@ -126,6 +126,12 @@ enum m_can_mram_cfg { > #define DBTP_DSJW_SHIFT 0 > #define DBTP_DSJW_MASK (0xf << DBTP_DSJW_SHIFT) > > +/* Transmitter Delay Compensation Register (TDCR) */ > +#define TDCR_TDCO_SHIFT 8 > +#define TDCR_TDCO_MASK (0x7F << TDCR_TDCO_SHIFT) > +#define TDCR_TDCF_SHIFT 0 > +#define TDCR_TDCF_MASK (0x7F << TDCR_TDCO_SHIFT) > + > /* Test Register (TEST) */ > #define TEST_LBCKBIT(4) > > @@ -977,6 +983,8 @@ static int m_can_set_bittiming(struct net_device *dev) > const struct can_bittiming *dbt = >can.data_bittiming; > u16 brp, sjw, tseg1, tseg2; > u32 reg_btp; > + u32 enable_tdc = 0; > + u32 tdco; > > brp = bt->brp - 1; > sjw = bt->sjw - 1; > @@ -991,9 +999,23 @@ static int m_can_set_bittiming(struct net_device *dev) > sjw = dbt->sjw - 1; > tseg1 = dbt->prop_seg + dbt->phase_seg1 - 1; > tseg2 = dbt->phase_seg2 - 1; > + > + /* TDC is only needed for bitrates beyond 2.5 MBit/s > + * Specified in the "Bit Time Requirements for CAN FD" document > + */ > + if (dbt->bitrate > 250) { > + enable_tdc = DBTP_TDC; > + /* Equation based on Bosch's M_CAN User Manual's > + * Transmitter Delay Compensation Section > + */ > + tdco = priv->can.clock.freq / (dbt->bitrate * 2); > + m_can_write(priv, M_CAN_TDCR, tdco << TDCR_TDCO_SHIFT); > + } > + > reg_btp = (brp << DBTP_DBRP_SHIFT) | (sjw << DBTP_DSJW_SHIFT) | > (tseg1 << DBTP_DTSEG1_SHIFT) | > - (tseg2 << DBTP_DTSEG2_SHIFT); > + (tseg2 << DBTP_DTSEG2_SHIFT) | enable_tdc; > + > m_can_write(priv, M_CAN_DBTP, reg_btp); > } > >
Re: [patch net] net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker
From: Jiri PirkoDate: Wed, 13 Sep 2017 22:50:06 +0200 > Wed, Sep 13, 2017 at 06:34:28PM CEST, da...@davemloft.net wrote: >>From: Jiri Pirko >>Date: Wed, 13 Sep 2017 17:32:37 +0200 >> >>> From: Jiri Pirko >>> >>> Recent commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu") removed >>> freeing in call_rcu, which changed already existing hard-to-hit >>> race condition into 100% hit: >>> >>> [ 598.599825] BUG: unable to handle kernel NULL pointer dereference at >>> 0030 >>> [ 598.607782] IP: tcf_action_destroy+0xc0/0x140 >>> >>> Or: >>> >>> [ 40.858924] BUG: unable to handle kernel NULL pointer dereference at >>> 0030 >>> [ 40.862840] IP: tcf_generic_walker+0x534/0x820 >>> >>> Fix this by storing the ops and use them directly for module_put call. >>> >>> Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common") >>> Signed-off-by: Jiri Pirko >> >>Applied, thanks Jiri. > > Oh, I forgot to mention, this would be nice to push to stable. Ok, queued up.
[PATCH] net: Convert int functions to bool
Global function ipv6_rcv_saddr_equal and static functions ipv6_rcv_saddr_equal and ipv4_rcv_saddr_equal currently return int. bool is slightly more descriptive for these functions so change their return type from int to bool. Signed-off-by: Joe Perches--- include/net/addrconf.h | 4 ++-- net/ipv4/inet_connection_sock.c | 36 ++-- 2 files changed, 20 insertions(+), 20 deletions(-) diff --git a/include/net/addrconf.h b/include/net/addrconf.h index f44ff2476758..87981cd63180 100644 --- a/include/net/addrconf.h +++ b/include/net/addrconf.h @@ -94,8 +94,8 @@ int __ipv6_get_lladdr(struct inet6_dev *idev, struct in6_addr *addr, u32 banned_flags); int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr, u32 banned_flags); -int inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2, -bool match_wildcard); +bool inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2, + bool match_wildcard); void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr); void addrconf_leave_solict(struct inet6_dev *idev, const struct in6_addr *addr); diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 4089c013cb03..054b26295638 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -39,11 +39,11 @@ EXPORT_SYMBOL(inet_csk_timer_bug_msg); * IPV6_ADDR_ANY only equals to IPV6_ADDR_ANY, * and 0.0.0.0 equals to 0.0.0.0 only */ -static int ipv6_rcv_saddr_equal(const struct in6_addr *sk1_rcv_saddr6, - const struct in6_addr *sk2_rcv_saddr6, - __be32 sk1_rcv_saddr, __be32 sk2_rcv_saddr, - bool sk1_ipv6only, bool sk2_ipv6only, - bool match_wildcard) +static bool ipv6_rcv_saddr_equal(const struct in6_addr *sk1_rcv_saddr6, +const struct in6_addr *sk2_rcv_saddr6, +__be32 sk1_rcv_saddr, __be32 sk2_rcv_saddr, +bool sk1_ipv6only, bool sk2_ipv6only, +bool match_wildcard) { int addr_type = ipv6_addr_type(sk1_rcv_saddr6); int addr_type2 = sk2_rcv_saddr6 ? ipv6_addr_type(sk2_rcv_saddr6) : IPV6_ADDR_MAPPED; @@ -52,29 +52,29 @@ static int ipv6_rcv_saddr_equal(const struct in6_addr *sk1_rcv_saddr6, if (addr_type == IPV6_ADDR_MAPPED && addr_type2 == IPV6_ADDR_MAPPED) { if (!sk2_ipv6only) { if (sk1_rcv_saddr == sk2_rcv_saddr) - return 1; + return true; if (!sk1_rcv_saddr || !sk2_rcv_saddr) return match_wildcard; } - return 0; + return false; } if (addr_type == IPV6_ADDR_ANY && addr_type2 == IPV6_ADDR_ANY) - return 1; + return true; if (addr_type2 == IPV6_ADDR_ANY && match_wildcard && !(sk2_ipv6only && addr_type == IPV6_ADDR_MAPPED)) - return 1; + return true; if (addr_type == IPV6_ADDR_ANY && match_wildcard && !(sk1_ipv6only && addr_type2 == IPV6_ADDR_MAPPED)) - return 1; + return true; if (sk2_rcv_saddr6 && ipv6_addr_equal(sk1_rcv_saddr6, sk2_rcv_saddr6)) - return 1; + return true; - return 0; + return false; } #endif @@ -82,20 +82,20 @@ static int ipv6_rcv_saddr_equal(const struct in6_addr *sk1_rcv_saddr6, * match_wildcard == false: addresses must be exactly the same, i.e. * 0.0.0.0 only equals to 0.0.0.0 */ -static int ipv4_rcv_saddr_equal(__be32 sk1_rcv_saddr, __be32 sk2_rcv_saddr, - bool sk2_ipv6only, bool match_wildcard) +static bool ipv4_rcv_saddr_equal(__be32 sk1_rcv_saddr, __be32 sk2_rcv_saddr, +bool sk2_ipv6only, bool match_wildcard) { if (!sk2_ipv6only) { if (sk1_rcv_saddr == sk2_rcv_saddr) - return 1; + return true; if (!sk1_rcv_saddr || !sk2_rcv_saddr) return match_wildcard; } - return 0; + return false; } -int inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2, -bool match_wildcard) +bool inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2, + bool match_wildcard) { #if IS_ENABLED(CONFIG_IPV6) if (sk->sk_family == AF_INET6) -- 2.10.0.rc2.1.g053435c
Re: [patch net] net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker
Wed, Sep 13, 2017 at 06:34:28PM CEST, da...@davemloft.net wrote: >From: Jiri Pirko>Date: Wed, 13 Sep 2017 17:32:37 +0200 > >> From: Jiri Pirko >> >> Recent commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu") removed >> freeing in call_rcu, which changed already existing hard-to-hit >> race condition into 100% hit: >> >> [ 598.599825] BUG: unable to handle kernel NULL pointer dereference at >> 0030 >> [ 598.607782] IP: tcf_action_destroy+0xc0/0x140 >> >> Or: >> >> [ 40.858924] BUG: unable to handle kernel NULL pointer dereference at >> 0030 >> [ 40.862840] IP: tcf_generic_walker+0x534/0x820 >> >> Fix this by storing the ops and use them directly for module_put call. >> >> Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common") >> Signed-off-by: Jiri Pirko > >Applied, thanks Jiri. Oh, I forgot to mention, this would be nice to push to stable.
Re: [PATCH] MAINTAINERS: review Renesas DT bindings as well
From: Sergei ShtylyovDate: Wed, 13 Sep 2017 22:28:53 +0300 > When adding myself as a reviewer for the Renesas Ethernet drivers > I somehow forgot about the bindings -- I want to review them as well. > > Fixes: 8e6569af3a1b ("MAINTAINERS: add myself as Renesas Ethernet drivers > reviewer") > Signed-off-by: Sergei Shtylyov Applied.
Re: [PATCH net] net_sched: gen_estimator: fix scaling error in bytes/packets samples
From: Eric DumazetDate: Wed, 13 Sep 2017 11:16:45 -0700 > From: Eric Dumazet > > Denys reported wrong rate estimations with HTB classes. > > It appears the bug was added in linux-4.10, since my tests > where using intervals of one second only. > > HTB using 4 sec default rate estimators, reported rates > were 4x higher. > > We need to properly scale the bytes/packets samples before > integrating them in EWMA. > > Tested: ... > Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate > estimators") > Signed-off-by: Eric Dumazet > Reported-by: Denys Fedoryshchenko Applied and queued up for -stable, thanks Eric.
Re: [PATCH v2] geneve: Fix setting ttl value in collect metadata mode
On Wed, Sep 13, 2017 at 4:15 AM, 严海双wrote: > > >> On 2017年9月13日, at 上午7:43, Pravin Shelar wrote: >> >> On Tue, Sep 12, 2017 at 12:05 AM, Haishuang Yan >> wrote: >>> Similar to vxlan/ipip tunnel, if key->tos is zero in collect metadata >>> mode, tos should also fallback to ip{4,6}_dst_hoplimit. >>> >>> Signed-off-by: Haishuang Yan >>> >>> --- >>> Changes since v2: >>> * Make the commit message more clearer. >>> --- >>> drivers/net/geneve.c | 6 ++ >>> 1 file changed, 2 insertions(+), 4 deletions(-) >>> >>> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c >>> index f640407..d52a65f 100644 >>> --- a/drivers/net/geneve.c >>> +++ b/drivers/net/geneve.c >>> @@ -834,11 +834,10 @@ static int geneve_xmit_skb(struct sk_buff *skb, >>> struct net_device *dev, >>>sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true); >>>if (geneve->collect_md) { >>>tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb); >>> - ttl = key->ttl; >>>} else { >>>tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb); >>> - ttl = key->ttl ? : ip4_dst_hoplimit(>dst); >>>} >>> + ttl = key->ttl ? : ip4_dst_hoplimit(>dst); >>>df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0; >>> >> This changes user API of Geneve collect-metadata mode. I do not see >> good reason for this. Why user can not set right TTL for the flow? >> > > For example, I test this case with script samples/bpf/test_tunnel_bpf.sh, > and modify samples/bpf/tcbpf2_kern.c with following patch: > But user is suppose to specify the TTL in collect-md mode for Geneve tunnel. That is the API.
Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression
> On Sep 13, 2017, at 12:46 PM, Chuck Ebbertwrote: > > On Wed, 13 Sep 2017 17:28:25 + > Josef Bacik wrote: > >> Sorry I thought I had made this other fix, can you apply this on top >> of the other one and try that? I have more things to try if this >> doesn’t work, sorry you are playing go between, but I want to make >> sure I know _which_ fix actually fixes the problem, and then clean up >> in followup patches. Thanks, >> >> Josef >> >> On 9/13/17, 8:45 AM, "Laura Abbott" wrote: >> >> On 09/12/2017 04:12 PM, Josef Bacik wrote: >>> First I’m super sorry for the top post, I’m at plumbers and I >>> forgot to upload my muttrc to my new cloud instance, so I’m screwed >>> using outlook. >>> >>> I have a completely untested, uncompiled patch that I think will >>> fix the problem, would you mind giving it a go? Thanks, >>> >>> Josef >> >> Thanks for the quick turnaround. Unfortunately, the problem is still >> reproducible according to the reporter. >> >> Thanks, >> Laura > > I am confused by the patch that originally caused this: > >if (sk->sk_family == AF_INET6) >return ipv6_rcv_saddr_equal(>sk_v6_rcv_saddr, > - >sk_v6_rcv_saddr, > + inet6_rcv_saddr(sk2), >sk->sk_rcv_saddr, >sk2->sk_rcv_saddr, > > Shouldn't the first argument also be changed to use inet6_rcv_saddr()? No we know sk is IPv6 so it's alright to use directly. Thanks, Josef
Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression
On Wed, 13 Sep 2017 17:28:25 + Josef Bacikwrote: > Sorry I thought I had made this other fix, can you apply this on top > of the other one and try that? I have more things to try if this > doesn’t work, sorry you are playing go between, but I want to make > sure I know _which_ fix actually fixes the problem, and then clean up > in followup patches. Thanks, > > Josef > > On 9/13/17, 8:45 AM, "Laura Abbott" wrote: > > On 09/12/2017 04:12 PM, Josef Bacik wrote: > > First I’m super sorry for the top post, I’m at plumbers and I > > forgot to upload my muttrc to my new cloud instance, so I’m screwed > > using outlook. > > > > I have a completely untested, uncompiled patch that I think will > > fix the problem, would you mind giving it a go? Thanks, > > > > Josef > > Thanks for the quick turnaround. Unfortunately, the problem is still > reproducible according to the reporter. > > Thanks, > Laura I am confused by the patch that originally caused this: if (sk->sk_family == AF_INET6) return ipv6_rcv_saddr_equal(>sk_v6_rcv_saddr, - >sk_v6_rcv_saddr, + inet6_rcv_saddr(sk2), sk->sk_rcv_saddr, sk2->sk_rcv_saddr, Shouldn't the first argument also be changed to use inet6_rcv_saddr()?
Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression
Alright thanks, this should fix it. Josef On 9/13/17, 12:14 PM, "Cole Robinson"wrote: On 09/13/2017 01:40 PM, Cole Robinson wrote: > On 09/13/2017 01:28 PM, Josef Bacik wrote: >> Sorry I thought I had made this other fix, can you apply this on top of the >> other one and try that? I have more things to try if this doesn’t work, >> sorry you are playing go between, but I want to make sure I know _which_ fix >> actually fixes the problem, and then clean up in followup patches. Thanks, >> > > I'm the bug reporter. I'll combine the two patches and report back > Nope, issue is still present with both patches applied. Tried my own build and a package Laura provided Thanks, Cole 0001-net-don-t-fast-patch-mismatched-sockets-in-STRICT-mo.patch Description: 0001-net-don-t-fast-patch-mismatched-sockets-in-STRICT-mo.patch
Re: RFC: Audit Kernel Container IDs
On 09/13/2017 12:13 PM, Richard Guy Briggs wrote: > Containers are a userspace concept. The kernel knows nothing of them. I am looking at this RFC from a userspace perspective, particularly from the loader's point of view and the unshare syscall and the semantics that arise from the use of it. At a high level what you are doing is providing a way to group, without hierarchy, processes and namespaces. The processes can move between container's if they have CAP_CONTAINER_ADMIN and can open and write to a special proc file. * With unshare a thread may dissociate part of its execution context and therefore see a distinct mount namespace. When you say "process" in this particular RFC do you exclude the fact that a thread might be in a distinct container from the rest of the threads in the process? > The Linux audit system needs a way to be able to track the container > provenance of events and actions. Audit needs the kernel's help to do > this. * Why does the Linux audit system need to tracker container provenance? - How does it help to provide better audit messages? - Is it be enough to list the namespace that a process occupies? * Why does it need the kernel's help? - Is there a race condition that is only fixable with kernel support? - Or is it easier with kernel help but not required? Providing background on these questions would help clarify the design requirements. > Since the concept of a container is entirely a userspace concept, a > trigger signal from the userspace container orchestration system > initiates this. This will define a point in time and a set of resources > associated with a particular container with an audit container ID. Please don't use the word 'signal', I suggest 'register' since you are writing to a filesystem. > The trigger is a pseudo filesystem (proc, since PID tree already exists) > write of a u64 representing the container ID to a file representing a > process that will become the first process in a new container. > This might place restrictions on mount namespaces required to define a > container, or at least careful checking of namespaces in the kernel to > verify permissions of the orchestrator so it can't change its own > container ID. > A bind mount of nsfs may be necessary in the container orchestrator's > mntNS. > > Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo > filesystem to have this action permitted. At that time, record the > child container's user-supplied 64-bit container identifier along with What is a "child container?" Containers don't have any hierarchy. I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents your continued operation as we have today? > the child container's first process (which may become the container's > "init" process) process ID (referenced from the initial PID namespace), > all namespace IDs (in the form of a nsfs device number and inode number > tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying > op=$action field. What kind of requirement is there on the first tid/pid registering the container ID? What if the 8th tid/pid does the registration? Would that mean that the first process of the container did not register? It seems like you are suggesting that the registration by the 8th tid/pid causes a cascading registration progress, registering all tid/pids in the same grouping? Is that true? > Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid > container ID present on an auditable action or event. > > Forked and cloned processes inherit their parent's container ID, > referenced in the process' audit_context struct. So a cloned process with CLONE_NEWNS has the came container ID as the parent process that called clone, at least until the clone has time to change to a new container ID? Do you forsee any case where someone might need a semantic that is slightly different? For example wanting to set the container ID on clone? > Log the creation of every namespace, inheriting/adding its spawning > process' containerID(s), if applicable. Include the spawning and > spawned namespace IDs (device and inode number tuples). > [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)] > Note: At this point it appears only network namespaces may need to track > container IDs apart from processes since incoming packets may cause an > auditable event before being associated with a process. OK. > Log the destruction of every namespace when it is no longer used by any > process, include the namespace IDs (device and inode number tuples). > [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)] > > Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action) > the parent and child namespace IDs for any changes to a process' > namespaces. [setns(2)] > Note: It may be possible to combine AUDIT_NS_* record formats and > distinguish them with an op=$action field depending on the fields > required for each message type.
[PATCH] MAINTAINERS: review Renesas DT bindings as well
When adding myself as a reviewer for the Renesas Ethernet drivers I somehow forgot about the bindings -- I want to review them as well. Fixes: 8e6569af3a1b ("MAINTAINERS: add myself as Renesas Ethernet drivers reviewer") Signed-off-by: Sergei Shtylyov--- The patch is against DaveM's 'net.git' repo. MAINTAINERS |2 ++ 1 file changed, 2 insertions(+) Index: net/MAINTAINERS === --- net.orig/MAINTAINERS +++ net/MAINTAINERS @@ -11379,6 +11379,8 @@ RENESAS ETHERNET DRIVERS R: Sergei Shtylyov L: netdev@vger.kernel.org L: linux-renesas-...@vger.kernel.org +F: Documentation/devicetree/bindings/net/renesas,*.txt +F: Documentation/devicetree/bindings/net/sh_eth.txt F: drivers/net/ethernet/renesas/ F: include/linux/sh_eth.h
Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression
On 09/13/2017 01:40 PM, Cole Robinson wrote: > On 09/13/2017 01:28 PM, Josef Bacik wrote: >> Sorry I thought I had made this other fix, can you apply this on top of the >> other one and try that? I have more things to try if this doesn’t work, >> sorry you are playing go between, but I want to make sure I know _which_ fix >> actually fixes the problem, and then clean up in followup patches. Thanks, >> > > I'm the bug reporter. I'll combine the two patches and report back > Nope, issue is still present with both patches applied. Tried my own build and a package Laura provided Thanks, Cole
[PATCH] ath10k: make ath10k_hw_ce_regs const
Make them const as they are not modified in the file referencing them. They are only stored in the const field 'hw_ce_reg' of an ath10k structure. Also, make the declarations in the header const. Signed-off-by: Bhumika Goyal--- drivers/net/wireless/ath/ath10k/hw.c | 4 ++-- drivers/net/wireless/ath/ath10k/hw.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/wireless/ath/ath10k/hw.c b/drivers/net/wireless/ath/ath10k/hw.c index a860691..07df7c6 100644 --- a/drivers/net/wireless/ath/ath10k/hw.c +++ b/drivers/net/wireless/ath/ath10k/hw.c @@ -310,7 +310,7 @@ .wm_high= _dst_wm_high, }; -struct ath10k_hw_ce_regs wcn3990_ce_regs = { +const struct ath10k_hw_ce_regs wcn3990_ce_regs = { .sr_base_addr = 0x, .sr_size_addr = 0x0008, .dr_base_addr = 0x000c, @@ -457,7 +457,7 @@ struct ath10k_hw_ce_regs wcn3990_ce_regs = { .wm_high= _dst_wm_high, }; -struct ath10k_hw_ce_regs qcax_ce_regs = { +const struct ath10k_hw_ce_regs qcax_ce_regs = { .sr_base_addr = 0x, .sr_size_addr = 0x0004, .dr_base_addr = 0x0008, diff --git a/drivers/net/wireless/ath/ath10k/hw.h b/drivers/net/wireless/ath/ath10k/hw.h index 0c089f6..f80840f 100644 --- a/drivers/net/wireless/ath/ath10k/hw.h +++ b/drivers/net/wireless/ath/ath10k/hw.h @@ -369,8 +369,8 @@ struct ath10k_hw_values { extern const struct ath10k_hw_values qca9888_values; extern const struct ath10k_hw_values qca4019_values; extern const struct ath10k_hw_values wcn3990_values; -extern struct ath10k_hw_ce_regs wcn3990_ce_regs; -extern struct ath10k_hw_ce_regs qcax_ce_regs; +extern const struct ath10k_hw_ce_regs wcn3990_ce_regs; +extern const struct ath10k_hw_ce_regs qcax_ce_regs; void ath10k_hw_fill_survey_time(struct ath10k *ar, struct survey_info *survey, u32 cc, u32 rcc, u32 cc_prev, u32 rcc_prev); -- 1.9.1
Re: scheduling while atomic from vmci_transport_recv_stream_cb in 3.16 kernels
> On Sep 13, 2017, at 5:19 PM, Michal Hockowrote: > > On Wed 13-09-17 15:07:26, Jorgen S. Hansen wrote: >> >>> On Sep 12, 2017, at 11:08 AM, Michal Hocko wrote: >>> >>> Hi, >>> we are seeing the following splat with Debian 3.16 stable kernel >>> >>> BUG: scheduling while atomic: MATLAB/26771/0x0100 >>> Modules linked in: veeamsnap(O) hmac cbc cts nfsv4 dns_resolver >>> rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache >>> sunrpc vmw_vso$ >>> CPU: 0 PID: 26771 Comm: MATLAB Tainted: G O 3.16.0-4-amd64 #1 >>> Debian 3.16.7-ckt20-1+deb8u3 >>> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference >>> Platform, BIOS 6.00 09/21/2015 >>> 88315c1e4c20 8150db3f 88193f803dc8 8150acdf >>> 815103a2 00012f00 8819423dbfd8 00012f00 >>> 88315c1e4c20 88193f803dc8 88193f803d50 88193f803dc0 >>> Call Trace: >>> [] ? dump_stack+0x41/0x51 >>> [] ? __schedule_bug+0x48/0x55 >>> [] ? __schedule+0x5d2/0x700 >>> [] ? schedule_timeout+0x229/0x2a0 >>> [] ? select_task_rq_fair+0x390/0x700 >>> [] ? check_preempt_wakeup+0x120/0x1d0 >>> [] ? wait_for_completion+0xa8/0x120 >>> [] ? wake_up_state+0x10/0x10 >>> [] ? call_rcu_bh+0x20/0x20 >>> [] ? wait_rcu_gp+0x4b/0x60 >>> [] ? ftrace_raw_output_rcu_utilization+0x40/0x40 >>> [] ? vmci_event_unsubscribe+0x75/0xb0 [vmw_vmci] >>> [] ? vmci_transport_destruct+0x1d/0xe0 >>> [vmw_vsock_vmci_transport] >>> [] ? vsock_sk_destruct+0x13/0x60 [vsock] >>> [] ? __sk_free+0x1a/0x130 >>> [] ? vmci_transport_recv_stream_cb+0x1e8/0x2d0 >>> [vmw_vsock_vmci_transport] >>> [] ? vmci_datagram_invoke_guest_handler+0xaa/0xd0 >>> [vmw_vmci] >>> [] ? vmci_dispatch_dgs+0xc1/0x200 [vmw_vmci] >>> [] ? tasklet_action+0xf4/0x100 >>> [] ? __do_softirq+0xf1/0x290 >>> [] ? irq_exit+0x95/0xa0 >>> [] ? do_IRQ+0x52/0xe0 >>> [] ? common_interrupt+0x6d/0x6d >>> >>> AFAICS this has been fixed by 4ef7ea9195ea ("VSOCK: sock_put wasn't safe >>> to call in interrupt context") but this patch hasn't been backported to >>> stable trees. It applies cleanly on top of 3.16 stable tree but I am not >>> familiar with the code to send the backport to the stable maintainer >>> directly. >>> >>> Could you double check that the patch below (just a blind cherry-pick) >>> is correct and it doesn't need additional patches on top? >> >> Hi, >> >> The patch below has been used to fix the above issue by other distros >> - among them Redhat for the 3.10 kernel, so it should work for 3.16 as >> well. > > Thanks for the confirmation. I do not see 4ef7ea9195ea ("VSOCK: sock_put > wasn't safe to call in interrupt context") in 3.10 stable branch > though. > >> In addition to the patch above, there are two other patches that >> need to be applied on top for the fix to be correct: >> >> 8566b86ab9f0f45bc6f7dd422b21de9d0cf5415a "VSOCK: Fix lockdep issue." >> >> and >> >> 8ab18d71de8b07d2c4d6f984b718418c09ea45c5 "VSOCK: Detach QP check should >> filter out non matching QPs." > > Good to know. I will send all three patches cherry-picked on top of the > current 3.16 stable branch. Could you have a look please? The patch series look good to me. Thanks for taking care of this, Jorgen
Re: [PATCH v5 05/10] dt-bindings: net: dwmac-sun8i: update documentation about integrated PHY
On Fri, Sep 08, 2017 at 09:43:25AM +0200, Corentin Labbe wrote: > On Fri, Sep 08, 2017 at 09:25:38AM +0200, Maxime Ripard wrote: > > On Fri, Sep 08, 2017 at 09:11:51AM +0200, Corentin Labbe wrote: > > > This patch add documentation about the MDIO switch used on sun8i-h3-emac > > > for integrated PHY. > > > > > > Signed-off-by: Corentin Labbe> > > --- > > > .../devicetree/bindings/net/dwmac-sun8i.txt| 127 > > > +++-- > > > 1 file changed, 120 insertions(+), 7 deletions(-) > > > > > > diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt > > > b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt > > > index 725f3b187886..3fa0e54825ea 100644 > > > --- a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt > > > +++ b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt > > > @@ -39,7 +39,7 @@ Optional properties for the following compatibles: > > > - allwinner,leds-active-low: EPHY LEDs are active low > > > > > > Required child node of emac: > > > -- mdio bus node: should be named mdio > > > +- mdio bus node: should be labelled mdio > > > > labels do not end up in the final DT (while the names do) so why are > > you making this change? > > > > I misunderstood label/name. > Anyway, this contrainst should leave due to "snps,dwmac-mdio MDIOs are > automatically registered" > > > > > > > Required properties of the mdio node: > > > - #address-cells: shall be 1 > > > @@ -48,14 +48,28 @@ Required properties of the mdio node: > > > The device node referenced by "phy" or "phy-handle" should be a child > > > node > > > of the mdio node. See phy.txt for the generic PHY bindings. > > > > > > -Required properties of the phy node with the following compatibles: > > > +The following compatibles require an mdio-mux node called "mdio-mux": > > > + - "allwinner,sun8i-h3-emac" > > > + - "allwinner,sun8i-v3s-emac": > > > +Required properties for the mdio-mux node: > > > + - compatible = "mdio-mux" > > > + - one child mdio for the integrated mdio > > > + - one child mdio for the external mdio if present (V3s have none) > > > +Required properties for the mdio-mux children node: > > > + - reg: 0 for internal MDIO bus, 1 for external MDIO bus > > > + > > > +The following compatibles require a PHY node representing the integrated > > > +PHY, under the integrated MDIO bus node if an mdio-mux node is used: > > >- "allwinner,sun8i-h3-emac", > > >- "allwinner,sun8i-v3s-emac": > > > + > > > +Required properties of the integrated phy node: > > > - clocks: a phandle to the reference clock for the EPHY > > > - resets: a phandle to the reset control for the EPHY > > > +- phy-is-integrated > > > +- Should be a child of the integrated mdio > > > > I'm not sure what you mean by that, you ask that it should (so not > > required?) be a child of the integrated mdio... > > > > I will change words to "must" > > > > > > > -Example: > > > - > > > +Example with integrated PHY: > > > emac: ethernet@1c0b000 { > > > compatible = "allwinner,sun8i-h3-emac"; > > > syscon = <>; > > > @@ -72,13 +86,112 @@ emac: ethernet@1c0b000 { > > > phy-handle = <_mii_phy>; > > > phy-mode = "mii"; > > > allwinner,leds-active-low; > > > + > > > + mdio0: mdio { > > > > (You don't label it mdio here, unlike what was asked before) > > > > > + #address-cells = <1>; > > > + #size-cells = <0>; > > > + compatible = "snps,dwmac-mdio"; > > > + }; > > > > I think Rob wanted that node gone? > > > > MDIO mux does not work without a parent MDIO, either gived by "parent-bus" or > directly via mdio_mux_init() (like it is the case in dwmac-sun8i) Is the MDIO controller "allwinner,sun8i-h3-emac" or "snps,dwmac-mdio"? If the latter, then I think the node is fine, but then the mux should be a child node of it. IOW, the child of an MDIO controller should either be a mux node or slave devices. > > > > + mdio-mux { > > > + compatible = "mdio-mux"; > > > + #address-cells = <1>; > > > + #size-cells = <0>; > > > + > > > + int_mdio: mdio@1 { > > > + reg = <0>; unit address of 1 and reg prop of 0 don't match. Build your dtb with W=2. > > > + #address-cells = <1>; > > > + #size-cells = <0>; > > > + int_mii_phy: ethernet-phy@1 { > > > + reg = <1>; > > > + clocks = < CLK_BUS_EPHY>; > > > + resets = < RST_BUS_EPHY>; > > > + phy-is-integrated Missing ; > > > + }; > > > + }; > > > > ... And in your example it's a child of the mdio mux? > > > > So I confirm, integrated PHY must be a child of integrated MDIO (that must be > a child of mdio-mux). > The example is good. > > > > + ext_mdio: mdio@0 { > > > + reg = <1>; Another unit address mismatch. Rob
[PATCH net] net_sched: gen_estimator: fix scaling error in bytes/packets samples
From: Eric DumazetDenys reported wrong rate estimations with HTB classes. It appears the bug was added in linux-4.10, since my tests where using intervals of one second only. HTB using 4 sec default rate estimators, reported rates were 4x higher. We need to properly scale the bytes/packets samples before integrating them in EWMA. Tested: echo 1 >/sys/module/sch_htb/parameters/htb_rate_est Setup HTB with one class with a rate/cail of 5Gbit Generate traffic on this class tc -s -d cl sh dev eth0 classid 7002:11 class htb 7002:11 parent 7002:1 prio 5 quantum 20 rate 5Gbit ceil 5Gbit linklayer ethernet burst 8b/1 mpu 0b cburst 8b/1 mpu 0b level 0 rate_handle 1 Sent 1488215421648 bytes 982969243 pkt (dropped 0, overlimits 0 requeues 0) rate 5Gbit 412814pps backlog 136260b 2p requeues 0 TCP pkts/rtx 982969327/45 bytes 1488215557414/68130 lended: 22732826 borrowed: 0 giants: 0 tokens: -1684 ctokens: -1684 Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate estimators") Signed-off-by: Eric Dumazet Reported-by: Denys Fedoryshchenko --- net/core/gen_estimator.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c index 0385dece1f6fe5e26df1ce5f40956a79a2eebbf4..7c1ffd6f950172c1915d8e5fa2b5e3f77e4f4c78 100644 --- a/net/core/gen_estimator.c +++ b/net/core/gen_estimator.c @@ -83,10 +83,10 @@ static void est_timer(unsigned long arg) u64 rate, brate; est_fetch_counters(est, ); - brate = (b.bytes - est->last_bytes) << (8 - est->ewma_log); + brate = (b.bytes - est->last_bytes) << (10 - est->ewma_log - est->intvl_log); brate -= (est->avbps >> est->ewma_log); - rate = (u64)(b.packets - est->last_packets) << (8 - est->ewma_log); + rate = (u64)(b.packets - est->last_packets) << (10 - est->ewma_log - est->intvl_log); rate -= (est->avpps >> est->ewma_log); write_seqcount_begin(>seq);
Re: [PATCH v5 02/10] dt-bindings: net: Restore sun8i dwmac binding
On Fri, Sep 08, 2017 at 09:11:48AM +0200, Corentin Labbe wrote: > This patch restore dt-bindings documentation about dwmac-sun8i > This reverts commit 8aa33ec2f481 ("dt-bindings: net: Revert sun8i dwmac > binding") Why? > > Signed-off-by: Corentin Labbe> --- > .../devicetree/bindings/net/dwmac-sun8i.txt| 84 > ++ > 1 file changed, 84 insertions(+) > create mode 100644 Documentation/devicetree/bindings/net/dwmac-sun8i.txt Otherwise, Acked-by: Rob Herring
Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
On Wed, 2017-09-13 at 10:58 -0700, Eric Dumazet wrote: > On Wed, 2017-09-13 at 20:35 +0300, Denys Fedoryshchenko wrote: > > > Overlimits never appear in HTB as i know, here is simulation on this > > class that have constant "at least" 1G traffic, i throttled it to 1Kbit > > to simulate forced drops: > > > > shapernew ~ # sh /etc/shaper.cfg;sleep 1;tc -s -d class show dev > > eth3.777 classid 1:111;tc qdisc del dev eth3.777 root > > class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 1Kbit > > ceil 1Kbit linklayer ethernet burst 31280b/1 mpu 0b cburst 31280b/1 mpu > > 0b level 0 > > Sent 134350019 bytes 117520 pkt (dropped 7819, overlimits 0 requeues 0) > > backlog 7902126b 4976p requeues 0 > > lended: 86694 borrowed: 0 giants: 0 > > tokens: -93750 ctokens: -93750 > > > > Oh right, I am using a local patch for this. Time to upstream :/ Please try : diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c index 7e148376ba528efabe5a53a09653f9161c264be7..c6d7ae81b41f4e277afb93a3003fefcd3f27de35 100644 --- a/net/sched/sch_htb.c +++ b/net/sched/sch_htb.c @@ -142,6 +142,7 @@ struct htb_class { struct rb_node node[TC_HTB_NUMPRIO]; /* node for self or feed tree */ unsigned int drops cacheline_aligned_in_smp; + unsigned intoverlimits; }; struct htb_level { @@ -533,6 +534,9 @@ htb_change_class_mode(struct htb_sched *q, struct htb_class *cl, s64 *diff) if (new_mode == cl->cmode) return; + if (new_mode == HTB_CANT_SEND) + cl->overlimits++; + if (cl->prio_activity) {/* not necessary: speed optimization */ if (cl->cmode != HTB_CANT_SEND) htb_deactivate_prios(q, cl); @@ -1143,6 +1147,7 @@ htb_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d) struct htb_class *cl = (struct htb_class *)arg; struct gnet_stats_queue qs = { .drops = cl->drops, + .overlimits = cl->overlimits, }; __u32 qlen = 0;
Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
On Wed, 2017-09-13 at 20:35 +0300, Denys Fedoryshchenko wrote: > Overlimits never appear in HTB as i know, here is simulation on this > class that have constant "at least" 1G traffic, i throttled it to 1Kbit > to simulate forced drops: > > shapernew ~ # sh /etc/shaper.cfg;sleep 1;tc -s -d class show dev > eth3.777 classid 1:111;tc qdisc del dev eth3.777 root > class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 1Kbit > ceil 1Kbit linklayer ethernet burst 31280b/1 mpu 0b cburst 31280b/1 mpu > 0b level 0 > Sent 134350019 bytes 117520 pkt (dropped 7819, overlimits 0 requeues 0) > backlog 7902126b 4976p requeues 0 > lended: 86694 borrowed: 0 giants: 0 > tokens: -93750 ctokens: -93750 > Oh right, I am using a local patch for this. Time to upstream :/
Re: Memory leaks in conntrack
On Wed, Sep 13, 2017 at 9:45 AM, Cong Wangwrote: > On Wed, Sep 13, 2017 at 1:05 AM, Florian Westphal wrote: >> Cong Wang wrote: >>> While testing my TC filter patches (so not related to conntrack), the >>> following memory leaks are shown up: >>> >>> unreferenced object 0x9b19ba551228 (size 128): >>> comm "chronyd", pid 338, jiffies 4294910829 (age 53.188s) >>> hex dump (first 32 bytes): >>> 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b >>> 00 00 00 00 18 00 00 30 00 00 00 00 00 00 00 00 ...0 >>> backtrace: >>> [] create_object+0x169/0x2aa >>> [] kmemleak_alloc+0x25/0x41 >>> [] slab_post_alloc_hook+0x44/0x65 >>> [] __kmalloc_track_caller+0x113/0x146 >>> [] __krealloc+0x4a/0x69 >>> [] nf_ct_ext_add+0xe1/0x145 >>> [] init_conntrack+0x1f7/0x36e >>> [] nf_conntrack_in+0x1d3/0x326 >>> [] ipv4_conntrack_local+0x4d/0x50 >>> [] nf_hook_slow+0x3c/0x9b >>> [] nf_hook.constprop.40+0xbe/0xd8 >>> [] __ip_local_out+0xb3/0xbf >>> [] ip_local_out+0x1c/0x36 >>> [] ip_send_skb+0x19/0x3d >>> [] udp_send_skb+0x17e/0x1df >>> [] udp_sendmsg+0x5a2/0x77c >>> unreferenced object 0x9b19a69b3340 (size 336): >>> comm "chronyd", pid 338, jiffies 4294910868 (age 53.032s) >>> hex dump (first 32 bytes): >>> 01 00 00 00 5a 5a 5a 5a 00 00 00 00 ad 4e ad de .N.. >>> ff ff ff ff 5a 5a 5a 5a ff ff ff ff ff ff ff ff >>> backtrace: >>> [] create_object+0x169/0x2aa >>> [] kmemleak_alloc+0x25/0x41 >>> [] slab_post_alloc_hook+0x44/0x65 >>> [] kmem_cache_alloc+0xd7/0x1f1 >>> [] __nf_conntrack_alloc+0xa2/0x146 >>> [] init_conntrack+0xb2/0x36e >>> [] nf_conntrack_in+0x1d3/0x326 >>> [] ipv4_conntrack_local+0x4d/0x50 >>> [] nf_hook_slow+0x3c/0x9b >>> [] nf_hook.constprop.40+0xbe/0xd8 >>> [] __ip_local_out+0xb3/0xbf >>> [] ip_local_out+0x1c/0x36 >>> [] ip_send_skb+0x19/0x3d >>> [] udp_send_skb+0x17e/0x1df >>> [] udp_sendmsg+0x5a2/0x77c >>> [] inet_sendmsg+0x37/0x5e >>> >>> I don't touch chronyd in my VM, so I have no idea why it sends out UDP >>> packets, my guess is it is some periodical packet. >>> >>> I don't think I use conntrack either, since /proc/net/ip_conntrack >>> does not exist. >> >> You probably do, can you try "cat /proc/net/nf_conntrack" instead? >> >> (otherwise there should be no ipv4_conntrack_local() invocation >> since we would not register this hook at all). > > Yeah it is very weird but it is true: > > [root@localhost ~]# echo scan > /sys/kernel/debug/kmemleak > [ 133.450823] kmemleak: 18 new suspected memory leaks (see > /sys/kernel/debug/kmemleak) > [root@localhost ~]# cat /proc/net/ip_conntrack > cat: /proc/net/ip_conntrack: No such file or directory > [root@localhost ~]# cat /sys/kernel/debug/kmemleak > unreferenced object 0x95c1e0b24040 (size 336): > ... Oops, you mean nf_conntrack... Here we go: [root@localhost ~]# cat /proc/net/nf_conntrack ipv4 2 udp 17 116 src=192.168.124.6 dst=204.2.134.162 sport=123 dport=123 src=204.2.134.162 dst=192.168.124.6 sport=123 dport=123 [ASSURED] mark=0 zone=0 use=2 ipv4 2 udp 17 117 src=192.168.124.6 dst=45.79.187.10 sport=123 dport=123 src=45.79.187.10 dst=192.168.124.6 sport=123 dport=123 [ASSURED] mark=0 zone=0 use=2 ipv4 2 udp 17 110 src=192.168.124.6 dst=192.168.124.1 sport=35486 dport=53 src=192.168.124.1 dst=192.168.124.6 sport=53 dport=35486 [ASSURED] mark=0 zone=0 use=2 ipv4 2 udp 17 110 src=192.168.124.6 dst=192.168.124.1 sport=52373 dport=53 src=192.168.124.1 dst=192.168.124.6 sport=53 dport=52373 [ASSURED] mark=0 zone=0 use=2 ipv4 2 unknown 2 518 src=192.168.124.6 dst=224.0.0.22 [UNREPLIED] src=224.0.0.22 dst=192.168.124.6 mark=0 zone=0 use=2 ipv4 2 udp 17 110 src=192.168.124.6 dst=192.168.124.1 sport=43242 dport=53 src=192.168.124.1 dst=192.168.124.6 sport=53 dport=43242 [ASSURED] mark=0 zone=0 use=2 ipv4 2 udp 17 116 src=192.168.124.6 dst=96.226.123.196 sport=123 dport=123 src=96.226.123.196 dst=192.168.124.6 sport=123 dport=123 [ASSURED] mark=0 zone=0 use=2 ipv4 2 udp 17 110 src=192.168.124.6 dst=192.168.124.1 sport=42838 dport=53 src=192.168.124.1 dst=192.168.124.6 sport=53 dport=42838 [ASSURED] mark=0 zone=0 use=2 ipv4 2 udp 17 117 src=192.168.124.6 dst=97.127.104.4 sport=123 dport=123 src=97.127.104.4 dst=192.168.124.6 sport=123 dport=123 [ASSURED] mark=0 zone=0 use=2
Re: [PATCH/RFC net-next] ravb: RX checksum offload
Hello! On 09/12/2017 04:04 PM, Simon Horman wrote: Add support for RX checksum offload. This is enabled by default and may be disabled and re-enabled using ethtool: # ethtool -K eth0 rx off # ethtool -K eth0 rx on The RAVB provides a simple checksumming scheme which appears to be completely compatible with CHECKSUM_COMPLETE: a 1's complement sum of Hm, the gen2/3 manuals say calculation doesn't involve bit inversion... all packet data after the L2 header is appended to packet data; this may be trivially read by the driver and used to update the skb accordingly. > In terms of performance throughput is close to gigabit line-rate both with and without RX checksum offload enabled. Perf output, however, appears to indicate that significantly less time is spent in do_csum(). This is as expected. [...] By inspection this also appears to be compatible with the ravb found on R-Car Gen 2 SoCs, however, this patch is currently untested on such hardware. I probably won't be able to test it on gen2 too... Signed-off-by: Simon HormanI'm generally OK with the patch but have some questions/comments below... --- drivers/net/ethernet/renesas/ravb_main.c | 58 +++- 1 file changed, 57 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c index fdf30bfa403b..7c6438cd7de7 100644 --- a/drivers/net/ethernet/renesas/ravb_main.c +++ b/drivers/net/ethernet/renesas/ravb_main.c [...] @@ -1842,6 +1859,41 @@ static int ravb_do_ioctl(struct net_device *ndev, struct ifreq *req, int cmd) return phy_mii_ioctl(phydev, req, cmd); } +static void ravb_set_rx_csum(struct net_device *ndev, bool enable) +{ + struct ravb_private *priv = netdev_priv(ndev); + unsigned long flags; + + spin_lock_irqsave(>lock, flags); + + /* Disable TX and RX */ + ravb_rcv_snd_disable(ndev); + + /* Modify RX Checksum setting */ + if (enable) + ravb_modify(ndev, ECMR, 0, ECMR_RCSC); Please use ECMR_RCSC as the 3rd argument too to conform the common driver style. + else + ravb_modify(ndev, ECMR, ECMR_RCSC, 0); This *if* can easily be folded into a single ravb_modify() call... [...] @@ -2004,6 +2057,9 @@ static int ravb_probe(struct platform_device *pdev) if (!ndev) return -ENOMEM; + ndev->features |= NETIF_F_RXCSUM; + ndev->hw_features |= ndev->features; Hum, both fields are 0 before this? Then why not use '=' instead of '|='? Even if not, why not just use the same value as both the rvalues? [...] MBR, Sergei
[PATCH v2] net: smsc911x: Quieten netif during suspend
If the network interface is kept running during suspend, the net core may call net_device_ops.ndo_start_xmit() while the Ethernet device is still suspended, which may lead to a system crash. E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is driven by a PM controlled clock. If the Ethernet registers are accessed while the clock is not running, the system will crash with an imprecise external abort. As this is a race condition with a small time window, it is not so easy to trigger at will. Using pm_test may increase your chances: # echo 0 > /sys/module/printk/parameters/console_suspend # echo platform > /sys/power/pm_test # echo mem > /sys/power/state To fix this, make sure the network interface is quietened during suspend. Signed-off-by: Geert Uytterhoeven--- This is v2 of the series "[PATCH 0/2] net: Fix crashes due to activity during suspend", which degenerated into a single patch after commit ebc8254aeae34226 ("Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"") made "[PATCH 1/2] net: phy: Freeze PHY polling before suspending devices" no longer needed. v2: - Spelling s/quit/quiet/g. No stacktrace is provided, as the imprecise external abort is usually reported from an innocent looking and unrelated function like __loop_delay(), cpu_idle_poll(), or arch_timer_read_counter_long(). --- drivers/net/ethernet/smsc/smsc911x.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/smsc/smsc911x.c b/drivers/net/ethernet/smsc/smsc911x.c index 0b6a39b003a4e188..012fb66eed8dd618 100644 --- a/drivers/net/ethernet/smsc/smsc911x.c +++ b/drivers/net/ethernet/smsc/smsc911x.c @@ -2595,6 +2595,11 @@ static int smsc911x_suspend(struct device *dev) struct net_device *ndev = dev_get_drvdata(dev); struct smsc911x_data *pdata = netdev_priv(ndev); + if (netif_running(ndev)) { + netif_stop_queue(ndev); + netif_device_detach(ndev); + } + /* enable wake on LAN, energy detection and the external PME * signal. */ smsc911x_reg_write(pdata, PMT_CTRL, @@ -2628,7 +2633,15 @@ static int smsc911x_resume(struct device *dev) while (!(smsc911x_reg_read(pdata, PMT_CTRL) & PMT_CTRL_READY_) && --to) udelay(1000); - return (to == 0) ? -EIO : 0; + if (to == 0) + return -EIO; + + if (netif_running(ndev)) { + netif_device_attach(ndev); + netif_start_queue(ndev); + } + + return 0; } static const struct dev_pm_ops smsc911x_pm_ops = { -- 2.7.4
Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression
On 09/13/2017 01:28 PM, Josef Bacik wrote: > Sorry I thought I had made this other fix, can you apply this on top of the > other one and try that? I have more things to try if this doesn’t work, > sorry you are playing go between, but I want to make sure I know _which_ fix > actually fixes the problem, and then clean up in followup patches. Thanks, > I'm the bug reporter. I'll combine the two patches and report back Thanks, Cole
Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
On 2017-09-13 20:20, Eric Dumazet wrote: On Wed, 2017-09-13 at 20:12 +0300, Denys Fedoryshchenko wrote: For my case, as load increased now, i am hitting same issue (i tried to play with quantum / bursts as well, didnt helped): tc -s -d class show dev eth3.777 classid 1:111;sleep 5;tc -s -d class show dev eth3.777 classid 1:111 class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 20Gbit ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 10b/1 mpu 0b level 0 Sent 864151559 bytes 730566 pkt (dropped 15111, overlimits 0 requeues 0) backlog 73968000b 39934p requeues 0 lended: 499867 borrowed: 0 giants: 0 tokens: 608 ctokens: 121 You have drops (and ~40,000 packets in backlog) class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 20Gbit ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 10b/1 mpu 0b level 0 Sent 1469352160 bytes 1243649 pkt (dropped 42933, overlimits 0 requeues 0) backlog 82536047b 39963p requeues 0 lended: 810475 borrowed: 0 giants: 0 tokens: 612 ctokens: 122 (1469352160-864151559)/5*8 968320961.6000 Less than 1Gbit and it's being throttled It is not : "overlimits 0" means this class was not throttled. Overlimits never appear in HTB as i know, here is simulation on this class that have constant "at least" 1G traffic, i throttled it to 1Kbit to simulate forced drops: shapernew ~ # sh /etc/shaper.cfg;sleep 1;tc -s -d class show dev eth3.777 classid 1:111;tc qdisc del dev eth3.777 root class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 1Kbit ceil 1Kbit linklayer ethernet burst 31280b/1 mpu 0b cburst 31280b/1 mpu 0b level 0 Sent 134350019 bytes 117520 pkt (dropped 7819, overlimits 0 requeues 0) backlog 7902126b 4976p requeues 0 lended: 86694 borrowed: 0 giants: 0 tokens: -93750 ctokens: -93750
Re: [PATCH 0/2] net: Fix crashes due to activity during suspend
Hi Florian, On Thu, Sep 7, 2017 at 3:09 PM, Florian Fainelliwrote: > On 08/23/2017 10:13 AM, Florian Fainelli wrote: >> On 08/23/2017 04:45 AM, Geert Uytterhoeven wrote: >>> On Tue, Aug 22, 2017 at 8:49 PM, Florian Fainelli >>> wrote: On 08/22/2017 11:37 AM, Geert Uytterhoeven wrote: > If an Ethernet device is used while the device is suspended, the system > may > crash. > > E.g. on sh73a0/kzm9g and r8a73a4/ape6evm, the external Ethernet chip is > driven by a PM controlled clock. If the Ethernet registers are accessed > while the clock is not running, the system will crash with an imprecise > external abort. > > This patch series fixes two of such crashes: > 1. The first patch prevents the PHY polling state machine from accessing > PHY registers while a device is suspended, > 2. The second patch prevents the net core from trying to transmit > packets > when an smsc911x device is suspended. > > Both crashes can be reproduced on sh73a0/kzm9g and r8a73a4/ape6evm during > s2ram (rarely), or by using pm_test (more likely to trigger): > > # echo 0 > /sys/module/printk/parameters/console_suspend > # echo platform > /sys/power/pm_test > # echo mem > /sys/power/state > > With this series applied, my test systems survive a loop of 100 test > suspends. It seems to me like part, if not the entire problem is that smsc91xx's suspend and resume functions are way too simplistic and absolutely do not manage the PHY during suspend/resume, the PHY state machine is not even stopped, so of course, this will cause bus errors if you access those registers. You are addressing this as part of patch 2, but this seems to me like this is still a bit incomplete and you'd need at least phy_stop() and/or phy_suspend() (does a power down of the PHY) and phy_start() and/or phy_resume() calls to complete the PHY state machine shutdown during suspend. Have you tried that? >>> >>> Unfortunately that doesn't help. >>> In state PHY_HALTED, the PHY state machine still calls the .adjust_link() >>> callback while the device is suspended. >> >> Humm that is correct yes. >> >>> Do you have a clue? This is too far beyond my phy-foo... >> >> I was initially contemplating a revert of >> 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy: Correctly process >> PHY_HALTED in phy_stop_machine()") but this is not the root of the >> problem. The problem really is that phy_stop() does not wait for the PHY >> state machine to be stopped so you cannot rely on that and past the >> function return be offered any guarantees that adjust_link is not called. >> >> We seem to be getting away with that in most drivers because when we see >> phydev->link = 0, we either do nothing or actually turn of the HW block. >> >> How about we export phy_stop_machine() to drivers which would provide a >> synchronization point that would ensure that no HW accesses are done >> past this point? >> >> I am absolutely not clear on the implications of using a freezable >> workqueue with respect to the PHY state machine and how devices are >> going to wind-up being powered down or not... > > Geert, as you may have notice a revert of the change was sent so 4.13 > should be fine, but ultimately I would like to put the non-reverted code > back in after we add a few safeguards: With the revert, I no longer need "[PATCH 1/2] net: phy: Freeze PHY polling before suspending devices". I just did more than 50 successful suspend/resume cycles to verify that. I still need "[PATCH 2/2] net: smsc911x: Quiten netif during suspend", so I'll submit a v2 for that. > - and you reported the bus errors on smsc911x when we call adjust_link > during suspend, and due to a lack of hard synchronization so phy_stop() > here does not give you enough guarantees to let you turn off power to > the smsc911x block > > If that seems accurate then we can work on something that should be > working again (famous last words). Sounds accurate to me. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression
Sorry I thought I had made this other fix, can you apply this on top of the other one and try that? I have more things to try if this doesn’t work, sorry you are playing go between, but I want to make sure I know _which_ fix actually fixes the problem, and then clean up in followup patches. Thanks, Josef On 9/13/17, 8:45 AM, "Laura Abbott"wrote: On 09/12/2017 04:12 PM, Josef Bacik wrote: > First I’m super sorry for the top post, I’m at plumbers and I forgot to > upload my muttrc to my new cloud instance, so I’m screwed using outlook. > > I have a completely untested, uncompiled patch that I think will fix the > problem, would you mind giving it a go? Thanks, > > Josef Thanks for the quick turnaround. Unfortunately, the problem is still reproducible according to the reporter. Thanks, Laura 0001-net-use-inet6_rcv_saddr-to-compare-sockets.patch Description: 0001-net-use-inet6_rcv_saddr-to-compare-sockets.patch
Re: [PATCH net] perf/bpf: fix a clang compilation issue
great, thanks! On Mon, Sep 11, 2017 at 2:28 PM, David Millerwrote: > From: Yonghong Song > Date: Thu, 7 Sep 2017 18:36:15 -0700 > >> clang does not support variable length array for structure member. >> It has the following error during compilation: >> >> kernel/trace/trace_syscalls.c:568:17: error: fields must have a constant >> size: >> 'variable length array in structure' extension will never be supported >> unsigned long args[sys_data->nb_args]; >> ^ >> >> The fix is to use a fixed array length instead. >> >> Reported-by: Nick Desaulniers >> Signed-off-by: Yonghong Song > > Applied. -- Thanks, ~Nick Desaulniers
Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
On Wed, 2017-09-13 at 20:12 +0300, Denys Fedoryshchenko wrote: > For my case, as load increased now, i am hitting same issue (i tried to > play with quantum / bursts as well, didnt helped): > > tc -s -d class show dev eth3.777 classid 1:111;sleep 5;tc -s -d class > show dev eth3.777 classid 1:111 > class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 20Gbit > ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 10b/1 > mpu 0b level 0 > Sent 864151559 bytes 730566 pkt (dropped 15111, overlimits 0 requeues > 0) > backlog 73968000b 39934p requeues 0 > lended: 499867 borrowed: 0 giants: 0 > tokens: 608 ctokens: 121 > You have drops (and ~40,000 packets in backlog) > class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 20Gbit > ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 10b/1 > mpu 0b level 0 > Sent 1469352160 bytes 1243649 pkt (dropped 42933, overlimits 0 requeues > 0) > backlog 82536047b 39963p requeues 0 > lended: 810475 borrowed: 0 giants: 0 > tokens: 612 ctokens: 122 > > (1469352160-864151559)/5*8 > 968320961.6000 > Less than 1Gbit and it's being throttled It is not : "overlimits 0" means this class was not throttled.
[PATCH net v2 1/3] nfp: add whitelist of supported flow dissector
From: Pieter Jansen van VuurenPreviously we did not check the flow dissector against a list of allowed and supported flow key dissectors. This patch introduces such a list and correctly rejects unsupported flow keys. Fixes: 43f84b72c50d ("nfp: add metadata to each flow offload") Signed-off-by: Pieter Jansen van Vuuren Reviewed-by: Simon Horman Reviewed-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/flower/offload.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c index d396183108f7..a18b4d2b1d3e 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c @@ -44,6 +44,16 @@ #include "../nfp_net.h" #include "../nfp_port.h" +#define NFP_FLOWER_WHITELIST_DISSECTOR \ + (BIT(FLOW_DISSECTOR_KEY_CONTROL) | \ +BIT(FLOW_DISSECTOR_KEY_BASIC) | \ +BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) | \ +BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) | \ +BIT(FLOW_DISSECTOR_KEY_PORTS) | \ +BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) | \ +BIT(FLOW_DISSECTOR_KEY_VLAN) | \ +BIT(FLOW_DISSECTOR_KEY_IP)) + static int nfp_flower_xmit_flow(struct net_device *netdev, struct nfp_fl_payload *nfp_flow, u8 mtype) @@ -112,6 +122,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls, u8 key_layer; int key_size; + if (flow->dissector->used_keys & ~NFP_FLOWER_WHITELIST_DISSECTOR) + return -EOPNOTSUPP; + if (dissector_uses_key(flow->dissector, FLOW_DISSECTOR_KEY_ENC_CONTROL)) { struct flow_dissector_key_control *mask_enc_ctl = -- 2.14.1
[PATCH net v2 0/3] nfp: wait more carefully for card init
Hi! The first patch is a small fix for flower offload, we need a whitelist of supported matches, otherwise the unsupported ones will be ignored. The second and the third patch are adding wait/polling to the probe path. We had reports of driver failing probe because it couldn't find the control process (NSP) on the card. Turns out the NSP will only announce its existence after it's fully initialized. Until now we assumed it will be reachable, just not processing commands (hence we wait for a NOOP command to execute successfully). v2: - fix a bad merge which resulted in a build warning and retest. Jakub Kicinski (2): nfp: wait for board state before talking to the NSP nfp: wait for the NSP resource to appear on boot Pieter Jansen van Vuuren (1): nfp: add whitelist of supported flow dissector .../net/ethernet/netronome/nfp/flower/offload.c| 13 ++ drivers/net/ethernet/netronome/nfp/nfp_main.c | 47 ++ drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 23 --- drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h | 2 + .../ethernet/netronome/nfp/nfpcore/nfp_resource.c | 45 + 5 files changed, 107 insertions(+), 23 deletions(-) -- 2.14.1
[PATCH net v2 3/3] nfp: wait for the NSP resource to appear on boot
The control process (NSP) may take some time to complete its initialization. This is not a problem on most servers, but on very fast-booting machines it may not be ready for operation when driver probes the device. There is also a version of the flash in the wild where NSP tries to train the links as part of init. To wait for NSP initialization we should make sure its resource has already been added to the resource table. NSP adds itself there as last step of init. Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/nfp_main.c | 4 ++ drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h | 2 + .../ethernet/netronome/nfp/nfpcore/nfp_resource.c | 45 ++ 3 files changed, 51 insertions(+) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c index 424707d41fbd..f8fa63b66739 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c @@ -351,6 +351,10 @@ static int nfp_nsp_init(struct pci_dev *pdev, struct nfp_pf *pf) struct nfp_nsp *nsp; int err; + err = nfp_resource_wait(pf->cpp, NFP_RESOURCE_NSP, 30); + if (err) + return err; + nsp = nfp_nsp_open(pf->cpp); if (IS_ERR(nsp)) { err = PTR_ERR(nsp); diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h index 1a8d04a1e113..3ce51f03126f 100644 --- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h +++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h @@ -97,6 +97,8 @@ nfp_resource_acquire(struct nfp_cpp *cpp, const char *name); void nfp_resource_release(struct nfp_resource *res); +int nfp_resource_wait(struct nfp_cpp *cpp, const char *name, unsigned int secs); + u32 nfp_resource_cpp_id(struct nfp_resource *res); const char *nfp_resource_name(struct nfp_resource *res); diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c index 072612263dab..b1dd13ff282b 100644 --- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c +++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c @@ -249,6 +249,51 @@ void nfp_resource_release(struct nfp_resource *res) kfree(res); } +/** + * nfp_resource_wait() - Wait for resource to appear + * @cpp: NFP CPP handle + * @name: Name of the resource + * @secs: Number of seconds to wait + * + * Wait for resource to appear in the resource table, grab and release + * its lock. The wait is jiffies-based, don't expect fine granularity. + * + * Return: 0 on success, errno otherwise. + */ +int nfp_resource_wait(struct nfp_cpp *cpp, const char *name, unsigned int secs) +{ + unsigned long warn_at = jiffies + NFP_MUTEX_WAIT_FIRST_WARN * HZ; + unsigned long err_at = jiffies + secs * HZ; + struct nfp_resource *res; + + while (true) { + res = nfp_resource_acquire(cpp, name); + if (!IS_ERR(res)) { + nfp_resource_release(res); + return 0; + } + + if (PTR_ERR(res) != -ENOENT) { + nfp_err(cpp, "error waiting for resource %s: %ld\n", + name, PTR_ERR(res)); + return PTR_ERR(res); + } + if (time_is_before_eq_jiffies(err_at)) { + nfp_err(cpp, "timeout waiting for resource %s\n", name); + return -ETIMEDOUT; + } + if (time_is_before_eq_jiffies(warn_at)) { + warn_at = jiffies + NFP_MUTEX_WAIT_NEXT_WARN * HZ; + nfp_info(cpp, "waiting for NFP resource %s\n", name); + } + if (msleep_interruptible(10)) { + nfp_err(cpp, "wait for resource %s interrupted\n", + name); + return -ERESTARTSYS; + } + } +} + /** * nfp_resource_cpp_id() - Return the cpp_id of a resource handle * @res: NFP Resource handle -- 2.14.1
[PATCH net v2 2/3] nfp: wait for board state before talking to the NSP
Board state informs us which low-level initialization stages the card has completed. We should wait for the card to be fully initialized before trying to communicate with it, not only before we configure passing traffic. Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/nfp_main.c | 43 +++ drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 23 2 files changed, 43 insertions(+), 23 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c index f055b1774d65..424707d41fbd 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c @@ -74,6 +74,45 @@ static const struct pci_device_id nfp_pci_device_ids[] = { }; MODULE_DEVICE_TABLE(pci, nfp_pci_device_ids); +static bool nfp_board_ready(struct nfp_pf *pf) +{ + const char *cp; + long state; + int err; + + cp = nfp_hwinfo_lookup(pf->hwinfo, "board.state"); + if (!cp) + return false; + + err = kstrtol(cp, 0, ); + if (err < 0) + return false; + + return state == 15; +} + +static int nfp_pf_board_state_wait(struct nfp_pf *pf) +{ + const unsigned long wait_until = jiffies + 10 * HZ; + + while (!nfp_board_ready(pf)) { + if (time_is_before_eq_jiffies(wait_until)) { + nfp_err(pf->cpp, "NFP board initialization timeout\n"); + return -EINVAL; + } + + nfp_info(pf->cpp, "waiting for board initialization\n"); + if (msleep_interruptible(500)) + return -ERESTARTSYS; + + /* Refresh cached information */ + kfree(pf->hwinfo); + pf->hwinfo = nfp_hwinfo_read(pf->cpp); + } + + return 0; +} + static int nfp_pcie_sriov_read_nfd_limit(struct nfp_pf *pf) { int err; @@ -425,6 +464,10 @@ static int nfp_pci_probe(struct pci_dev *pdev, nfp_hwinfo_lookup(pf->hwinfo, "assembly.revision"), nfp_hwinfo_lookup(pf->hwinfo, "cpld.version")); + err = nfp_pf_board_state_wait(pf); + if (err) + goto err_hwinfo_free; + err = devlink_register(devlink, >dev); if (err) goto err_hwinfo_free; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c index 5abb9ba31e7d..ff373acd28f3 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c @@ -64,23 +64,6 @@ #define NFP_PF_CSR_SLICE_SIZE (32 * 1024) -static int nfp_is_ready(struct nfp_pf *pf) -{ - const char *cp; - long state; - int err; - - cp = nfp_hwinfo_lookup(pf->hwinfo, "board.state"); - if (!cp) - return 0; - - err = kstrtol(cp, 0, ); - if (err < 0) - return 0; - - return state == 15; -} - /** * nfp_net_get_mac_addr() - Get the MAC address. * @pf: NFP PF handle @@ -725,12 +708,6 @@ int nfp_net_pci_probe(struct nfp_pf *pf) INIT_WORK(>port_refresh_work, nfp_net_refresh_vnics); - /* Verify that the board has completed initialization */ - if (!nfp_is_ready(pf)) { - nfp_err(pf->cpp, "NFP is not ready for NIC operation.\n"); - return -EINVAL; - } - if (!pf->rtbl) { nfp_err(pf->cpp, "No %s, giving up.\n", pf->fw_loaded ? "symbol table" : "firmware found"); -- 2.14.1
RFC: Audit Kernel Container IDs
Containers are a userspace concept. The kernel knows nothing of them. The Linux audit system needs a way to be able to track the container provenance of events and actions. Audit needs the kernel's help to do this. Since the concept of a container is entirely a userspace concept, a trigger signal from the userspace container orchestration system initiates this. This will define a point in time and a set of resources associated with a particular container with an audit container ID. The trigger is a pseudo filesystem (proc, since PID tree already exists) write of a u64 representing the container ID to a file representing a process that will become the first process in a new container. This might place restrictions on mount namespaces required to define a container, or at least careful checking of namespaces in the kernel to verify permissions of the orchestrator so it can't change its own container ID. A bind mount of nsfs may be necessary in the container orchestrator's mntNS. Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo filesystem to have this action permitted. At that time, record the child container's user-supplied 64-bit container identifier along with the child container's first process (which may become the container's "init" process) process ID (referenced from the initial PID namespace), all namespace IDs (in the form of a nsfs device number and inode number tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying op=$action field. Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid container ID present on an auditable action or event. Forked and cloned processes inherit their parent's container ID, referenced in the process' audit_context struct. Log the creation of every namespace, inheriting/adding its spawning process' containerID(s), if applicable. Include the spawning and spawned namespace IDs (device and inode number tuples). [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)] Note: At this point it appears only network namespaces may need to track container IDs apart from processes since incoming packets may cause an auditable event before being associated with a process. Log the destruction of every namespace when it is no longer used by any process, include the namespace IDs (device and inode number tuples). [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)] Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action) the parent and child namespace IDs for any changes to a process' namespaces. [setns(2)] Note: It may be possible to combine AUDIT_NS_* record formats and distinguish them with an op=$action field depending on the fields required for each message type. A process can be moved from one container to another by using the container assignment method outlined above a second time. When a container ceases to exist because the last process in that container has exited and hence the last namespace has been destroyed and its refcount dropping to zero, log the fact. (This latter is likely needed for certification accountability.) A container object may need a list of processes and/or namespaces. A namespace cannot directly migrate from one container to another but could be assigned to a newly spawned container. A namespace can be moved from one container to another indirectly by having that namespace used in a second process in another container and then ending all the processes in the first container. Feedback please. - RGB -- Richard Guy BriggsSr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
On 2017-09-13 19:55, Eric Dumazet wrote: On Wed, 2017-09-13 at 09:42 -0700, Eric Dumazet wrote: On Wed, 2017-09-13 at 19:27 +0300, Denys Fedoryshchenko wrote: > On 2017-09-13 19:16, Eric Dumazet wrote: > > On Wed, 2017-09-13 at 18:34 +0300, Denys Fedoryshchenko wrote: > >> Well, probably i am answering my own question, removing estimator from > >> classes seems drastically improve situation. > >> It seems estimator has some issues that cause shaper to behave > >> incorrectly (throttling traffic while it should not). > >> But i guess thats a bug? > >> As i was not able to predict such bottleneck by CPU load measurements. > > > > Well, there was a reason we disabled HTB class estimators by default ;) > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=64153ce0a7b61b2a5cacb01805cbf670142339e9 > > As soon as disabling it solve my problem - i'm fine, hehe, but i guess > other people who might hit this problem, should be aware how to find > reason. > They should not be disappointed in Linux :) Well, if they enable rate estimators while kernel does not set them by default, they get what they want, at a cost. > Because i can't measure this bottleneck before it happens, i'm seeing on > mpstat all cpu's are idle, and same time traffic is throttled. Normally things were supposed to get much better in linux-4.10 ( https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=1c0d32fde5bdf1184bc274f864c09799278a1114 ) But I apparently added a scaling bug. I will try : diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c index 0385dece1f6fe5e26df1ce5f40956a79a2eebbf4..7c1ffd6f950172c1915d8e5fa2b5e3f77e4f4c78 100644 --- a/net/core/gen_estimator.c +++ b/net/core/gen_estimator.c @@ -83,10 +83,10 @@ static void est_timer(unsigned long arg) u64 rate, brate; est_fetch_counters(est, ); - brate = (b.bytes - est->last_bytes) << (8 - est->ewma_log); + brate = (b.bytes - est->last_bytes) << (10 - est->ewma_log - est->intvl_log); brate -= (est->avbps >> est->ewma_log); - rate = (u64)(b.packets - est->last_packets) << (8 - est->ewma_log); + rate = (u64)(b.packets - est->last_packets) << (10 - est->ewma_log - est->intvl_log); rate -= (est->avpps >> est->ewma_log); write_seqcount_begin(>seq); Much better indeed # tc -s -d class sh dev eth0 classid 7002:11 ; sleep 10 ;tc -s -d class sh dev eth0 classid 7002:11 class htb 7002:11 parent 7002:1 prio 5 quantum 20 rate 5Gbit ceil 5Gbit linklayer ethernet burst 8b/1 mpu 0b cburst 8b/1 mpu 0b level 0 rate_handle 1 Sent 389085117074 bytes 256991500 pkt (dropped 0, overlimits 5926926 requeues 0) rate 4999Mbit 412762pps backlog 136260b 2p requeues 0 TCP pkts/rtx 256991584/0 bytes 389085252840/0 lended: 5961250 borrowed: 0 giants: 0 tokens: -1664 ctokens: -1664 class htb 7002:11 parent 7002:1 prio 5 quantum 20 rate 5Gbit ceil 5Gbit linklayer ethernet burst 8b/1 mpu 0b cburst 8b/1 mpu 0b level 0 rate_handle 1 Sent 395336315580 bytes 261120429 pkt (dropped 0, overlimits 6021776 requeues 0) rate 4999Mbit 412788pps backlog 68Kb 2p requeues 0 TCP pkts/rtx 261120469/0 bytes 395336384730/0 lended: 6056793 borrowed: 0 giants: 0 tokens: -1478 ctokens: -1478 echo "(395336315580-389085117074)/10*8" | bc 5000958800 For my case, as load increased now, i am hitting same issue (i tried to play with quantum / bursts as well, didnt helped): tc -s -d class show dev eth3.777 classid 1:111;sleep 5;tc -s -d class show dev eth3.777 classid 1:111 class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 20Gbit ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 10b/1 mpu 0b level 0 Sent 864151559 bytes 730566 pkt (dropped 15111, overlimits 0 requeues 0) backlog 73968000b 39934p requeues 0 lended: 499867 borrowed: 0 giants: 0 tokens: 608 ctokens: 121 class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 20Gbit ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 10b/1 mpu 0b level 0 Sent 1469352160 bytes 1243649 pkt (dropped 42933, overlimits 0 requeues 0) backlog 82536047b 39963p requeues 0 lended: 810475 borrowed: 0 giants: 0 tokens: 612 ctokens: 122 (1469352160-864151559)/5*8 968320961.6000 Less than 1Gbit and it's being throttled Total bandwidth: class htb 1:1 root rate 100Gbit ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 10b/1 mpu 0b level 7 Sent 7839730635 bytes 8537393 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 lended: 0 borrowed: 0 giants: 0 tokens: 123 ctokens: 123 class htb 1:1 root rate 100Gbit ceil 100Gbit linklayer ethernet burst 10b/1 mpu 0b cburst 10b/1 mpu 0b level 7 Sent 11043190453 bytes 12008366 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 lended: 0 borrowed: 0 giants: 0 tokens: 124 ctokens: 124 694kpps 5.1Gbit
Re: Regression in throughput between kvm guests over virtual bridge
On 09/13/2017 04:13 AM, Jason Wang wrote: > > > On 2017年09月13日 09:16, Jason Wang wrote: >> >> >> On 2017年09月13日 01:56, Matthew Rosato wrote: >>> We are seeing a regression for a subset of workloads across KVM guests >>> over a virtual bridge between host kernel 4.12 and 4.13. Bisecting >>> points to c67df11f "vhost_net: try batch dequing from skb array" >>> >>> In the regressed environment, we are running 4 kvm guests, 2 running as >>> uperf servers and 2 running as uperf clients, all on a single host. >>> They are connected via a virtual bridge. The uperf client profile looks >>> like: >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> So, 1 tcp streaming instance per client. When upgrading the host kernel >>> from 4.12->4.13, we see about a 30% drop in throughput for this >>> scenario. After the bisect, I further verified that reverting c67df11f >>> on 4.13 "fixes" the throughput for this scenario. >>> >>> On the other hand, if we increase the load by upping the number of >>> streaming instances to 50 (nprocs="50") or even 10, we see instead a >>> ~10% increase in throughput when upgrading host from 4.12->4.13. >>> >>> So it may be the issue is specific to "light load" scenarios. I would >>> expect some overhead for the batching, but 30% seems significant... Any >>> thoughts on what might be happening here? >>> >> >> Hi, thanks for the bisecting. Will try to see if I can reproduce. >> Various factors could have impact on stream performance. If possible, >> could you collect the #pkts and average packet size during the test? >> And if you guest version is above 4.12, could you please retry with >> napi_tx=true? Original runs were done with guest kernel 4.4 (from ubuntu 16.04.3 - 4.4.0-93-generic specifically). Here's a throughput report (uperf) and #pkts and average packet size (tcpstat) for one of the uperf clients: host 4.12 / guest 4.4: throughput: 29.98Gb/s #pkts=33465571 avg packet size=33755.70 host 4.13 / guest 4.4: throughput: 20.36Gb/s #pkts=21233399 avg packet size=36130.69 I ran the test again using net-next.git as guest kernel, with and without napi_tx=true. napi_tx did not seem to have any significant impact on throughput. However, the guest kernel shift from 4.4->net-next improved things. I can still see a regression between host 4.12 and 4.13, but it's more on the order of 10-15% - another sample: host 4.12 / guest net-next (without napi_tx): throughput: 28.88Gb/s #pkts=31743116 avg packet size=33779.78 host 4.13 / guest net-next (without napi_tx): throughput: 24.34Gb/s #pkts=25532724 avg packet size=35963.20 >> >> Thanks > > Unfortunately, I could not reproduce it locally. I'm using net-next.git > as guest. I can get ~42Gb/s on Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz > for both before and after the commit. I use 1 vcpu and 1 queue, and pin > vcpu and vhost threads into separate cpu on host manually (in same numa > node). The environment is quite a bit different -- I'm running in an LPAR on a z13 (s390x). We've seen the issue in various configurations, the smallest thus far was a host partition w/ 40G and 20 CPUs defined (the numbers above were gathered w/ this configuration). Each guest has 4GB and 4 vcpus. No pinning / affinity configured. > > Can you hit this regression constantly and what's you qemu command line Yes, the regression seems consistent. I can try tweaking some of the host and guest definitions to see if it makes a difference. The guests are instantiated from libvirt - Here's one of the resulting qemu command lines: /usr/bin/qemu-system-s390x -name guest=mjrs34g1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-mjrs34g1/master-key.aes -machine s390-ccw-virtio-2.10,accel=kvm,usb=off,dump-guest-core=off -m 4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid 44710587-e783-4bd8-8590-55ff421431b1 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-mjrs34g1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -drive file=/dev/disk/by-id/scsi-3600507630bffc0381803,format=raw,if=none,id=drive-virtio-disk0 -device virtio-blk-ccw,scsi=off,devno=fe.0.,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:de:26:53:14:01,devno=fe.0.0001 -netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device virtio-net-ccw,netdev=hostnet1,id=net1,mac=02:54:00:89:d4:01,devno=fe.0.00a1 -chardev pty,id=charconsole0 -device sclpconsole,chardev=charconsole0,id=console0 -device virtio-balloon-ccw,id=balloon0,devno=fe.0.0002 -msg timestamp=on In the above, net0 is used for a macvtap connection (not used in the experiment, just for a reliable ssh connection - can remove if needed). net1 is the
Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
On Wed, 2017-09-13 at 09:42 -0700, Eric Dumazet wrote: > On Wed, 2017-09-13 at 19:27 +0300, Denys Fedoryshchenko wrote: > > On 2017-09-13 19:16, Eric Dumazet wrote: > > > On Wed, 2017-09-13 at 18:34 +0300, Denys Fedoryshchenko wrote: > > >> Well, probably i am answering my own question, removing estimator from > > >> classes seems drastically improve situation. > > >> It seems estimator has some issues that cause shaper to behave > > >> incorrectly (throttling traffic while it should not). > > >> But i guess thats a bug? > > >> As i was not able to predict such bottleneck by CPU load measurements. > > > > > > Well, there was a reason we disabled HTB class estimators by default ;) > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=64153ce0a7b61b2a5cacb01805cbf670142339e9 > > > > As soon as disabling it solve my problem - i'm fine, hehe, but i guess > > other people who might hit this problem, should be aware how to find > > reason. > > They should not be disappointed in Linux :) > > Well, if they enable rate estimators while kernel does not set them by > default, they get what they want, at a cost. > > > Because i can't measure this bottleneck before it happens, i'm seeing on > > mpstat all cpu's are idle, and same time traffic is throttled. > > Normally things were supposed to get much better in linux-4.10 > > ( > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=1c0d32fde5bdf1184bc274f864c09799278a1114 > ) > > But I apparently added a scaling bug. > > I will try : > > diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c > index > 0385dece1f6fe5e26df1ce5f40956a79a2eebbf4..7c1ffd6f950172c1915d8e5fa2b5e3f77e4f4c78 > 100644 > --- a/net/core/gen_estimator.c > +++ b/net/core/gen_estimator.c > @@ -83,10 +83,10 @@ static void est_timer(unsigned long arg) > u64 rate, brate; > > est_fetch_counters(est, ); > - brate = (b.bytes - est->last_bytes) << (8 - est->ewma_log); > + brate = (b.bytes - est->last_bytes) << (10 - est->ewma_log - > est->intvl_log); > brate -= (est->avbps >> est->ewma_log); > > - rate = (u64)(b.packets - est->last_packets) << (8 - est->ewma_log); > + rate = (u64)(b.packets - est->last_packets) << (10 - est->ewma_log - > est->intvl_log); > rate -= (est->avpps >> est->ewma_log); > > write_seqcount_begin(>seq); Much better indeed # tc -s -d class sh dev eth0 classid 7002:11 ; sleep 10 ;tc -s -d class sh dev eth0 classid 7002:11 class htb 7002:11 parent 7002:1 prio 5 quantum 20 rate 5Gbit ceil 5Gbit linklayer ethernet burst 8b/1 mpu 0b cburst 8b/1 mpu 0b level 0 rate_handle 1 Sent 389085117074 bytes 256991500 pkt (dropped 0, overlimits 5926926 requeues 0) rate 4999Mbit 412762pps backlog 136260b 2p requeues 0 TCP pkts/rtx 256991584/0 bytes 389085252840/0 lended: 5961250 borrowed: 0 giants: 0 tokens: -1664 ctokens: -1664 class htb 7002:11 parent 7002:1 prio 5 quantum 20 rate 5Gbit ceil 5Gbit linklayer ethernet burst 8b/1 mpu 0b cburst 8b/1 mpu 0b level 0 rate_handle 1 Sent 395336315580 bytes 261120429 pkt (dropped 0, overlimits 6021776 requeues 0) rate 4999Mbit 412788pps backlog 68Kb 2p requeues 0 TCP pkts/rtx 261120469/0 bytes 395336384730/0 lended: 6056793 borrowed: 0 giants: 0 tokens: -1478 ctokens: -1478 echo "(395336315580-389085117074)/10*8" | bc 5000958800
Re: [PATCH net 0/3] nfp: wait more carefully for card init
On Wed, 13 Sep 2017 09:39:02 -0700 (PDT), David Miller wrote: > From: Jakub Kicinski> Date: Wed, 13 Sep 2017 08:51:28 -0700 > > > The first patch is a small fix for flower offload, we need a whitelist > > of supported matches, otherwise the unsupported ones will be ignored. > > > > The second and the third patch are adding wait/polling to the probe path. > > We had reports of driver failing probe because it couldn't find the > > control process (NSP) on the card. Turns out the NSP will only announce > > its existence after it's fully initialized. Until now we assumed it > > will be reachable, just not processing commands (hence we wait for > > a NOOP command to execute successfully). > > Please build test your changes and look at what the compiler says: > > drivers/net/ethernet/netronome/nfp/nfp_main.c: In function ‘nfp_fw_unload’: > drivers/net/ethernet/netronome/nfp/nfp_main.c:395:10: warning: ‘return’ with > a value, in function returning void >return err; > ^~~ > drivers/net/ethernet/netronome/nfp/nfp_main.c:388:13: note: declared here > static void nfp_fw_unload(struct nfp_pf *pf) > ^ Sorry about that!
Re: Memory leaks in conntrack
On Wed, Sep 13, 2017 at 1:05 AM, Florian Westphalwrote: > Cong Wang wrote: >> While testing my TC filter patches (so not related to conntrack), the >> following memory leaks are shown up: >> >> unreferenced object 0x9b19ba551228 (size 128): >> comm "chronyd", pid 338, jiffies 4294910829 (age 53.188s) >> hex dump (first 32 bytes): >> 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b >> 00 00 00 00 18 00 00 30 00 00 00 00 00 00 00 00 ...0 >> backtrace: >> [] create_object+0x169/0x2aa >> [] kmemleak_alloc+0x25/0x41 >> [] slab_post_alloc_hook+0x44/0x65 >> [] __kmalloc_track_caller+0x113/0x146 >> [] __krealloc+0x4a/0x69 >> [] nf_ct_ext_add+0xe1/0x145 >> [] init_conntrack+0x1f7/0x36e >> [] nf_conntrack_in+0x1d3/0x326 >> [] ipv4_conntrack_local+0x4d/0x50 >> [] nf_hook_slow+0x3c/0x9b >> [] nf_hook.constprop.40+0xbe/0xd8 >> [] __ip_local_out+0xb3/0xbf >> [] ip_local_out+0x1c/0x36 >> [] ip_send_skb+0x19/0x3d >> [] udp_send_skb+0x17e/0x1df >> [] udp_sendmsg+0x5a2/0x77c >> unreferenced object 0x9b19a69b3340 (size 336): >> comm "chronyd", pid 338, jiffies 4294910868 (age 53.032s) >> hex dump (first 32 bytes): >> 01 00 00 00 5a 5a 5a 5a 00 00 00 00 ad 4e ad de .N.. >> ff ff ff ff 5a 5a 5a 5a ff ff ff ff ff ff ff ff >> backtrace: >> [] create_object+0x169/0x2aa >> [] kmemleak_alloc+0x25/0x41 >> [] slab_post_alloc_hook+0x44/0x65 >> [] kmem_cache_alloc+0xd7/0x1f1 >> [] __nf_conntrack_alloc+0xa2/0x146 >> [] init_conntrack+0xb2/0x36e >> [] nf_conntrack_in+0x1d3/0x326 >> [] ipv4_conntrack_local+0x4d/0x50 >> [] nf_hook_slow+0x3c/0x9b >> [] nf_hook.constprop.40+0xbe/0xd8 >> [] __ip_local_out+0xb3/0xbf >> [] ip_local_out+0x1c/0x36 >> [] ip_send_skb+0x19/0x3d >> [] udp_send_skb+0x17e/0x1df >> [] udp_sendmsg+0x5a2/0x77c >> [] inet_sendmsg+0x37/0x5e >> >> I don't touch chronyd in my VM, so I have no idea why it sends out UDP >> packets, my guess is it is some periodical packet. >> >> I don't think I use conntrack either, since /proc/net/ip_conntrack >> does not exist. > > You probably do, can you try "cat /proc/net/nf_conntrack" instead? > > (otherwise there should be no ipv4_conntrack_local() invocation > since we would not register this hook at all). Yeah it is very weird but it is true: [root@localhost ~]# echo scan > /sys/kernel/debug/kmemleak [ 133.450823] kmemleak: 18 new suspected memory leaks (see /sys/kernel/debug/kmemleak) [root@localhost ~]# cat /proc/net/ip_conntrack cat: /proc/net/ip_conntrack: No such file or directory [root@localhost ~]# cat /sys/kernel/debug/kmemleak unreferenced object 0x95c1e0b24040 (size 336): ... > > I tried to reproduce this but so far I had no success. > If you can identify something that could give a hint when this > is happening (only once after boot, periodically, only with udp, etc) > please let us know. > > (A reproducer would be even better of course ;-) ) Actually, it is even simpler to reproduce, nothing is needed but wait. I thought it is somewhat triggered by my tests, but actually no. For me, just boot the VM and wait for several seconds, memleak will show up. (chronyd is started by systemd during boot, not me.) > > Is this with current net tree? Yes, I just pulled DaveM's net tree and recompiled the kernel, still 100% reproducible here.
Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
On Wed, 2017-09-13 at 19:27 +0300, Denys Fedoryshchenko wrote: > On 2017-09-13 19:16, Eric Dumazet wrote: > > On Wed, 2017-09-13 at 18:34 +0300, Denys Fedoryshchenko wrote: > >> Well, probably i am answering my own question, removing estimator from > >> classes seems drastically improve situation. > >> It seems estimator has some issues that cause shaper to behave > >> incorrectly (throttling traffic while it should not). > >> But i guess thats a bug? > >> As i was not able to predict such bottleneck by CPU load measurements. > > > > Well, there was a reason we disabled HTB class estimators by default ;) > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=64153ce0a7b61b2a5cacb01805cbf670142339e9 > > As soon as disabling it solve my problem - i'm fine, hehe, but i guess > other people who might hit this problem, should be aware how to find > reason. > They should not be disappointed in Linux :) Well, if they enable rate estimators while kernel does not set them by default, they get what they want, at a cost. > Because i can't measure this bottleneck before it happens, i'm seeing on > mpstat all cpu's are idle, and same time traffic is throttled. Normally things were supposed to get much better in linux-4.10 ( https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=1c0d32fde5bdf1184bc274f864c09799278a1114 ) But I apparently added a scaling bug. I will try : diff --git a/net/core/gen_estimator.c b/net/core/gen_estimator.c index 0385dece1f6fe5e26df1ce5f40956a79a2eebbf4..7c1ffd6f950172c1915d8e5fa2b5e3f77e4f4c78 100644 --- a/net/core/gen_estimator.c +++ b/net/core/gen_estimator.c @@ -83,10 +83,10 @@ static void est_timer(unsigned long arg) u64 rate, brate; est_fetch_counters(est, ); - brate = (b.bytes - est->last_bytes) << (8 - est->ewma_log); + brate = (b.bytes - est->last_bytes) << (10 - est->ewma_log - est->intvl_log); brate -= (est->avbps >> est->ewma_log); - rate = (u64)(b.packets - est->last_packets) << (8 - est->ewma_log); + rate = (u64)(b.packets - est->last_packets) << (10 - est->ewma_log - est->intvl_log); rate -= (est->avpps >> est->ewma_log); write_seqcount_begin(>seq); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
Re: [PATCH net 0/3] nfp: wait more carefully for card init
From: Jakub KicinskiDate: Wed, 13 Sep 2017 08:51:28 -0700 > The first patch is a small fix for flower offload, we need a whitelist > of supported matches, otherwise the unsupported ones will be ignored. > > The second and the third patch are adding wait/polling to the probe path. > We had reports of driver failing probe because it couldn't find the > control process (NSP) on the card. Turns out the NSP will only announce > its existence after it's fully initialized. Until now we assumed it > will be reachable, just not processing commands (hence we wait for > a NOOP command to execute successfully). Please build test your changes and look at what the compiler says: drivers/net/ethernet/netronome/nfp/nfp_main.c: In function ‘nfp_fw_unload’: drivers/net/ethernet/netronome/nfp/nfp_main.c:395:10: warning: ‘return’ with a value, in function returning void return err; ^~~ drivers/net/ethernet/netronome/nfp/nfp_main.c:388:13: note: declared here static void nfp_fw_unload(struct nfp_pf *pf) ^
Re: [patch net] net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker
From: Jiri PirkoDate: Wed, 13 Sep 2017 17:32:37 +0200 > From: Jiri Pirko > > Recent commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu") removed > freeing in call_rcu, which changed already existing hard-to-hit > race condition into 100% hit: > > [ 598.599825] BUG: unable to handle kernel NULL pointer dereference at > 0030 > [ 598.607782] IP: tcf_action_destroy+0xc0/0x140 > > Or: > > [ 40.858924] BUG: unable to handle kernel NULL pointer dereference at > 0030 > [ 40.862840] IP: tcf_generic_walker+0x534/0x820 > > Fix this by storing the ops and use them directly for module_put call. > > Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common") > Signed-off-by: Jiri Pirko Applied, thanks Jiri.
Re: [PATCH net] be2net: fix TSO6/GSO issue causing TX-stall on Lancer/BEx
From: Suresh ReddyDate: Wed, 13 Sep 2017 11:12:42 -0400 > IPv6 TSO requests with extension hdrs are a problem to the > Lancer and BEx chips. Workaround is to disable TSO6 feature > for such packets. > > Also in Lancer chips, MSS less than 256 was resulting in TX stall. > Fix this by disabling GSO when MSS less than 256. > > Signed-off-by: Suresh Reddy Applied.
Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
On 2017-09-13 19:16, Eric Dumazet wrote: On Wed, 2017-09-13 at 18:34 +0300, Denys Fedoryshchenko wrote: Well, probably i am answering my own question, removing estimator from classes seems drastically improve situation. It seems estimator has some issues that cause shaper to behave incorrectly (throttling traffic while it should not). But i guess thats a bug? As i was not able to predict such bottleneck by CPU load measurements. Well, there was a reason we disabled HTB class estimators by default ;) https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=64153ce0a7b61b2a5cacb01805cbf670142339e9 As soon as disabling it solve my problem - i'm fine, hehe, but i guess other people who might hit this problem, should be aware how to find reason. They should not be disappointed in Linux :) Because i can't measure this bottleneck before it happens, i'm seeing on mpstat all cpu's are idle, and same time traffic is throttled.
Re: [PATCH net] sctp: potential read out of bounds in sctp_ulpevent_type_enabled()
From: Dan CarpenterDate: Wed, 13 Sep 2017 12:20:28 +0300 > @@ -154,7 +154,11 @@ static inline int sctp_ulpevent_type_enabled(__u16 > sn_type, >struct sctp_event_subscribe *mask) > { > char *amask = (char *) mask; > - return amask[sn_type - SCTP_SN_TYPE_BASE]; > + int offset = sn_type - SCTP_SN_TYPE_BASE; Please use reverse-christmas-tree local variable ordering. Thank you.
Re: [PATCH 06/10] drivers:ethernet: return -ENOMEM on allocation failure.
From: Allen PaisDate: Wed, 13 Sep 2017 13:02:15 +0530 > Signed-off-by: Allen Pais This is quite pointless as the caller doesn't do anything with the value, it just tests whether a negative value is returned or not.
Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
On Wed, 2017-09-13 at 18:34 +0300, Denys Fedoryshchenko wrote: > Well, probably i am answering my own question, removing estimator from > classes seems drastically improve situation. > It seems estimator has some issues that cause shaper to behave > incorrectly (throttling traffic while it should not). > But i guess thats a bug? > As i was not able to predict such bottleneck by CPU load measurements. Well, there was a reason we disabled HTB class estimators by default ;) https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=64153ce0a7b61b2a5cacb01805cbf670142339e9
Re: [PATCH] dt-bindings: net: renesas-ravb: Add support for R8A77995 RAVB
On 09/13/2017 07:02 PM, Sergei Shtylyov wrote: Add a new compatible string for the R8A77995 (R-Car D3) RAVB. Signed-off-by: Yoshihiro Shimoda--- Documentation/devicetree/bindings/net/renesas,ravb.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt b/Documentation/devicetree/bindings/net/renesas,ravb.txt index 1672353..ae2213f 100644 --- a/Documentation/devicetree/bindings/net/renesas,ravb.txt +++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt @@ -17,6 +17,7 @@ Required properties: - "renesas,etheravb-r8a7795" for the R8A7795 SoC. - "renesas,etheravb-r8a7796" for the R8A7796 SoC. + - "renesas,etheravb-r8a77995" for the R8A77995 SoC. That would conflict with the R8A77970 bindings patch posted by me yesterday. Please respin atop of it. Here's the link for your convenience: https://marc.info/?l=linux-renesas-soc=150524655515476 [...] MBR, Sergei
Re: [PATCH] dt-bindings: net: renesas-ravb: Add support for R8A77995 RAVB
Hello! On 09/13/2017 03:17 PM, Yoshihiro Shimoda wrote: Add a new compatible string for the R8A77995 (R-Car D3) RAVB. Signed-off-by: Yoshihiro Shimoda--- Documentation/devicetree/bindings/net/renesas,ravb.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt b/Documentation/devicetree/bindings/net/renesas,ravb.txt index 1672353..ae2213f 100644 --- a/Documentation/devicetree/bindings/net/renesas,ravb.txt +++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt @@ -17,6 +17,7 @@ Required properties: - "renesas,etheravb-r8a7795" for the R8A7795 SoC. - "renesas,etheravb-r8a7796" for the R8A7796 SoC. + - "renesas,etheravb-r8a77995" for the R8A77995 SoC. That would conflict with the R8A77970 bindings patch posted by me yesterday. Please respin atop of it. - "renesas,etheravb-rcar-gen3" as a fallback for the above R-Car Gen3 devices. One more fragment is needed here, about the mandatory 'interrupt-names" prop. My patch makes this change unnecessary, however... MBR, Sergei
Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
On 2017-09-13 18:51, Eric Dumazet wrote: On Wed, 2017-09-13 at 18:20 +0300, Denys Fedoryshchenko wrote: Hi, I noticed after increasing bandwidth over some amount HTB started to throttle classes it should not throttle. Also estimated rate in htb totally wrong, while byte counters is correct. Is there any overflow or something? Thanks Denys for the report, I will take a look at this, since I probably introduced some regression. It's definitely not something recent, this system was on older kernel with uptime over 200 days, and this bottleneck was present, i noticed it long time before. But never tried to remove estimators (increasing burst/cburst to insane values saved me for a while).
[PATCH net 3/3] nfp: wait for the NSP resource to appear on boot
The control process (NSP) may take some time to complete its initialization. This is not a problem on most servers, but on very fast-booting machines it may not be ready for operation when driver probes the device. There is also a version of the flash in the wild where NSP tries to train the links as part of init. To wait for NSP initialization we should make sure its resource has already been added to the resource table. NSP adds itself there as last step of init. Signed-off-by: Jakub KicinskiReviewed-by: Simon Horman --- drivers/net/ethernet/netronome/nfp/nfp_main.c | 4 ++ drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h | 2 + .../ethernet/netronome/nfp/nfpcore/nfp_resource.c | 45 ++ 3 files changed, 51 insertions(+) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c index 424707d41fbd..1c17b261a21c 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c @@ -390,6 +390,10 @@ static void nfp_fw_unload(struct nfp_pf *pf) struct nfp_nsp *nsp; int err; + err = nfp_resource_wait(pf->cpp, NFP_RESOURCE_NSP, 30); + if (err) + return err; + nsp = nfp_nsp_open(pf->cpp); if (IS_ERR(nsp)) { nfp_err(pf->cpp, "Reset failed, can't open NSP\n"); diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h index 1a8d04a1e113..3ce51f03126f 100644 --- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h +++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h @@ -97,6 +97,8 @@ nfp_resource_acquire(struct nfp_cpp *cpp, const char *name); void nfp_resource_release(struct nfp_resource *res); +int nfp_resource_wait(struct nfp_cpp *cpp, const char *name, unsigned int secs); + u32 nfp_resource_cpp_id(struct nfp_resource *res); const char *nfp_resource_name(struct nfp_resource *res); diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c index 072612263dab..b1dd13ff282b 100644 --- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c +++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c @@ -249,6 +249,51 @@ void nfp_resource_release(struct nfp_resource *res) kfree(res); } +/** + * nfp_resource_wait() - Wait for resource to appear + * @cpp: NFP CPP handle + * @name: Name of the resource + * @secs: Number of seconds to wait + * + * Wait for resource to appear in the resource table, grab and release + * its lock. The wait is jiffies-based, don't expect fine granularity. + * + * Return: 0 on success, errno otherwise. + */ +int nfp_resource_wait(struct nfp_cpp *cpp, const char *name, unsigned int secs) +{ + unsigned long warn_at = jiffies + NFP_MUTEX_WAIT_FIRST_WARN * HZ; + unsigned long err_at = jiffies + secs * HZ; + struct nfp_resource *res; + + while (true) { + res = nfp_resource_acquire(cpp, name); + if (!IS_ERR(res)) { + nfp_resource_release(res); + return 0; + } + + if (PTR_ERR(res) != -ENOENT) { + nfp_err(cpp, "error waiting for resource %s: %ld\n", + name, PTR_ERR(res)); + return PTR_ERR(res); + } + if (time_is_before_eq_jiffies(err_at)) { + nfp_err(cpp, "timeout waiting for resource %s\n", name); + return -ETIMEDOUT; + } + if (time_is_before_eq_jiffies(warn_at)) { + warn_at = jiffies + NFP_MUTEX_WAIT_NEXT_WARN * HZ; + nfp_info(cpp, "waiting for NFP resource %s\n", name); + } + if (msleep_interruptible(10)) { + nfp_err(cpp, "wait for resource %s interrupted\n", + name); + return -ERESTARTSYS; + } + } +} + /** * nfp_resource_cpp_id() - Return the cpp_id of a resource handle * @res: NFP Resource handle -- 2.14.1
[PATCH net 1/3] nfp: add whitelist of supported flow dissector
From: Pieter Jansen van VuurenPreviously we did not check the flow dissector against a list of allowed and supported flow key dissectors. This patch introduces such a list and correctly rejects unsupported flow keys. Fixes: 43f84b72c50d ("nfp: add metadata to each flow offload") Signed-off-by: Pieter Jansen van Vuuren Reviewed-by: Simon Horman Reviewed-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/flower/offload.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c index d396183108f7..a18b4d2b1d3e 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c @@ -44,6 +44,16 @@ #include "../nfp_net.h" #include "../nfp_port.h" +#define NFP_FLOWER_WHITELIST_DISSECTOR \ + (BIT(FLOW_DISSECTOR_KEY_CONTROL) | \ +BIT(FLOW_DISSECTOR_KEY_BASIC) | \ +BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) | \ +BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) | \ +BIT(FLOW_DISSECTOR_KEY_PORTS) | \ +BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) | \ +BIT(FLOW_DISSECTOR_KEY_VLAN) | \ +BIT(FLOW_DISSECTOR_KEY_IP)) + static int nfp_flower_xmit_flow(struct net_device *netdev, struct nfp_fl_payload *nfp_flow, u8 mtype) @@ -112,6 +122,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls, u8 key_layer; int key_size; + if (flow->dissector->used_keys & ~NFP_FLOWER_WHITELIST_DISSECTOR) + return -EOPNOTSUPP; + if (dissector_uses_key(flow->dissector, FLOW_DISSECTOR_KEY_ENC_CONTROL)) { struct flow_dissector_key_control *mask_enc_ctl = -- 2.14.1
Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
On Wed, 2017-09-13 at 18:20 +0300, Denys Fedoryshchenko wrote: > Hi, > > I noticed after increasing bandwidth over some amount HTB started to > throttle classes it should not throttle. > Also estimated rate in htb totally wrong, while byte counters is > correct. > > Is there any overflow or something? Thanks Denys for the report, I will take a look at this, since I probably introduced some regression.
[PATCH net 2/3] nfp: wait for board state before talking to the NSP
Board state informs us which low-level initialization stages the card has completed. We should wait for the card to be fully initialized before trying to communicate with it, not only before we configure passing traffic. Signed-off-by: Jakub KicinskiReviewed-by: Simon Horman --- drivers/net/ethernet/netronome/nfp/nfp_main.c | 43 +++ drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 23 2 files changed, 43 insertions(+), 23 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c index f055b1774d65..424707d41fbd 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c @@ -74,6 +74,45 @@ static const struct pci_device_id nfp_pci_device_ids[] = { }; MODULE_DEVICE_TABLE(pci, nfp_pci_device_ids); +static bool nfp_board_ready(struct nfp_pf *pf) +{ + const char *cp; + long state; + int err; + + cp = nfp_hwinfo_lookup(pf->hwinfo, "board.state"); + if (!cp) + return false; + + err = kstrtol(cp, 0, ); + if (err < 0) + return false; + + return state == 15; +} + +static int nfp_pf_board_state_wait(struct nfp_pf *pf) +{ + const unsigned long wait_until = jiffies + 10 * HZ; + + while (!nfp_board_ready(pf)) { + if (time_is_before_eq_jiffies(wait_until)) { + nfp_err(pf->cpp, "NFP board initialization timeout\n"); + return -EINVAL; + } + + nfp_info(pf->cpp, "waiting for board initialization\n"); + if (msleep_interruptible(500)) + return -ERESTARTSYS; + + /* Refresh cached information */ + kfree(pf->hwinfo); + pf->hwinfo = nfp_hwinfo_read(pf->cpp); + } + + return 0; +} + static int nfp_pcie_sriov_read_nfd_limit(struct nfp_pf *pf) { int err; @@ -425,6 +464,10 @@ static int nfp_pci_probe(struct pci_dev *pdev, nfp_hwinfo_lookup(pf->hwinfo, "assembly.revision"), nfp_hwinfo_lookup(pf->hwinfo, "cpld.version")); + err = nfp_pf_board_state_wait(pf); + if (err) + goto err_hwinfo_free; + err = devlink_register(devlink, >dev); if (err) goto err_hwinfo_free; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c index 5abb9ba31e7d..ff373acd28f3 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_main.c @@ -64,23 +64,6 @@ #define NFP_PF_CSR_SLICE_SIZE (32 * 1024) -static int nfp_is_ready(struct nfp_pf *pf) -{ - const char *cp; - long state; - int err; - - cp = nfp_hwinfo_lookup(pf->hwinfo, "board.state"); - if (!cp) - return 0; - - err = kstrtol(cp, 0, ); - if (err < 0) - return 0; - - return state == 15; -} - /** * nfp_net_get_mac_addr() - Get the MAC address. * @pf: NFP PF handle @@ -725,12 +708,6 @@ int nfp_net_pci_probe(struct nfp_pf *pf) INIT_WORK(>port_refresh_work, nfp_net_refresh_vnics); - /* Verify that the board has completed initialization */ - if (!nfp_is_ready(pf)) { - nfp_err(pf->cpp, "NFP is not ready for NIC operation.\n"); - return -EINVAL; - } - if (!pf->rtbl) { nfp_err(pf->cpp, "No %s, giving up.\n", pf->fw_loaded ? "symbol table" : "firmware found"); -- 2.14.1
[PATCH net 0/3] nfp: wait more carefully for card init
Hi! The first patch is a small fix for flower offload, we need a whitelist of supported matches, otherwise the unsupported ones will be ignored. The second and the third patch are adding wait/polling to the probe path. We had reports of driver failing probe because it couldn't find the control process (NSP) on the card. Turns out the NSP will only announce its existence after it's fully initialized. Until now we assumed it will be reachable, just not processing commands (hence we wait for a NOOP command to execute successfully). Jakub Kicinski (2): nfp: wait for board state before talking to the NSP nfp: wait for the NSP resource to appear on boot Pieter Jansen van Vuuren (1): nfp: add whitelist of supported flow dissector .../net/ethernet/netronome/nfp/flower/offload.c| 13 ++ drivers/net/ethernet/netronome/nfp/nfp_main.c | 47 ++ drivers/net/ethernet/netronome/nfp/nfp_net_main.c | 23 --- drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h | 2 + .../ethernet/netronome/nfp/nfpcore/nfp_resource.c | 45 + 5 files changed, 107 insertions(+), 23 deletions(-) -- 2.14.1
Re: 319554f284dd ("inet: don't use sk_v6_rcv_saddr directly") causes bind port regression
On 09/12/2017 04:12 PM, Josef Bacik wrote: First I’m super sorry for the top post, I’m at plumbers and I forgot to upload my muttrc to my new cloud instance, so I’m screwed using outlook. I have a completely untested, uncompiled patch that I think will fix the problem, would you mind giving it a go? Thanks, Josef Thanks for the quick turnaround. Unfortunately, the problem is still reproducible according to the reporter. Thanks, Laura
Re: Memory leaks in conntrack
Florian Westphalwrote: > Cong Wang wrote: > > While testing my TC filter patches (so not related to conntrack), the > > following memory leaks are shown up: > > > > unreferenced object 0x9b19ba551228 (size 128): > > comm "chronyd", pid 338, jiffies 4294910829 (age 53.188s) > > hex dump (first 32 bytes): > > 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b > > 00 00 00 00 18 00 00 30 00 00 00 00 00 00 00 00 ...0 > > backtrace: > > [] create_object+0x169/0x2aa > > [] kmemleak_alloc+0x25/0x41 > > [] slab_post_alloc_hook+0x44/0x65 > > [] __kmalloc_track_caller+0x113/0x146 > > [] __krealloc+0x4a/0x69 > > [] nf_ct_ext_add+0xe1/0x145 > > [] init_conntrack+0x1f7/0x36e > > [] nf_conntrack_in+0x1d3/0x326 > > [] ipv4_conntrack_local+0x4d/0x50 > > [] nf_hook_slow+0x3c/0x9b > > [] nf_hook.constprop.40+0xbe/0xd8 > > [] __ip_local_out+0xb3/0xbf > > [] ip_local_out+0x1c/0x36 > > [] ip_send_skb+0x19/0x3d > > [] udp_send_skb+0x17e/0x1df > > [] udp_sendmsg+0x5a2/0x77c > > unreferenced object 0x9b19a69b3340 (size 336): > > comm "chronyd", pid 338, jiffies 4294910868 (age 53.032s) > > hex dump (first 32 bytes): > > 01 00 00 00 5a 5a 5a 5a 00 00 00 00 ad 4e ad de .N.. > > ff ff ff ff 5a 5a 5a 5a ff ff ff ff ff ff ff ff > > backtrace: > > [] create_object+0x169/0x2aa > > [] kmemleak_alloc+0x25/0x41 > > [] slab_post_alloc_hook+0x44/0x65 > > [] kmem_cache_alloc+0xd7/0x1f1 > > [] __nf_conntrack_alloc+0xa2/0x146 > > [] init_conntrack+0xb2/0x36e > > [] nf_conntrack_in+0x1d3/0x326 > > [] ipv4_conntrack_local+0x4d/0x50 > > [] nf_hook_slow+0x3c/0x9b > > [] nf_hook.constprop.40+0xbe/0xd8 > > [] __ip_local_out+0xb3/0xbf > > [] ip_local_out+0x1c/0x36 > > [] ip_send_skb+0x19/0x3d > > [] udp_send_skb+0x17e/0x1df > > [] udp_sendmsg+0x5a2/0x77c > > [] inet_sendmsg+0x37/0x5e > > > > I don't touch chronyd in my VM, so I have no idea why it sends out UDP > > packets, my guess is it is some periodical packet. > > > > I don't think I use conntrack either, since /proc/net/ip_conntrack > > does not exist. > > You probably do, can you try "cat /proc/net/nf_conntrack" instead? > > (otherwise there should be no ipv4_conntrack_local() invocation > since we would not register this hook at all). > > I tried to reproduce this but so far I had no success. > If you can identify something that could give a hint when this > is happening (only once after boot, periodically, only with udp, etc) > please let us know. FWIW i managed to obtain a similar backtrace, but in that case it was a false positive (peeking at the address content showed it was my ssh connection to the vm and timeout and tcp conntrackk struct fields were changing; i.e. the nf_conn reported was still in the conntrack hash. Why this address was reported i do not know, afaik kmemleak does scan for addresses anywhere in the object (we use container_of() to get back nf_conn from the hlist_node), so it should have found the address linked via the main conntrack hash table. Right now I don't have enough info to dig any further, sorry :-/
Re: netfilter: xt_bpf: ABI issue in xt_bpf_info_v1?
On Wed, Sep 13, 2017 at 11:05 AM, Willem de Bruijnwrote: > On Wed, Sep 13, 2017 at 10:22 AM, Jan Engelhardt wrote: >> >> On Wednesday 2017-09-13 15:24, Shmulik Ladkani wrote: >>> >>>One way to fix is to have iptables open the object (using the stored >>>xt_bpf_info_v1->path), gaining a new process local fd for the object, >>>just after getting the rules from IPT_SO_GET_ENTRIES. >>>However we didn't see any other extensions doing something like that in >>>iptables. > > The binary should call bpf_obj_get on the filepath each time. These are > not regular files, but references to a pinned object in the bpf filesystem. > > Blindly passing back the fd received from the kernel is clearly wrong. I'm > really surprised that I did not run into this problem when I wrote the > feature. > >>> >>>Another way to solve is to fix the ABI (or have a v2 one), that does NOT >>>pass the fd from userspace, only the path of the pinned object. >>>Then, 'bpf_mt_check_v1' will open the file from the given path in order >>>to get the bpf_prog. >> >> But a path has a similar problem like a file descriptor - it is local to a >> certain mount namespace. > > Because these are pinned objects in the bpf filesystem, and there is > only one of those, it may be possible to lookup the object in the kernel > without relying on a process-local view of mount points. It would be preferable to fix this for the existing v1 users as well. That said, the new bpf identifier feature allows passing a globally unique id instead of a filepath. > >> >> To load "large" blobs into the kernel, a pointer to user memory is a possible >> option. The downside is that such extra data is not retrievable from the >> kernel >> via the iptables setsockopts anymore - one could work around it with procfs, >> or >> just let it be. >> >> https://sourceforge.net/p/xtables-addons/xtables-addons/ci/master/tree/extensions/xt_geoip.c >> line 64+.
Re: HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
Well, probably i am answering my own question, removing estimator from classes seems drastically improve situation. It seems estimator has some issues that cause shaper to behave incorrectly (throttling traffic while it should not). But i guess thats a bug? As i was not able to predict such bottleneck by CPU load measurements. On 2017-09-13 18:20, Denys Fedoryshchenko wrote: Hi, I noticed after increasing bandwidth over some amount HTB started to throttle classes it should not throttle. Also estimated rate in htb totally wrong, while byte counters is correct. Is there any overflow or something? X520 card (but XL710 same) br1 8000.90e2ba86c38c no eth3.1777 eth3.777 br2 8000.90e2ba86c38d no eth3.360 eth3.361 Inbound traffic is coming over one vlan, leaving another vlan. Shaper is just bunch of classes and u32 filters, with few fw filters. qdisc is pie I put totally high values to not reach them, tried to change quantum/burst/cburst but... stats below. First, "root" class is 1:1 showing rate 18086Mbit, which is physically impossible. Class 1:111 showing 5355Mbit, while real traffic is ~1.5Gbit shaper /etc # tc -s -d class show dev eth3.777 classid 1:111;sleep 5;tc -s -d class show dev eth3.777 classid 1:111 class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 6487632263 bytes 5235525 pkt (dropped 0, overlimits 0 requeues 0) rate 5529Mbit 557534pps backlog 0b 0p requeues 0 lended: 2423323 borrowed: 0 giants: 0 tokens: 124 ctokens: -1 class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 7438601731 bytes 6003811 pkt (dropped 0, overlimits 0 requeues 0) rate 5631Mbit 568214pps backlog 36624b 8p requeues 0 lended: 2772486 borrowed: 0 giants: 0 tokens: 124 ctokens: -1 (7438601731-6487632263)/5*8 = 1.521.551.148 And most important some classes suffering, while they should not (not reaching limits) class htb 1:95 parent 1:1 leaf 95: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 13556762059 bytes 17474559 pkt (dropped 16017, overlimits 0 requeues 0) rate 2524Mbit 414197pps backlog 31969245b 34513p requeues 0 lended: 13995723 borrowed: 0 giants: 0 tokens: 111 ctokens: -2 Full classes stats: class htb 1:100 parent 1:1 leaf 100: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 116 bytes 2 pkt (dropped 0, overlimits 0 requeues 0) rate 8bit 0pps backlog 0b 0p requeues 0 lended: 2 borrowed: 0 giants: 0 tokens: 124 ctokens: -1 class htb 1:120 parent 1:1 leaf 120: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 531230043 bytes 782130 pkt (dropped 0, overlimits 0 requeues 0) rate 132274Kbit 25240pps backlog 0b 0p requeues 0 lended: 540693 borrowed: 0 giants: 0 tokens: 109 ctokens: -2 class htb 1:50 parent 1:1 leaf 50: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 773472109 bytes 587335 pkt (dropped 0, overlimits 0 requeues 0) rate 215929Kbit 20503pps backlog 0b 0p requeues 0 lended: 216614 borrowed: 0 giants: 0 tokens: 91 ctokens: -4 class htb 1:70 parent 1:1 leaf 70: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 1574768 bytes 6194 pkt (dropped 0, overlimits 0 requeues 0) rate 406272bit 214pps backlog 0b 0p requeues 0 lended: 5758 borrowed: 0 giants: 0 tokens: 101 ctokens: -3 class htb 1:90 parent 1:1 leaf 90: prio 0 quantum 5 rate 1Kbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 3206 bytes 53 pkt (dropped 0, overlimits 0 requeues 0) rate 848bit 1pps backlog 0b 0p requeues 0 lended: 53 borrowed: 0 giants: 0 class htb 1:110 parent 1:1 leaf 110: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 17205952113 bytes 12926008 pkt (dropped 239, overlimits 0 requeues 0) rate 4433Mbit 416825pps backlog 5847785b 2394p requeues 0 lended: 7021696 borrowed: 0 giants: 0 tokens: 91 ctokens: -4 class htb 1:45 root leaf 45: prio 0 quantum 5 rate 80Mbit ceil 80Mbit linklayer ethernet burst 1b/1 mpu 0b cburst 1b/1 mpu 0b level 0 Sent 2586 bytes 45 pkt (dropped 0, overlimits 0 requeues 0) rate 456bit 1pps backlog 0b 0p requeues 0 lended: 45 borrowed: 0 giants: 0 tokens: 15540 ctokens: 15540 class htb 1:1 root rate 100Gbit ceil 100Gbit linklayer ethernet burst 0b/1 mpu 0b cburst 0b/1 mpu 0b level 7 Sent 72277215121 bytes
[patch net] net: sched: fix use-after-free in tcf_action_destroy and tcf_del_walker
From: Jiri PirkoRecent commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu") removed freeing in call_rcu, which changed already existing hard-to-hit race condition into 100% hit: [ 598.599825] BUG: unable to handle kernel NULL pointer dereference at 0030 [ 598.607782] IP: tcf_action_destroy+0xc0/0x140 Or: [ 40.858924] BUG: unable to handle kernel NULL pointer dereference at 0030 [ 40.862840] IP: tcf_generic_walker+0x534/0x820 Fix this by storing the ops and use them directly for module_put call. Fixes: a85a970af265 ("net_sched: move tc_action into tcf_common") Signed-off-by: Jiri Pirko --- net/sched/act_api.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/net/sched/act_api.c b/net/sched/act_api.c index fcd7dc7..da6fa82 100644 --- a/net/sched/act_api.c +++ b/net/sched/act_api.c @@ -180,7 +180,7 @@ static int tcf_del_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb, idr_for_each_entry_ext(idr, p, id) { ret = __tcf_idr_release(p, false, true); if (ret == ACT_P_DELETED) { - module_put(p->ops->owner); + module_put(ops->owner); n_i++; } else if (ret < 0) { goto nla_put_failure; @@ -514,13 +514,15 @@ EXPORT_SYMBOL(tcf_action_exec); int tcf_action_destroy(struct list_head *actions, int bind) { + const struct tc_action_ops *ops; struct tc_action *a, *tmp; int ret = 0; list_for_each_entry_safe(a, tmp, actions, list) { + ops = a->ops; ret = __tcf_idr_release(a, bind, true); if (ret == ACT_P_DELETED) - module_put(a->ops->owner); + module_put(ops->owner); else if (ret < 0) return ret; } -- 2.9.3
[PATCH stable-3.16 2/3] VSOCK: Fix lockdep issue.
From: Jorgen Hansencommit 8566b86ab9f0f45bc6f7dd422b21de9d0cf5415a upstream. The recent fix for the vsock sock_put issue used the wrong initializer for the transport spin_lock causing an issue when running with lockdep checking. Testing: Verified fix on kernel with lockdep enabled. Reviewed-by: Thomas Hellstrom Signed-off-by: Jorgen Hansen Signed-off-by: David S. Miller Signed-off-by: Michal Hocko --- net/vmw_vsock/vmci_transport.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c index aed136d27b01..314312272e08 100644 --- a/net/vmw_vsock/vmci_transport.c +++ b/net/vmw_vsock/vmci_transport.c @@ -1570,7 +1570,7 @@ static int vmci_transport_socket_init(struct vsock_sock *vsk, vmci_trans(vsk)->notify_ops = NULL; INIT_LIST_HEAD(_trans(vsk)->elem); vmci_trans(vsk)->sk = >sk; - vmci_trans(vsk)->lock = __SPIN_LOCK_UNLOCKED(vmci_trans(vsk)->lock); + spin_lock_init(_trans(vsk)->lock); if (psk) { vmci_trans(vsk)->queue_pair_size = vmci_trans(psk)->queue_pair_size; -- 2.14.1
[PATCH stable-3.16 1/3] VSOCK: sock_put wasn't safe to call in interrupt context
From: Jorgen Hansencommit 4ef7ea9195ea73262cd9730fb54e1eb726da157b upstream. In the vsock vmci_transport driver, sock_put wasn't safe to call in interrupt context, since that may call the vsock destructor which in turn calls several functions that should only be called from process context. This change defers the callling of these functions to a worker thread. All these functions were deallocation of resources related to the transport itself. Furthermore, an unused callback was removed to simplify the cleanup. Multiple customers have been hitting this issue when using VMware tools on vSphere 2015. Also added a version to the vmci transport module (starting from 1.0.2.0-k since up until now it appears that this module was sharing version with vsock that is currently at 1.0.1.0-k). Reviewed-by: Aditya Asarwade Reviewed-by: Thomas Hellstrom Signed-off-by: Jorgen Hansen Signed-off-by: David S. Miller Signed-off-by: Michal Hocko --- net/vmw_vsock/vmci_transport.c | 173 - net/vmw_vsock/vmci_transport.h | 4 +- 2 files changed, 86 insertions(+), 91 deletions(-) diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c index 9bb63ffec4f2..aed136d27b01 100644 --- a/net/vmw_vsock/vmci_transport.c +++ b/net/vmw_vsock/vmci_transport.c @@ -40,13 +40,11 @@ static int vmci_transport_recv_dgram_cb(void *data, struct vmci_datagram *dg); static int vmci_transport_recv_stream_cb(void *data, struct vmci_datagram *dg); -static void vmci_transport_peer_attach_cb(u32 sub_id, - const struct vmci_event_data *ed, - void *client_data); static void vmci_transport_peer_detach_cb(u32 sub_id, const struct vmci_event_data *ed, void *client_data); static void vmci_transport_recv_pkt_work(struct work_struct *work); +static void vmci_transport_cleanup(struct work_struct *work); static int vmci_transport_recv_listen(struct sock *sk, struct vmci_transport_packet *pkt); static int vmci_transport_recv_connecting_server( @@ -75,6 +73,10 @@ struct vmci_transport_recv_pkt_info { struct vmci_transport_packet pkt; }; +static LIST_HEAD(vmci_transport_cleanup_list); +static DEFINE_SPINLOCK(vmci_transport_cleanup_lock); +static DECLARE_WORK(vmci_transport_cleanup_work, vmci_transport_cleanup); + static struct vmci_handle vmci_transport_stream_handle = { VMCI_INVALID_ID, VMCI_INVALID_ID }; static u32 vmci_transport_qp_resumed_sub_id = VMCI_INVALID_ID; @@ -791,44 +793,6 @@ static int vmci_transport_recv_stream_cb(void *data, struct vmci_datagram *dg) return err; } -static void vmci_transport_peer_attach_cb(u32 sub_id, - const struct vmci_event_data *e_data, - void *client_data) -{ - struct sock *sk = client_data; - const struct vmci_event_payload_qp *e_payload; - struct vsock_sock *vsk; - - e_payload = vmci_event_data_const_payload(e_data); - - vsk = vsock_sk(sk); - - /* We don't ask for delayed CBs when we subscribe to this event (we -* pass 0 as flags to vmci_event_subscribe()). VMCI makes no -* guarantees in that case about what context we might be running in, -* so it could be BH or process, blockable or non-blockable. So we -* need to account for all possible contexts here. -*/ - local_bh_disable(); - bh_lock_sock(sk); - - /* XXX This is lame, we should provide a way to lookup sockets by -* qp_handle. -*/ - if (vmci_handle_is_equal(vmci_trans(vsk)->qp_handle, -e_payload->handle)) { - /* XXX This doesn't do anything, but in the future we may want -* to set a flag here to verify the attach really did occur and -* we weren't just sent a datagram claiming it was. -*/ - goto out; - } - -out: - bh_unlock_sock(sk); - local_bh_enable(); -} - static void vmci_transport_handle_detach(struct sock *sk) { struct vsock_sock *vsk; @@ -871,28 +835,38 @@ static void vmci_transport_peer_detach_cb(u32 sub_id, const struct vmci_event_data *e_data, void *client_data) { - struct sock *sk = client_data; + struct vmci_transport *trans = client_data; const struct vmci_event_payload_qp *e_payload; - struct vsock_sock *vsk; e_payload = vmci_event_data_const_payload(e_data); - vsk = vsock_sk(sk); - if
[PATCH stable-3.16 3/3] VSOCK: Detach QP check should filter out non matching QPs.
From: Jorgen Hansencommit 8ab18d71de8b07d2c4d6f984b718418c09ea45c5 upstream. The check in vmci_transport_peer_detach_cb should only allow a detach when the qp handle of the transport matches the one in the detach message. Testing: Before this change, a detach from a peer on a different socket would cause an active stream socket to register a detach. Reviewed-by: George Zhang Signed-off-by: Jorgen Hansen Signed-off-by: David S. Miller Signed-off-by: Michal Hocko --- net/vmw_vsock/vmci_transport.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c index 314312272e08..c69c990ec4a2 100644 --- a/net/vmw_vsock/vmci_transport.c +++ b/net/vmw_vsock/vmci_transport.c @@ -844,7 +844,7 @@ static void vmci_transport_peer_detach_cb(u32 sub_id, * qp_handle. */ if (vmci_handle_is_invalid(e_payload->handle) || - vmci_handle_is_equal(trans->qp_handle, e_payload->handle)) + !vmci_handle_is_equal(trans->qp_handle, e_payload->handle)) return; /* We don't ask for delayed CBs when we subscribe to this event (we @@ -2158,7 +2158,7 @@ module_exit(vmci_transport_exit); MODULE_AUTHOR("VMware, Inc."); MODULE_DESCRIPTION("VMCI transport for Virtual Sockets"); -MODULE_VERSION("1.0.2.0-k"); +MODULE_VERSION("1.0.3.0-k"); MODULE_LICENSE("GPL v2"); MODULE_ALIAS("vmware_vsock"); MODULE_ALIAS_NETPROTO(PF_VSOCK); -- 2.14.1
HTB going crazy over ~5Gbit/s (4.12.9, but problem present in older kernels as well)
Hi, I noticed after increasing bandwidth over some amount HTB started to throttle classes it should not throttle. Also estimated rate in htb totally wrong, while byte counters is correct. Is there any overflow or something? X520 card (but XL710 same) br1 8000.90e2ba86c38c no eth3.1777 eth3.777 br2 8000.90e2ba86c38d no eth3.360 eth3.361 Inbound traffic is coming over one vlan, leaving another vlan. Shaper is just bunch of classes and u32 filters, with few fw filters. qdisc is pie I put totally high values to not reach them, tried to change quantum/burst/cburst but... stats below. First, "root" class is 1:1 showing rate 18086Mbit, which is physically impossible. Class 1:111 showing 5355Mbit, while real traffic is ~1.5Gbit shaper /etc # tc -s -d class show dev eth3.777 classid 1:111;sleep 5;tc -s -d class show dev eth3.777 classid 1:111 class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 6487632263 bytes 5235525 pkt (dropped 0, overlimits 0 requeues 0) rate 5529Mbit 557534pps backlog 0b 0p requeues 0 lended: 2423323 borrowed: 0 giants: 0 tokens: 124 ctokens: -1 class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 7438601731 bytes 6003811 pkt (dropped 0, overlimits 0 requeues 0) rate 5631Mbit 568214pps backlog 36624b 8p requeues 0 lended: 2772486 borrowed: 0 giants: 0 tokens: 124 ctokens: -1 (7438601731-6487632263)/5*8 = 1.521.551.148 And most important some classes suffering, while they should not (not reaching limits) class htb 1:95 parent 1:1 leaf 95: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 13556762059 bytes 17474559 pkt (dropped 16017, overlimits 0 requeues 0) rate 2524Mbit 414197pps backlog 31969245b 34513p requeues 0 lended: 13995723 borrowed: 0 giants: 0 tokens: 111 ctokens: -2 Full classes stats: class htb 1:100 parent 1:1 leaf 100: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 116 bytes 2 pkt (dropped 0, overlimits 0 requeues 0) rate 8bit 0pps backlog 0b 0p requeues 0 lended: 2 borrowed: 0 giants: 0 tokens: 124 ctokens: -1 class htb 1:120 parent 1:1 leaf 120: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 531230043 bytes 782130 pkt (dropped 0, overlimits 0 requeues 0) rate 132274Kbit 25240pps backlog 0b 0p requeues 0 lended: 540693 borrowed: 0 giants: 0 tokens: 109 ctokens: -2 class htb 1:50 parent 1:1 leaf 50: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 773472109 bytes 587335 pkt (dropped 0, overlimits 0 requeues 0) rate 215929Kbit 20503pps backlog 0b 0p requeues 0 lended: 216614 borrowed: 0 giants: 0 tokens: 91 ctokens: -4 class htb 1:70 parent 1:1 leaf 70: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 1574768 bytes 6194 pkt (dropped 0, overlimits 0 requeues 0) rate 406272bit 214pps backlog 0b 0p requeues 0 lended: 5758 borrowed: 0 giants: 0 tokens: 101 ctokens: -3 class htb 1:90 parent 1:1 leaf 90: prio 0 quantum 5 rate 1Kbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 3206 bytes 53 pkt (dropped 0, overlimits 0 requeues 0) rate 848bit 1pps backlog 0b 0p requeues 0 lended: 53 borrowed: 0 giants: 0 class htb 1:110 parent 1:1 leaf 110: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 17205952113 bytes 12926008 pkt (dropped 239, overlimits 0 requeues 0) rate 4433Mbit 416825pps backlog 5847785b 2394p requeues 0 lended: 7021696 borrowed: 0 giants: 0 tokens: 91 ctokens: -4 class htb 1:45 root leaf 45: prio 0 quantum 5 rate 80Mbit ceil 80Mbit linklayer ethernet burst 1b/1 mpu 0b cburst 1b/1 mpu 0b level 0 Sent 2586 bytes 45 pkt (dropped 0, overlimits 0 requeues 0) rate 456bit 1pps backlog 0b 0p requeues 0 lended: 45 borrowed: 0 giants: 0 tokens: 15540 ctokens: 15540 class htb 1:1 root rate 100Gbit ceil 100Gbit linklayer ethernet burst 0b/1 mpu 0b cburst 0b/1 mpu 0b level 7 Sent 72277215121 bytes 72693012 pkt (dropped 0, overlimits 0 requeues 0) rate 18086Mbit 2304729pps backlog 0b 0p requeues 0 lended: 0 borrowed: 0 giants: 0 tokens: -4 ctokens: -4 class htb 1:111 parent 1:1 leaf 111: prio 0 quantum 5 rate 10Gbit ceil 100Gbit linklayer ethernet burst 1b/1 mpu 0b cburst 0b/1 mpu 0b level 0 Sent 21977384237 bytes
Re: scheduling while atomic from vmci_transport_recv_stream_cb in 3.16 kernels
On Wed 13-09-17 15:07:26, Jorgen S. Hansen wrote: > > > On Sep 12, 2017, at 11:08 AM, Michal Hockowrote: > > > > Hi, > > we are seeing the following splat with Debian 3.16 stable kernel > > > > BUG: scheduling while atomic: MATLAB/26771/0x0100 > > Modules linked in: veeamsnap(O) hmac cbc cts nfsv4 dns_resolver > > rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache > > sunrpc vmw_vso$ > > CPU: 0 PID: 26771 Comm: MATLAB Tainted: G O 3.16.0-4-amd64 #1 > > Debian 3.16.7-ckt20-1+deb8u3 > > Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference > > Platform, BIOS 6.00 09/21/2015 > > 88315c1e4c20 8150db3f 88193f803dc8 8150acdf > > 815103a2 00012f00 8819423dbfd8 00012f00 > > 88315c1e4c20 88193f803dc8 88193f803d50 88193f803dc0 > > Call Trace: > > [] ? dump_stack+0x41/0x51 > > [] ? __schedule_bug+0x48/0x55 > > [] ? __schedule+0x5d2/0x700 > > [] ? schedule_timeout+0x229/0x2a0 > > [] ? select_task_rq_fair+0x390/0x700 > > [] ? check_preempt_wakeup+0x120/0x1d0 > > [] ? wait_for_completion+0xa8/0x120 > > [] ? wake_up_state+0x10/0x10 > > [] ? call_rcu_bh+0x20/0x20 > > [] ? wait_rcu_gp+0x4b/0x60 > > [] ? ftrace_raw_output_rcu_utilization+0x40/0x40 > > [] ? vmci_event_unsubscribe+0x75/0xb0 [vmw_vmci] > > [] ? vmci_transport_destruct+0x1d/0xe0 > > [vmw_vsock_vmci_transport] > > [] ? vsock_sk_destruct+0x13/0x60 [vsock] > > [] ? __sk_free+0x1a/0x130 > > [] ? vmci_transport_recv_stream_cb+0x1e8/0x2d0 > > [vmw_vsock_vmci_transport] > > [] ? vmci_datagram_invoke_guest_handler+0xaa/0xd0 > > [vmw_vmci] > > [] ? vmci_dispatch_dgs+0xc1/0x200 [vmw_vmci] > > [] ? tasklet_action+0xf4/0x100 > > [] ? __do_softirq+0xf1/0x290 > > [] ? irq_exit+0x95/0xa0 > > [] ? do_IRQ+0x52/0xe0 > > [] ? common_interrupt+0x6d/0x6d > > > > AFAICS this has been fixed by 4ef7ea9195ea ("VSOCK: sock_put wasn't safe > > to call in interrupt context") but this patch hasn't been backported to > > stable trees. It applies cleanly on top of 3.16 stable tree but I am not > > familiar with the code to send the backport to the stable maintainer > > directly. > > > > Could you double check that the patch below (just a blind cherry-pick) > > is correct and it doesn't need additional patches on top? > > Hi, > > The patch below has been used to fix the above issue by other distros > - among them Redhat for the 3.10 kernel, so it should work for 3.16 as > well. Thanks for the confirmation. I do not see 4ef7ea9195ea ("VSOCK: sock_put wasn't safe to call in interrupt context") in 3.10 stable branch though. > In addition to the patch above, there are two other patches that > need to be applied on top for the fix to be correct: > > 8566b86ab9f0f45bc6f7dd422b21de9d0cf5415a "VSOCK: Fix lockdep issue." > > and > > 8ab18d71de8b07d2c4d6f984b718418c09ea45c5 "VSOCK: Detach QP check should > filter out non matching QPs." Good to know. I will send all three patches cherry-picked on top of the current 3.16 stable branch. Could you have a look please? -- Michal Hocko SUSE Labs
[PATCH net] be2net: fix TSO6/GSO issue causing TX-stall on Lancer/BEx
IPv6 TSO requests with extension hdrs are a problem to the Lancer and BEx chips. Workaround is to disable TSO6 feature for such packets. Also in Lancer chips, MSS less than 256 was resulting in TX stall. Fix this by disabling GSO when MSS less than 256. Signed-off-by: Suresh Reddy--- drivers/net/ethernet/emulex/benet/be.h | 8 drivers/net/ethernet/emulex/benet/be_main.c | 14 ++ 2 files changed, 22 insertions(+) diff --git a/drivers/net/ethernet/emulex/benet/be.h b/drivers/net/ethernet/emulex/benet/be.h index 674cf9d..8984c49 100644 --- a/drivers/net/ethernet/emulex/benet/be.h +++ b/drivers/net/ethernet/emulex/benet/be.h @@ -930,6 +930,14 @@ static inline bool is_ipv4_pkt(struct sk_buff *skb) return skb->protocol == htons(ETH_P_IP) && ip_hdr(skb)->version == 4; } +static inline bool is_ipv6_ext_hdr(struct sk_buff *skb) +{ + if (ip_hdr(skb)->version == 6) + return ipv6_ext_hdr(ipv6_hdr(skb)->nexthdr); + else + return false; +} + #define be_error_recovering(adapter) \ (adapter->flags & BE_FLAGS_TRY_RECOVERY) diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c index 319eee3..0e3d9f39 100644 --- a/drivers/net/ethernet/emulex/benet/be_main.c +++ b/drivers/net/ethernet/emulex/benet/be_main.c @@ -5089,6 +5089,20 @@ static netdev_features_t be_features_check(struct sk_buff *skb, struct be_adapter *adapter = netdev_priv(dev); u8 l4_hdr = 0; + if (skb_is_gso(skb)) { + /* IPv6 TSO requests with extension hdrs are a problem +* to Lancer and BE3 HW. Disable TSO6 feature. +*/ + if (!skyhawk_chip(adapter) && is_ipv6_ext_hdr(skb)) + features &= ~NETIF_F_TSO6; + + /* Lancer cannot handle the packet with MSS less than 256. +* Disable the GSO support in such cases +*/ + if (lancer_chip(adapter) && skb_shinfo(skb)->gso_size < 256) + features &= ~NETIF_F_GSO_MASK; + } + /* The code below restricts offload features for some tunneled and * Q-in-Q packets. * Offload features for normal (non tunnel) packets are unchanged. -- 1.8.3.1
Re: [PATCH 10/10] fs:btrfs: return -ENOMEM on allocation failure.
On Wed, Sep 13, 2017 at 01:02:19PM +0530, Allen Pais wrote: > Signed-off-by: Allen Pais> --- > fs/btrfs/check-integrity.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c > index 7d5a9b5..efa4c23 100644 > --- a/fs/btrfs/check-integrity.c > +++ b/fs/btrfs/check-integrity.c > @@ -2913,7 +2913,7 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info, > state = kvzalloc(sizeof(*state), GFP_KERNEL); > if (!state) { > pr_info("btrfs check-integrity: allocation failed!\n"); > - return -1; > + return -ENOMEM; Makes sense, also please fix the -1 a few lines below that also result from failed memory allocation, indirectly from btrfsic_dev_state_alloc(). > } > > if (!btrfsic_is_initialized) { > -- > 2.7.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: scheduling while atomic from vmci_transport_recv_stream_cb in 3.16 kernels
> On Sep 12, 2017, at 11:08 AM, Michal Hockowrote: > > Hi, > we are seeing the following splat with Debian 3.16 stable kernel > > BUG: scheduling while atomic: MATLAB/26771/0x0100 > Modules linked in: veeamsnap(O) hmac cbc cts nfsv4 dns_resolver > rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache > sunrpc vmw_vso$ > CPU: 0 PID: 26771 Comm: MATLAB Tainted: G O 3.16.0-4-amd64 #1 > Debian 3.16.7-ckt20-1+deb8u3 > Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference > Platform, BIOS 6.00 09/21/2015 > 88315c1e4c20 8150db3f 88193f803dc8 8150acdf > 815103a2 00012f00 8819423dbfd8 00012f00 > 88315c1e4c20 88193f803dc8 88193f803d50 88193f803dc0 > Call Trace: > [] ? dump_stack+0x41/0x51 > [] ? __schedule_bug+0x48/0x55 > [] ? __schedule+0x5d2/0x700 > [] ? schedule_timeout+0x229/0x2a0 > [] ? select_task_rq_fair+0x390/0x700 > [] ? check_preempt_wakeup+0x120/0x1d0 > [] ? wait_for_completion+0xa8/0x120 > [] ? wake_up_state+0x10/0x10 > [] ? call_rcu_bh+0x20/0x20 > [] ? wait_rcu_gp+0x4b/0x60 > [] ? ftrace_raw_output_rcu_utilization+0x40/0x40 > [] ? vmci_event_unsubscribe+0x75/0xb0 [vmw_vmci] > [] ? vmci_transport_destruct+0x1d/0xe0 > [vmw_vsock_vmci_transport] > [] ? vsock_sk_destruct+0x13/0x60 [vsock] > [] ? __sk_free+0x1a/0x130 > [] ? vmci_transport_recv_stream_cb+0x1e8/0x2d0 > [vmw_vsock_vmci_transport] > [] ? vmci_datagram_invoke_guest_handler+0xaa/0xd0 [vmw_vmci] > [] ? vmci_dispatch_dgs+0xc1/0x200 [vmw_vmci] > [] ? tasklet_action+0xf4/0x100 > [] ? __do_softirq+0xf1/0x290 > [] ? irq_exit+0x95/0xa0 > [] ? do_IRQ+0x52/0xe0 > [] ? common_interrupt+0x6d/0x6d > > AFAICS this has been fixed by 4ef7ea9195ea ("VSOCK: sock_put wasn't safe > to call in interrupt context") but this patch hasn't been backported to > stable trees. It applies cleanly on top of 3.16 stable tree but I am not > familiar with the code to send the backport to the stable maintainer > directly. > > Could you double check that the patch below (just a blind cherry-pick) > is correct and it doesn't need additional patches on top? Hi, The patch below has been used to fix the above issue by other distros - among them Redhat for the 3.10 kernel, so it should work for 3.16 as well. In addition to the patch above, there are two other patches that need to be applied on top for the fix to be correct: 8566b86ab9f0f45bc6f7dd422b21de9d0cf5415a "VSOCK: Fix lockdep issue." and 8ab18d71de8b07d2c4d6f984b718418c09ea45c5 "VSOCK: Detach QP check should filter out non matching QPs." Thanks, Jorgen > > Ben could you take this to your stable 3.16 branch if the patch is correct? > > I am CCing Sasha for 4.1 stable tree as well. I haven't checked whether > pre 3.16 kernels are affected as well. > --- > commit fac774c40b5c512113b6373cad498f35bee7a409 > Author: Jorgen Hansen > Date: Wed Oct 21 04:53:56 2015 -0700 > >VSOCK: sock_put wasn't safe to call in interrupt context > >commit 4ef7ea9195ea73262cd9730fb54e1eb726da157b upstream. > >In the vsock vmci_transport driver, sock_put wasn't safe to call >in interrupt context, since that may call the vsock destructor >which in turn calls several functions that should only be called >from process context. This change defers the callling of these >functions to a worker thread. All these functions were >deallocation of resources related to the transport itself. > >Furthermore, an unused callback was removed to simplify the >cleanup. > >Multiple customers have been hitting this issue when using >VMware tools on vSphere 2015. > >Also added a version to the vmci transport module (starting from >1.0.2.0-k since up until now it appears that this module was >sharing version with vsock that is currently at 1.0.1.0-k). > >Reviewed-by: Aditya Asarwade >Reviewed-by: Thomas Hellstrom >Signed-off-by: Jorgen Hansen >Signed-off-by: David S. Miller > > diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c > index 9bb63ffec4f2..aed136d27b01 100644 > --- a/net/vmw_vsock/vmci_transport.c > +++ b/net/vmw_vsock/vmci_transport.c > @@ -40,13 +40,11 @@ > > static int vmci_transport_recv_dgram_cb(void *data, struct vmci_datagram *dg); > static int vmci_transport_recv_stream_cb(void *data, struct vmci_datagram > *dg); > -static void vmci_transport_peer_attach_cb(u32 sub_id, > - const struct vmci_event_data *ed, > - void *client_data); > static void vmci_transport_peer_detach_cb(u32 sub_id, > const struct vmci_event_data *ed, > void *client_data); > static void vmci_transport_recv_pkt_work(struct work_struct *work); > +static
Re: netfilter: xt_bpf: ABI issue in xt_bpf_info_v1?
On Wed, Sep 13, 2017 at 10:22 AM, Jan Engelhardtwrote: > > On Wednesday 2017-09-13 15:24, Shmulik Ladkani wrote: >> >>One way to fix is to have iptables open the object (using the stored >>xt_bpf_info_v1->path), gaining a new process local fd for the object, >>just after getting the rules from IPT_SO_GET_ENTRIES. >>However we didn't see any other extensions doing something like that in >>iptables. The binary should call bpf_obj_get on the filepath each time. These are not regular files, but references to a pinned object in the bpf filesystem. Blindly passing back the fd received from the kernel is clearly wrong. I'm really surprised that I did not run into this problem when I wrote the feature. >> >>Another way to solve is to fix the ABI (or have a v2 one), that does NOT >>pass the fd from userspace, only the path of the pinned object. >>Then, 'bpf_mt_check_v1' will open the file from the given path in order >>to get the bpf_prog. > > But a path has a similar problem like a file descriptor - it is local to a > certain mount namespace. Because these are pinned objects in the bpf filesystem, and there is only one of those, it may be possible to lookup the object in the kernel without relying on a process-local view of mount points. > > To load "large" blobs into the kernel, a pointer to user memory is a possible > option. The downside is that such extra data is not retrievable from the > kernel > via the iptables setsockopts anymore - one could work around it with procfs, > or > just let it be. > > https://sourceforge.net/p/xtables-addons/xtables-addons/ci/master/tree/extensions/xt_geoip.c > line 64+.
Re: [PATCH 01/10] arch:powerpc: return -ENOMEM on failed allocation
On Wed, 2017-09-13 at 13:02 +0530, Allen Pais wrote: > Signed-off-by: Allen PaisI think the changelog for this series of conversions should show that you've validated the change by inspecting the return call chain at each modified line. Also, it seems you've cc'd the same mailing lists for all of the patches modified by this series. It would be better to individually address each patch in the series only cc'ing the appropriate maintainers and mailing lists. A cover letter would be good too. > --- > arch/powerpc/platforms/cell/spider-pci.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/platforms/cell/spider-pci.c > b/arch/powerpc/platforms/cell/spider-pci.c > index d1e61e2..82aa3f7 100644 > --- a/arch/powerpc/platforms/cell/spider-pci.c > +++ b/arch/powerpc/platforms/cell/spider-pci.c > @@ -106,7 +106,7 @@ static int __init spiderpci_pci_setup_chip(struct > pci_controller *phb, > dummy_page_va = kmalloc(PAGE_SIZE, GFP_KERNEL); > if (!dummy_page_va) { > pr_err("SPIDERPCI-IOWA:Alloc dummy_page_va failed.\n"); > - return -1; > + return -ENOMEM; > } > > dummy_page_da = dma_map_single(phb->parent, dummy_page_va, > @@ -137,7 +137,7 @@ int __init spiderpci_iowa_init(struct iowa_bus *bus, void > *data) > if (!priv) { > pr_err("SPIDERPCI-IOWA:" > "Can't allocate struct spiderpci_iowa_private"); > - return -1; > + return -ENOMEM; > } > > if (of_address_to_resource(np, 0, )) {
Re: [PATCH] dt-bindings: net: renesas-ravb: Add support for R8A77995 RAVB
On Wed, Sep 13, 2017 at 2:17 PM, Yoshihiro Shimodawrote: > Add a new compatible string for the R8A77995 (R-Car D3) RAVB. > > Signed-off-by: Yoshihiro Shimoda Acked-by: Geert Uytterhoeven Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH] tcp: TCP_USER_TIMEOUT can not work in tcp_probe_timer()
On Wed, 2017-09-13 at 15:15 +0800, liujian wrote: > > 在 2017/9/13 14:56, liujian 写道: > > > > > > 在 2017/9/12 23:38, Eric Dumazet 写道: > >> On Tue, 2017-09-12 at 08:05 -0700, Eric Dumazet wrote: > >>> On Tue, 2017-09-12 at 14:08 +0800, liujian wrote: > Hi, > > In the scenario, tcp server side IP changed, and at that memont, > userspace application still send data continuously; > tcp_send_head(sk)'s timestamp always be refreshed. > > Here is the packetdrill script: > > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 > +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 > +0 bind(3, ..., ...) = 0 > +0 listen(3, 1) = 0 > > +0 < S 0:0(0) win 0 > +0 > S. 0:0(0) ack 1 > > +.1 < . 1:1(0) ack 1 win 65530 > +0 accept(3, ..., ...) = 4 > > +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0 > +0 write(4, ..., 24) = 24 > +0 > P. 1:25(24) ack 1 win 229 > +.1 < . 1:1(0) ack 25 win 65530 > > //change the ipaddress > +1 `ifconfig tun0 192.168.0.10/16` > > +1 write(4, ..., 24) = 24 > +1 write(4, ..., 24) = 24 > +1 write(4, ..., 24) = 24 > +1 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > +3 write(4, ..., 24) = 24 > > +0 `ifconfig tun0 192.168.0.1/16` > +0 < . 1:1(0) ack 1 win 1000 > +0 write(4, ..., 24) = -1 > > > >>> > >>> This has nothing to do with the code patch you have changed. > >>> > >>> How have you tested your patch exactly ? > >>> > > I tested the patch, it can work. > > > > [root@localhost ~]# time ./gtests/net/packetdrill/packetdrill test.pkt > > test.pkt:24: runtime error in write call: Expected result 24 but got -1 > > with errno 110 (Connection timed out) > > > > real0m5.356s > > user0m0.026s > > sys 0m0.104s > > > > [root@localhost ~]# ss -toenmi src :8080 > > State Recv-Q Send-Q Local Address:Port > > Peer Address:Port > > ESTAB 0 48 192.168.0.1:8080 > >192.0.2.1:39559 timer:(persist,186ms,2) > > ino:37178 sk:6 <-> > > skmem:(r0,rb369280,t0,tb87040,f1792,w2304,o0,bl0) sack cubic > > wscale:7,7 rto:301 backoff:2 rtt:100.253/37.643 mss:1460 cwnd:10 > > bytes_acked:24 segs_out:5 segs_in:3 send 1.2Mbps lastsnd:976 lastrcv:4082 > > lastack:3982 pacing_rate 2.3Mbps rcv_space:29200 > > > > > > if change the TCP_USER_TIMEOUT to 30s, test result as below: > > > > [root@localhost ~]# time ./gtests/net/packetdrill/packetdrill test.pkt > > test.pkt:37: runtime error in write call: Expected result 24 but got -1 > > with errno 110 (Connection timed out) > > > > real0m44.362s > > user0m0.018s > > sys 0m0.110s > > > > ESTAB 0 360 192.168.0.1:8080 > > 192.0.2.1:47577 timer:(persist,516ms,6) > > ino:18806 sk:5 <-> > > skmem:(r0,rb369280,t0,tb87040,f1792,w2304,o0,bl0) sack cubic > > wscale:7,7 rto:301 backoff:6 rtt:100.228/37.623 mss:10 bytes_acked:24 > > segs_out:22 segs_in:3 send 1.2Mbps lastsnd:2343 lastrcv:40450 lastack:40350 > > pacing_rate 2.3Mbps rcv_sp > > > >> > >> lpaa23:~# ss -toenmi src :8080 > >> State Recv-Q Send-Q Local Address:Port Peer > >> Address:Port > >> ESTAB 0 144192.168.134.161:8080 > >> 192.0.2.1:51165 timer:(persist,8.262ms,5) ino:1 > >> 82083 sk:3 <-> > >> skmem:(r0,rb359040,t0,tb46080,f1792,w2304,o0,bl0,d0) sack cubic > >> wscale:7,8 rto:301 backoff:5 rtt:100.127/37.576 > >> mss:1460 rcvmss:536 advmss:1460 cwnd:10 bytes_acked:24 segs_out:12 > >> segs_in:3 data_segs_out:12 send 1.2Mbps lastsnd:1370 l > >> astrcv:13348 lastack:13248 pacing_rate 2.3Mbps delivery_rate 116.7Kbps > >> app_limited busy:11346ms rcv_space:29200 notsent:1 > >> 44 minrtt:100.043userspace application still send data continuously > >> > >> This is the typical RTO timer, not zero window probe. > >> > > with the script, it is not zero window; but the code enter tcp_probe_timer. > > ->tcp_sendmsg > > -->tcp_push > >
Re: netfilter: xt_bpf: ABI issue in xt_bpf_info_v1?
On Wednesday 2017-09-13 15:24, Shmulik Ladkani wrote: > >One way to fix is to have iptables open the object (using the stored >xt_bpf_info_v1->path), gaining a new process local fd for the object, >just after getting the rules from IPT_SO_GET_ENTRIES. >However we didn't see any other extensions doing something like that in >iptables. > >Another way to solve is to fix the ABI (or have a v2 one), that does NOT >pass the fd from userspace, only the path of the pinned object. >Then, 'bpf_mt_check_v1' will open the file from the given path in order >to get the bpf_prog. But a path has a similar problem like a file descriptor - it is local to a certain mount namespace. To load "large" blobs into the kernel, a pointer to user memory is a possible option. The downside is that such extra data is not retrievable from the kernel via the iptables setsockopts anymore - one could work around it with procfs, or just let it be. https://sourceforge.net/p/xtables-addons/xtables-addons/ci/master/tree/extensions/xt_geoip.c line 64+.
Re: [PATCH net] sctp: potential read out of bounds in sctp_ulpevent_type_enabled()
On Wed, Sep 13, 2017 at 12:20:28PM +0300, Dan Carpenter wrote: > This code causes a static checker warning because Smatch doesn't trust > anything that comes from skb->data. I've reviewed this code and I do > think skb->data can be controlled by the user here. > > The sctp_event_subscribe struct has 13 __u8 fields and we want to see > if ours is non-zero. sn_type can be any value in the 0-USHRT_MAX range. > We're subtracting SCTP_SN_TYPE_BASE which is 1 << 15 so we could read > either before the start of the struct or after the end. > > This is a very old bug and it's surprising that it would go undetected > for so long but my theory is that it just doesn't have a big impact so > it would be hard to notice. > > Signed-off-by: Dan Carpenter> --- > I'm not terribly familiar with sctp and this is a static checker fix. > Please review it carefully. > > diff --git a/include/net/sctp/ulpevent.h b/include/net/sctp/ulpevent.h > index 1060494ac230..e6873176bea7 100644 > --- a/include/net/sctp/ulpevent.h > +++ b/include/net/sctp/ulpevent.h > @@ -154,7 +154,11 @@ static inline int sctp_ulpevent_type_enabled(__u16 > sn_type, >struct sctp_event_subscribe *mask) > { > char *amask = (char *) mask; > - return amask[sn_type - SCTP_SN_TYPE_BASE]; > + int offset = sn_type - SCTP_SN_TYPE_BASE; > + > + if (offset >= sizeof(struct sctp_event_subscribe)) > + return 0; > + return amask[offset]; > } Acked-by: Neil Horman > > /* Given an event subscription, is this event enabled? */ > -- > To unsubscribe from this list: send the line "unsubscribe linux-sctp" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
Re: [oss-drivers] Re: [PATCH/RFC net-next 2/2] net/sched: allow flower to match tunnel options
On Wed, Sep 13, 2017 at 03:38:01PM +0300, Or Gerlitz wrote: > On Wed, Sep 13, 2017 at 2:59 PM, Simon Horman >wrote: > > On Wed, Sep 13, 2017 at 01:03:43PM +0300, Or Gerlitz wrote: > >> On Wed, Sep 13, 2017 at 12:25 PM, Simon Horman > >> wrote: > >> > On Tue, Sep 12, 2017 at 11:23:55PM +0300, Or Gerlitz wrote: > >> >> On Tue, Sep 12, 2017 at 5:20 PM, Simon Horman > >> >> wrote: > >> >> > Allow matching on options in tunnel headers. > >> >> > This makes use of existing tunnel metadata support. > >> >> > >> >> Simon, > >> >> > >> >> This patch is about matching on tunnel options, right? but > >> >> > >> >> > Options are a bytestring of up to 256 bytes. > >> >> > Tunnel implementations may support less or more options, > >> >> > or no options at all. > >> >> > > >> >> > # ip link add name geneve0 type geneve dstport 0 external > >> >> > # tc qdisc add dev eth0 ingress > >> >> > # tc qdisc del dev eth0 ingress; tc qdisc add dev eth0 ingress > >> >> > # tc filter add dev eth0 protocol ip parent : \ > >> >> > flower indev eth0 \ > >> >> > ip_proto udp \ > >> >> > action tunnel_key \ > >> >> > set src_ip 10.0.99.192 \ > >> >> > dst_ip 10.0.99.193 \ > >> >> > dst_port 4789 \ > >> >> > id 11 \ > >> >> > opts 0102800100800022 \ > >> >> > action mirred egress redirect dev geneve0 > >> >> > >> >> the example here is on how to use tunnel options in the tunnel set key > >> >> actions.. > >> >> > >> >> And the other way around in the other patch... the patch is about the > >> >> tunnel key set action and the example shows how to match that in > >> >> flower... I guess you want to swap the relevant of the change log. > >> > > >> > Yes, it seems so. Sorry about that. > >> > >> no worries, you can do the swap, but before that > >> > >> >> Anyway, is there any human readable/understandable representation of > >> >> these options? e.g what does 0102800100800022 means for geneve? > >> > >> > Thanks, I had not thought of that. My feeling is that could > >> > be added to the tc tool as follow-up work. > >> > >> could you spend few words on the nature of these options now when we review > >> the kernel patches? I guess this is somehow related to protocols such > >> as geneve and vxlan-gpe -- it would be good if you elaborate on that > >> a bit, does > >> the kernel does any usage with these options beyond matching on them or > >> stiching > >> them to packet headers? > > > > This feature is to be used in conjunction with tunnels in collect metadata > > (external) mode. As I understand it there are three tunnel netdevs that use > > options metadata in the kernel at this time. > > > > * Geneve > > > > In the case of Geneve options are TLVs[1]. My reading is that in collect > > metadata mode the kernel does not appear to do anything other than pass > > them around as a bytestring. > > > > [1] https://tools.ietf.org/html/draft-ietf-nvo3-geneve-05#section-3.5 > > > > * VXLAN-GBP > > > > In the case of VXLAN-GBP on RX in collect metadata mode options are used > > to carry information parsed in vxlan_parse_gbp_hdr() from the VXLAN Group > > Based Policy Extension[2]. On RX the options data is used to create an > > extension (header) by vxlan_build_gbp_hdr(). > > > > [2] > > https://tools.ietf.org/html/draft-smith-vxlan-group-policy-03#section-2.1 > > > > * ERSPAN (GRE) > > > > In the case of ERSPAN, which is a variant of GRE, on RX in collect > > metadata mode options are used to carry the index parsed from the ERSPAN > > Type II feature header[3] in erspan_rcv(). The converse is true on TX > > and is handled by erspan_fb_xmit(). > > > > [3] https://tools.ietf.org/html/draft-foschiano-erspan-03#section-4.2 > > > > Users of options: > > > > * There are eBPF hooks to allow getting on and setting tunnel metadata: > > bpf_skb_set_tunnel_opt, bpf_skb_get_tunnel_opt. > > > > * Open vSwitch is able to match and set Geneve and VXLAN-GBP options. > > > > Neither of the above appear to assume any structure for the data. > > > > I hope the above is the kind of information you are after. > > yeah, this helps, do we expect HW offloads to be able to work with > these options? e.g place/match on packets > or even do something beyond that? Perhaps unsurprisingly given my posting I would like to offload "place/match" of options. I do not currently have any plans beyond that and my crystal ball broke some time ago.
Re: [RFC PATCH v3 7/7] i40e: Enable cloud filters via tc-flower
Wed, Sep 13, 2017 at 11:59:50AM CEST, amritha.namb...@intel.com wrote: >This patch enables tc-flower based hardware offloads. tc flower >filter provided by the kernel is configured as driver specific >cloud filter. The patch implements functions and admin queue >commands needed to support cloud filters in the driver and >adds cloud filters to configure these tc-flower filters. > >The only action supported is to redirect packets to a traffic class >on the same device. So basically you are not doing redirect, you are just setting tclass for matched packets, right? Why you use mirred for this? I think that you might consider extending g_act for that: # tc filter add dev eth0 protocol ip ingress \ prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw \ action tclass 0 > ># tc qdisc add dev eth0 ingress ># ethtool -K eth0 hw-tc-offload on > ># tc filter add dev eth0 protocol ip parent :\ > prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\ > action mirred ingress redirect dev eth0 tclass 0 > ># tc filter add dev eth0 protocol ip parent :\ > prio 2 flower dst_ip 192.168.3.5/32\ > ip_proto udp dst_port 25 skip_sw\ > action mirred ingress redirect dev eth0 tclass 1 > ># tc filter add dev eth0 protocol ipv6 parent :\ > prio 3 flower dst_ip fe8::200:1\ > ip_proto udp dst_port 66 skip_sw\ > action mirred ingress redirect dev eth0 tclass 1 > >Delete tc flower filter: >Example: > ># tc filter del dev eth0 parent : prio 3 handle 0x1 flower ># tc filter del dev eth0 parent : > >Flow Director Sideband is disabled while configuring cloud filters >via tc-flower and until any cloud filter exists. > >Unsupported matches when cloud filters are added using enhanced >big buffer cloud filter mode of underlying switch include: >1. source port and source IP >2. Combined MAC address and IP fields. >3. Not specifying L4 port > >These filter matches can however be used to redirect traffic to >the main VSI (tc 0) which does not require the enhanced big buffer >cloud filter support. > >v3: Cleaned up some lengthy function names. Changed ipv6 address to >__be32 array instead of u8 array. Used macro for IP version. Minor >formatting changes. >v2: >1. Moved I40E_SWITCH_MODE_MASK definition to i40e_type.h >2. Moved dev_info for add/deleting cloud filters in else condition >3. Fixed some format specifier in dev_err logs >4. Refactored i40e_get_capabilities to take an additional > list_type parameter and use it to query device and function > level capabilities. >5. Fixed parsing tc redirect action to check for the is_tcf_mirred_tc() > to verify if redirect to a traffic class is supported. >6. Added comments for Geneve fix in cloud filter big buffer AQ > function definitions. >7. Cleaned up setup_tc interface to rebase and work with Jiri's > updates, separate function to process tc cls flower offloads. >8. Changes to make Flow Director Sideband and Cloud filters mutually > exclusive. > >Signed-off-by: Amritha Nambiar>Signed-off-by: Kiran Patil >Signed-off-by: Anjali Singhai Jain >Signed-off-by: Jingjing Wu >--- > drivers/net/ethernet/intel/i40e/i40e.h | 49 + > drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h |3 > drivers/net/ethernet/intel/i40e/i40e_common.c | 189 > drivers/net/ethernet/intel/i40e/i40e_main.c| 971 +++- > drivers/net/ethernet/intel/i40e/i40e_prototype.h | 16 > drivers/net/ethernet/intel/i40e/i40e_type.h|1 > .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h|3 > 7 files changed, 1202 insertions(+), 30 deletions(-) > >diff --git a/drivers/net/ethernet/intel/i40e/i40e.h >b/drivers/net/ethernet/intel/i40e/i40e.h >index 6018fb6..b110519 100644 >--- a/drivers/net/ethernet/intel/i40e/i40e.h >+++ b/drivers/net/ethernet/intel/i40e/i40e.h >@@ -55,6 +55,8 @@ > #include > #include > #include >+#include >+#include > #include "i40e_type.h" > #include "i40e_prototype.h" > #include "i40e_client.h" >@@ -252,9 +254,52 @@ struct i40e_fdir_filter { > u32 fd_id; > }; > >+#define IPV4_VERSION 4 >+#define IPV6_VERSION 6 >+ >+#define I40E_CLOUD_FIELD_OMAC 0x01 >+#define I40E_CLOUD_FIELD_IMAC 0x02 >+#define I40E_CLOUD_FIELD_IVLAN0x04 >+#define I40E_CLOUD_FIELD_TEN_ID 0x08 >+#define I40E_CLOUD_FIELD_IIP 0x10 >+ >+#define I40E_CLOUD_FILTER_FLAGS_OMAC I40E_CLOUD_FIELD_OMAC >+#define I40E_CLOUD_FILTER_FLAGS_IMAC I40E_CLOUD_FIELD_IMAC >+#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN(I40E_CLOUD_FIELD_IMAC | \ >+ I40E_CLOUD_FIELD_IVLAN) >+#define I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID (I40E_CLOUD_FIELD_IMAC | \ >+ I40E_CLOUD_FIELD_TEN_ID) >+#define I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC (I40E_CLOUD_FIELD_OMAC | \ >+I40E_CLOUD_FIELD_IMAC | \ >+
netfilter: xt_bpf: ABI issue in xt_bpf_info_v1?
Hi, Commit 2c16d60 'netfilter: xt_bpf: support ebpf' introduced 'xt_bpf_info_v1', to support attaching an eBPF object by fd. Alas, seems this ABI is problematic, as the 'fd', which is local to the process attaching the ebpf object (namely iptables) is stored in the matchinfo structure. This leads to subsequent invocation of iptables to fail: # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT # iptables -A INPUT -s 5.6.7.8 -j ACCEPT iptables: Invalid argument. Run `dmesg' for more information. Tracing 2nd invocation shows: setsockopt(4, SOL_IP, IPT_SO_SET_REPLACE, "filter\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 5400) = -1 EINVAL In the 2nd invocation, iptables loads existing rules from the kernel using IPT_SO_GET_ENTRIES, adds the new innocent one, and then issues IPT_SO_SET_REPLACE. However the existing rules contain the former ebpf rule which contains the fd number from the initial 'iptables -m bpf' invocation. Thus, when IPT_SO_SET_REPLACE eventually hits 'bpf_mt_check_v1', and 'bpf_prog_get_type' is called, the arbitrary fd has either no associated file, or has no associated bpf_prog - leading 'bpf_prog_get_type' to fail. One way to fix is to have iptables open the object (using the stored xt_bpf_info_v1->path), gaining a new process local fd for the object, just after getting the rules from IPT_SO_GET_ENTRIES. However we didn't see any other extensions doing something like that in iptables. Another way to solve is to fix the ABI (or have a v2 one), that does NOT pass the fd from userspace, only the path of the pinned object. Then, 'bpf_mt_check_v1' will open the file from the given path in order to get the bpf_prog. Please advise. Best, Shmulik
Re: [RFC PATCH v3 2/7] sched: act_mirred: Traffic class option for mirror/redirect action
Wed, Sep 13, 2017 at 11:59:24AM CEST, amritha.namb...@intel.com wrote: >Adds optional traffic class parameter to the mirror/redirect action. >The mirror/redirect action is extended to forward to a traffic >class on the device if the traffic class index is provided in >addition to the device's ifindex. Do I understand it correctly that you just abuse mirred to pas tcclass index down to the driver, without actually doing anything with the value inside mirred-code ? That is a bit confusing for me.