date:20170506

Re: net/smc and the RDMA core

2017-05-06 Thread h...@lst.de

On Fri, May 05, 2017 at 11:10:17AM -0600, Jason Gunthorpe wrote:
> I recommend immediately sending a kconfig patch cc'd to stable making
> SMC require CONFIG_BROKEN so that nobody inadvertently turns it on.

Yes, I'll send the patch.

[PATCH net] cxgb4: avoid disabling FEC by default

2017-05-06 Thread Ganesh Goudar

Recent Chelsio firmware started using a few port capability bits to
manage FEC. Because the driver was not aware of this change, it zeroed
those bits, which disabled FEC.

Avoid zeroing those bits and default to whatever the firmware
tells us the link is currently advertising.

Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h|  9 +++
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c| 38 ++-
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h |  6 ++---
 3 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 163543b..862e008 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -108,6 +108,12 @@ enum {
PAUSE_AUTONEG = 1 << 2
 };
 
+enum {
+   FEC_AUTO  = 1 << 0,  /* IEEE 802.3 "automatic" */
+   FEC_RS= 1 << 1,  /* Reed-Solomon */
+   FEC_BASER_RS  = 1 << 2   /* BaseR/Reed-Solomon */
+};
+
 struct port_stats {
u64 tx_octets;/* total # of octets in good frames */
u64 tx_frames;/* all good frames */
@@ -432,6 +438,9 @@ struct link_config {
unsigned int   speed;/* actual link speed */
unsigned char  requested_fc; /* flow control user has requested */
unsigned char  fc;   /* actual link flow control */
+   unsigned char  auto_fec; /* Forward Error Correction: */
+   unsigned char  requested_fec;/* "automatic" (IEEE 802.3), */
+   unsigned char  fec;  /* requested, and actual in use */
unsigned char  autoneg;  /* autonegotiating? */
unsigned char  link_ok;  /* link up? */
unsigned char  link_down_rc; /* link down reason */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c 
b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index 0de8eb7..aded42b96 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -3707,7 +3707,8 @@ int t4_link_l1cfg(struct adapter *adap, unsigned int 
mbox, unsigned int port,
  struct link_config *lc)
 {
struct fw_port_cmd c;
-   unsigned int fc = 0, mdi = FW_PORT_CAP_MDI_V(FW_PORT_CAP_MDI_AUTO);
+   unsigned int mdi = FW_PORT_CAP_MDI_V(FW_PORT_CAP_MDI_AUTO);
+   unsigned int fc = 0, fec = 0, fw_fec = 0;
 
lc->link_ok = 0;
if (lc->requested_fc & PAUSE_RX)
@@ -3715,6 +3716,13 @@ int t4_link_l1cfg(struct adapter *adap, unsigned int 
mbox, unsigned int port,
if (lc->requested_fc & PAUSE_TX)
fc |= FW_PORT_CAP_FC_TX;
 
+   fec = lc->requested_fec & FEC_AUTO ? lc->auto_fec : lc->requested_fec;
+
+   if (fec & FEC_RS)
+   fw_fec |= FW_PORT_CAP_FEC_RS;
+   if (fec & FEC_BASER_RS)
+   fw_fec |= FW_PORT_CAP_FEC_BASER_RS;
+
memset(&c, 0, sizeof(c));
c.op_to_portid = cpu_to_be32(FW_CMD_OP_V(FW_PORT_CMD) |
 FW_CMD_REQUEST_F | FW_CMD_EXEC_F |
@@ -3725,13 +3733,15 @@ int t4_link_l1cfg(struct adapter *adap, unsigned int 
mbox, unsigned int port,
 
if (!(lc->supported & FW_PORT_CAP_ANEG)) {
c.u.l1cfg.rcap = cpu_to_be32((lc->supported & ADVERT_MASK) |
-fc);
+fc | fw_fec);
lc->fc = lc->requested_fc & (PAUSE_RX | PAUSE_TX);
} else if (lc->autoneg == AUTONEG_DISABLE) {
-   c.u.l1cfg.rcap = cpu_to_be32(lc->requested_speed | fc | mdi);
+   c.u.l1cfg.rcap = cpu_to_be32(lc->requested_speed | fc |
+fw_fec | mdi);
lc->fc = lc->requested_fc & (PAUSE_RX | PAUSE_TX);
} else
-   c.u.l1cfg.rcap = cpu_to_be32(lc->advertising | fc | mdi);
+   c.u.l1cfg.rcap = cpu_to_be32(lc->advertising | fc |
+fw_fec | mdi);
 
return t4_wr_mbox(adap, mbox, &c, sizeof(c), NULL);
 }
@@ -7407,13 +7417,26 @@ static void get_pci_mode(struct adapter *adapter, 
struct pci_params *p)
  * Initializes the SW state maintained for each link, including the link's
  * capabilities and default speed/flow-control/autonegotiation settings.
  */
-static void init_link_config(struct link_config *lc, unsigned int caps)
+static void init_link_config(struct link_config *lc, unsigned int pcaps,
+unsigned int acaps)
 {
-   lc->supported = caps;
+   lc->supported = pcaps;
lc->lp_advertising = 0;
lc->requested_speed = 0;
lc->speed = 0;
lc->requested_fc = lc->fc = PAUSE_RX | PAUSE_TX;
+   lc->auto_fec = 0;
+
+   /* For Forward Error Control, we default to whatever the Firmware
+* tells us the Link is currently advertising.
+*/
+   if (acaps & F
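
To illustrate the selection logic in the t4_link_l1cfg() hunk above, here is
a minimal userspace sketch. It is not the driver code: the demo_ names are
hypothetical, and the two DEMO_FW_CAP_FEC_* bit positions are placeholders,
because the real FW_PORT_CAP_FEC_* values from t4fw_api.h are not shown in
this excerpt. Only the FEC_* enum is copied from the patch.

```c
/* Driver-side FEC flags, as defined in the patch's cxgb4.h hunk. */
enum {
	FEC_AUTO     = 1 << 0,	/* IEEE 802.3 "automatic" */
	FEC_RS       = 1 << 1,	/* Reed-Solomon */
	FEC_BASER_RS = 1 << 2	/* BaseR/Reed-Solomon */
};

/* ASSUMPTION: placeholder bit positions, chosen only for this sketch.
 * The real FW_PORT_CAP_FEC_* values are not visible in the excerpt.
 */
#define DEMO_FW_CAP_FEC_RS       (1u << 10)
#define DEMO_FW_CAP_FEC_BASER_RS (1u << 11)

/* Hypothetical cut-down version of struct link_config. */
struct demo_link_config {
	unsigned char auto_fec;		/* what the firmware advertises */
	unsigned char requested_fec;	/* what the user asked for */
};

/* Pick the effective FEC setting and translate it to firmware bits. */
unsigned int demo_fec_to_fw_caps(const struct demo_link_config *lc)
{
	unsigned int fec, fw_fec = 0;

	/* "Automatic" defers to the firmware-advertised setting. */
	fec = (lc->requested_fec & FEC_AUTO) ? lc->auto_fec
					     : lc->requested_fec;

	if (fec & FEC_RS)
		fw_fec |= DEMO_FW_CAP_FEC_RS;
	if (fec & FEC_BASER_RS)
		fw_fec |= DEMO_FW_CAP_FEC_BASER_RS;

	return fw_fec;
}
```

With requested_fec set to FEC_AUTO and auto_fec set to FEC_RS, the helper
returns only the Reed-Solomon placeholder bit; with requested_fec set to 0
it returns 0, which corresponds to the old behaviour of sending no FEC bits.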

Re: [PATCH net-next 4/5] dsa: Microchip KSZ switches SPI devicetree configuration

2017-05-06 Thread Sergei Shtylyov


Hello!

On 5/6/2017 2:18 AM, woojung@microchip.com wrote:


From: Woojung Huh 

A sample SPI configuration for Microchip KSZ switches.

Signed-off-by: Woojung Huh 
---
 Documentation/devicetree/bindings/net/dsa/ksz.txt | 73 +++
 1 file changed, 73 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/dsa/ksz.txt

diff --git a/Documentation/devicetree/bindings/net/dsa/ksz.txt 
b/Documentation/devicetree/bindings/net/dsa/ksz.txt
new file mode 100644
index 000..9cca7d4
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/dsa/ksz.txt
@@ -0,0 +1,73 @@
+Microchip KSZ Series Ethernet switches
+==
+
+Required properties:
+
+- compatible: For external switch chips, compatible string must be exactly one
+  of: "microchip,ksz9477"
+
+See Documentation/devicetree/bindings/dsa/dsa.txt for a list of additional
+required and optional properties.
+
+Examples:
+
+Ethernet switch connected via SPI to the host, CPU port wired to eth0:
+
+   eth0: ethernet@10001000 {
+   fixed-link {
+   reg = <7>
+   speed = <1000>;
+   duplex-full;
+   };
+   };
+
+   spi1: spi@f8008000 {
+   pinctrl-0 = <&pinctrl_spi_ksz>;
+   cs-gpios = <&pioC 25 0>;
+   id = <1>;
+   status = "okay";


   4 lines above should be indented more to the right.


+
+   ksz9477: ksz9477@0 {
+   compatible = "microchip,ksz9477";
+   reg = <0>;
+
+   spi-max-frequency = <4400>;
+   spi-cpha;
+   spi-cpol;
+
+   status = "okay";
+   ports {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   port@0 {
+   reg = <0>;
+   label = "lan1";
+   };
+   port@1 {
+   reg = <1>;
+   label = "lan2";
+   };
+   port@2 {
+   reg = <2>;
+   label = "lan3";
+   };
+   port@3 {
+   reg = <3>;
+   label = "lan4";
+   };
+   port@4 {
+   reg = <4>;
+   label = "lan5";
+   };
+   port@5 {
+   reg = <5>;
+   label = "cpu";
+   ethernet = <&macb0>;
+   fixed-link {
+   speed = <1000>;
+   full-duplex;
+   };
+   };
+   };
+   };
+   };


   Unmatched }?

MBR, Sergei

Re: [RFC] iproute: Add support for extended ack to rtnl_talk

2017-05-06 Thread Jiri Pirko

Thu, May 04, 2017 at 07:55:56PM CEST, l...@kernel.org wrote:
>On Thu, May 04, 2017 at 09:45:58AM -0700, Stephen Hemminger wrote:
>> On Thu, 4 May 2017 17:37:38 +0300
>> Leon Romanovsky  wrote:
>>
>> > On Thu, May 04, 2017 at 11:36:36AM +0200, Daniel Borkmann wrote:
>> > > On 05/04/2017 01:56 AM, Stephen Hemminger wrote:
>> > > > Add support for extended ack error reporting via libmnl. Using an
>> > > > existing library is a better alternative than copy/pasting
>> > > > code from the kernel. Also make arguments const where possible.
>> > > >
>> > > > Add a new function rtnl_talk_extack that takes a callback as an input
>> > > > arg. If a netlink response contains extack attributes, the callback
>> > > > is invoked with the err string, the offset in the message and a pointer
>> > > > to the message returned by the kernel.
>> > > >
>> > > > Adding a new function allows commands to be moved over to the
>> > > > extended error reporting over time.
>> > > >
>> > > > For feedback, compile tested only.
>> > >
>> > > Just out of curiosity, what is the plan regarding converting iproute2
>> > > over to libmnl (ip, tc, ss, ...)? In 2015, tipc tool was the first
>> > > user merged that requires libmnl, the only other user today in the
>> > > tree is devlink, which even seems to define its own libmnl library
>> > > helpers. What is the clear benefit/rationale of outsourcing this to
>> > > libmnl? I was always under the impression that we should strive for as
>> > > few dependencies as possible?
>> >
>> > And I would like to get direction for the RDMA tool [1], which I'm
>> > working on now.
>> >
>> > The overall decision was to use netlink and put it under the iproute2
>> > umbrella. Currently, I have a working RFC which is based on the
>> > legacy sysfs interface to ensure that we are converging on
>> > user experience even before moving to actual netlink defines.
>> >
>> > And I would like to continue to work on the netlink interface, but which
>> > lib interface should I base rdmatool's netlink code on?
>> >
>> > [1] https://www.mail-archive.com/netdev@vger.kernel.org/msg148523.html
>> >
>> > >
>> > > I don't really like that we make extended ack reporting now dependent
>> > > on libmnl, which further diverges from iproute's native nl library and
>> > > requires installing another nl library, making the current status
>> > > quo even worse ... :/
>> > >
>> > > Thanks,
>> > > Daniel
>>
>> I would prefer new code use libmnl, but using libnetlink would also be ok.
>> Any later conversion to libmnl would be mostly automated anyway.
>
>Thanks, I'm copy/pasting devlink variation of libmnl :)

I needed a couple of small helpers for generic netlink support. I believe
they could be pushed to upstream libmnl so we can avoid having them in
iproute2.


>
>>
>> The real objection was copy/pasting in the kernel netlink parser.
>> That was unnecessary bloat.
>
>

Re: [Unstrung-hackers] [RFC net-next] ipv6: ext_header: add function to handle RPL extension header option 0x63

2017-05-06 Thread Jiri Pirko

Fri, May 05, 2017 at 09:55:54AM CEST, bardout...@ceid.upatras.gr wrote:
>Yes, I think we have faced the same problem; communication with RPL-supporting
>devices was failing otherwise. Your patch is also more complete since it also
>implements the #ifdef. About the comment: yes, I have run checkpatch twice with
>no errors, but ok :)

Top-posting is highly annoying. Please stop with that.

>
>On 2017-05-05 08:59, JANARDHANACHARI KELLA wrote:
>> I inserted this patch manually. It was working on the 4.9 kernel.
>> 
>> Check the link below for your reference.
>> 
>> https://github.com/mwasilak/bluetooth-next/commit/f29c632ef6a6a1777815c97fd2f326faccc704f7
>> [2]
>> 
>> On Thu, May 4, 2017 at 9:30 PM, Jiri Pirko  wrote:
>> 
>> > Thu, May 04, 2017 at 05:17:18PM CEST, bardout...@ceid.upatras.gr
>> > wrote:
>> > > Signed-off-by: Andreas Bardoutsos 
>> > > ---
>> > > Hi all!
>> > > 
>> > > I have added a dummy function (it always returns true) to recognise the
>> > > RPL extension header (RFC 6553).
>> > > Otherwise the packet was dropped by the kernel, resulting in failing
>> > > communication in RPL DAGs between Linux border routers and devices in
>> > > the graph. For example, communication with devices running Contiki OS
>> > > was previously impossible.
>> > > 
>> > > include/uapi/linux/in6.h | 1 +
>> > > net/ipv6/exthdrs.c | 13 +
>> > > 2 files changed, 14 insertions(+)
>> > > 
>> > > diff --git a/include/uapi/linux/in6.h b/include/uapi/linux/in6.h
>> > > index 46444f8fbee4..5cc12d309dfe 100644
>> > > --- a/include/uapi/linux/in6.h
>> > > +++ b/include/uapi/linux/in6.h
>> > > @@ -146,6 +146,7 @@ struct in6_flowlabel_req {
>> > > #define IPV6_TLV_CALIPSO 7 /* RFC 5570 */
>> > > #define IPV6_TLV_JUMBO 194
>> > > #define IPV6_TLV_HAO 201 /* home address option */
>> > > +#define IPV6_TLV_RPL 99 /* RFC 6553 */
>> > > 
>> > > /*
>> > > * IPV6 socket options
>> > > diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
>> > > index b636f1da9aec..82ed60d3180e 100644
>> > > --- a/net/ipv6/exthdrs.c
>> > > +++ b/net/ipv6/exthdrs.c
>> > > @@ -785,6 +785,15 @@ static bool ipv6_hop_calipso(struct sk_buff
>> > *skb, int
>> > > optoff)
>> > > return false;
>> > > }
>> > > 
>> > > +/* RPL RFC 6553 */
>> > > +
>> > > +static bool ipv6_hop_rpl(struct sk_buff *skb, int optoff)
>> > > +{
>> > > + /*Dump function which always return true
>> > > + *when rpl option is detected*/
>> > 
>> > This is definitely the wrong comment formatting. Did you run
>> > checkpatch?
>> > 
>> > > + return true;
>> > > +}
>> > > +
>> > > static const struct tlvtype_proc tlvprochopopt_lst[] = {
>> > > {
>> > > .type = IPV6_TLV_ROUTERALERT,
>> > > @@ -798,6 +807,10 @@ static const struct tlvtype_proc
>> > tlvprochopopt_lst[] = {
>> > > .type = IPV6_TLV_CALIPSO,
>> > > .func = ipv6_hop_calipso,
>> > > },
>> > > + {
>> > > + .type = IPV6_TLV_RPL,
>> > > + .func = ipv6_hop_rpl,
>> > > + },
>> > > { -1, }
>> > > };
>> > > 
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe
>> > linux-wpan" in
>> > the body of a message to majord...@vger.kernel.org
>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>> > [1]
>> 
>> --
>> 
>> Sincerely Your's
>> 
>> Janardhanachari Kella
>> 
>> Contact:+91-9908469599
>> E-mail: eni.ch...@gmail.com
>> 
>> 
>> Links:
>> --
>> [1] http://vger.kernel.org/majordomo-info.html
>> [2]
>> https://github.com/mwasilak/bluetooth-next/commit/f29c632ef6a6a1777815c97fd2f326faccc704f7
>> 
>> ___
>> Unstrung-hackers mailing list
>> unstrung-hack...@lists.sandelman.ca
>> https://lists.sandelman.ca/mailman/listinfo/unstrung-hackers
>
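
For readers following the quoted patch: the hop-by-hop parser walks a table
of (type, handler) pairs terminated by a -1 entry, and an option type without
a handler is judged by its two high-order bits. For 0x63 those bits are 01,
which means "discard the packet", and that matches the drops described above.
The following is a simplified userspace sketch of that table lookup, not the
kernel code; all demo_ names are hypothetical, and the ICMP error cases for
unknown options are collapsed into a plain "drop".

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_TLV_ROUTERALERT	5
#define DEMO_TLV_RPL		99	/* 0x63, RFC 6553 */

/* One table entry: option type plus the handler that accepts it. */
struct demo_tlv_proc {
	int type;
	bool (*func)(const unsigned char *opt);
};

/* Accept a router alert option only if its length byte is 2. */
bool demo_hop_routeralert(const unsigned char *opt)
{
	return opt[1] == 2;
}

/* Dummy handler: recognising the option is enough to keep the packet. */
bool demo_hop_rpl(const unsigned char *opt)
{
	(void)opt;
	return true;
}

/* Table without the RPL entry, as before the quoted patch. */
const struct demo_tlv_proc demo_old_lst[] = {
	{ DEMO_TLV_ROUTERALERT, demo_hop_routeralert },
	{ -1, NULL }
};

/* Table with the RPL entry added. */
const struct demo_tlv_proc demo_new_lst[] = {
	{ DEMO_TLV_ROUTERALERT, demo_hop_routeralert },
	{ DEMO_TLV_RPL, demo_hop_rpl },
	{ -1, NULL }
};

/* Returns true if the packet may continue, false if it is dropped. */
bool demo_parse_option(const struct demo_tlv_proc *tbl,
		       const unsigned char *opt)
{
	const struct demo_tlv_proc *p;

	for (p = tbl; p->type >= 0; p++)
		if (p->type == opt[0])
			return p->func(opt);

	/* Unknown option: high-order bits 00 mean "skip over it";
	 * every other value means the packet is discarded.
	 */
	return (opt[0] >> 6) == 0;
}
```

Feeding an option whose first byte is 0x63 to demo_parse_option() drops the
packet with demo_old_lst and accepts it with demo_new_lst.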

Re: [RFC iproute2 0/8] RDMA tool

2017-05-06 Thread Jiri Pirko

Thu, May 04, 2017 at 08:10:54PM CEST, bart.vanass...@sandisk.com wrote:
>On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:
>> Following our discussion both in mailing list [1] and at the LPC 2016 [2],
>> we would like to propose this RDMA tool to be part of iproute2 package
>> and finally improve this situation.
>
>Hello Leon,
>
>Although I really appreciate your work: can you clarify why you would like to
>add *RDMA* functionality to an *IP routing* tool? I haven't found any
>motivation for adding RDMA functionality to iproute2 in [1].

Bart, please realize that iproute2 is much more than "*IP routing* tool".
I understand you got confused by the name. Please see sources. Your comment
is totally pointless...

Re: [RFC iproute2 0/8] RDMA tool

2017-05-06 Thread Jiri Pirko

Fri, May 05, 2017 at 03:17:54PM CEST, l...@kernel.org wrote:
>On Fri, May 05, 2017 at 08:54:57AM +0200, Jiri Benc wrote:
>> On Thu,  4 May 2017 21:02:08 +0300, Leon Romanovsky wrote:
>> > In order to close the object model, ensure reuse of existing code and make
>> > this tool usable from day one, we decided to implement wrappers over legacy
>> > sysfs prior to implementing netlink functionality. As a nice bonus, this
>> > will allow the tool to be used with old kernels too.
>>
>> This sounds wrong. We don't support legacy ioctl interface for the 'ip'
>> command, either. I think rdma should be converted to netlink first and
>> the new tool should only use netlink.
>
>RDMA is in a slightly different situation than the "ip" tool was. "ip" was
>implemented when tools like ifconfig existed, which allowed both old and new
>systems to be configured to some degree. In the RDMA community, there are no
>tools similar to "ifconfig". A netlink-only implementation will leave old
>systems without a common tool at all.
>
>As an upstream-oriented person, I am personally fine with that, but I would
>like to get wider agreement/disagreement on it before removing the sysfs
>parsing logic from the rdmatool.

I tend to agree with Jiri Benc. I fear that supporting sysfs + netlink
api later on for the same things will make the code unnecessarily complex.
Also, the legacy sysfs will most likely stay there forever so there will
be no actual motivation to port the existing things to the new netlink
api.

For prototyping purposes, I believe that what you did makes perfect
sense. But for the actual mergeable version, my feeling is that we need
to strictly stick with the new netlink rdma interface and just forget about
the old sysfs one. Distros would have to backport the new kernel
rdma netlink api.

Yes, this will be a little bit more painful at the beginning, but in the
long run, I believe it will save some severe headaches.

Re: [PATCH v4 net-next 0/2] rtnetlink: Updates to rtnetlink_event()

2017-05-06 Thread Jiri Pirko

Fri, May 05, 2017 at 10:52:47PM CEST, vyasev...@gmail.com wrote:
>This version 4 series came out of the conversation that started
>as a result of my first attempt to add netdevice event info to netlink messages.
>
>The first patch adds the IFLA_EVENT attribute to the netlink message.  It
>supports only currently white-listed events.
>Like before, this is just an attribute that gets added to the rtnetlink
>message only when the message was generated as a result of a netdev event.
>In my case, this is necessary since I want to trap NETDEV_NOTIFY_PEERS
>event (also possibly NETDEV_RESEND_IGMP event) and perform certain actions
>in user space.  This is not possible since the messages generated as

What are you trying to do in userspace, if I may ask?

[PATCH 0/2] KCM: Fine-tuning for three function implementations

2017-05-06 Thread SF Markus Elfring

From: Markus Elfring 
Date: Sat, 6 May 2017 14:11:22 +0200

Two update suggestions were taken into account
from static source code analysis.

Markus Elfring (2):
  Replace three seq_puts() calls by seq_putc()
  Use seq_puts() in kcm_format_psock()

 net/kcm/kcmproc.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

-- 
2.12.2

[PATCH 1/2] kcm: Replace three seq_puts() calls by seq_putc()

2017-05-06 Thread SF Markus Elfring

From: Markus Elfring 
Date: Sat, 6 May 2017 13:53:41 +0200

Three single characters (line breaks) should be put into a sequence.
Thus use the corresponding function "seq_putc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 net/kcm/kcmproc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/kcm/kcmproc.c b/net/kcm/kcmproc.c
index bf75c9231cca..46b8b5f6c57f 100644
--- a/net/kcm/kcmproc.c
+++ b/net/kcm/kcmproc.c
@@ -116,7 +116,7 @@ static void kcm_format_mux_header(struct seq_file *seq)
   "Status");
 
/* XXX: pdsts header stuff here */
-   seq_puts(seq, "\n");
+   seq_putc(seq, '\n');
 }
 
 static void kcm_format_sock(struct kcm_sock *kcm, struct seq_file *seq,
@@ -146,7 +146,7 @@ static void kcm_format_sock(struct kcm_sock *kcm, struct 
seq_file *seq,
if (kcm->rx_wait)
seq_puts(seq, "RxWait ");
 
-   seq_puts(seq, "\n");
+   seq_putc(seq, '\n');
 }
 
 static void kcm_format_psock(struct kcm_psock *psock, struct seq_file *seq,
@@ -192,7 +192,7 @@ static void kcm_format_psock(struct kcm_psock *psock, 
struct seq_file *seq,
seq_puts(seq, "RdyRx ");
}
 
-   seq_puts(seq, "\n");
+   seq_putc(seq, '\n');
 }
 
 static void
-- 
2.12.2
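
As a rough illustration of why the change is behaviour-preserving, here is a
userspace stand-in for the two helpers. The demo_ names and the fixed-size
buffer are assumptions made for this sketch; the kernel's seq_file
implementation is more involved. The point is only that appending one
character needs no string-length scan and yields the same output.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, much-simplified stand-in for struct seq_file. */
struct demo_seq {
	char buf[64];
	size_t count;
};

/* Append a single character; no string length has to be computed. */
void demo_seq_putc(struct demo_seq *m, char c)
{
	if (m->count < sizeof(m->buf) - 1) {
		m->buf[m->count++] = c;
		m->buf[m->count] = '\0';
	}
}

/* Append a NUL-terminated string; strlen() walks it first. */
void demo_seq_puts(struct demo_seq *m, const char *s)
{
	size_t len = strlen(s);

	if (m->count + len < sizeof(m->buf)) {
		memcpy(m->buf + m->count, s, len + 1);
		m->count += len;
	}
}
```

Writing "RxWait " followed by demo_seq_puts(m, "\n") or by
demo_seq_putc(m, '\n') leaves identical buffer contents.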

[PATCH 2/2] kcm: Use seq_puts() in kcm_format_psock()

2017-05-06 Thread SF Markus Elfring

From: Markus Elfring 
Date: Sat, 6 May 2017 14:04:02 +0200

A string which did not contain a data format specification should be put
into a sequence. Thus use the corresponding function "seq_puts".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 net/kcm/kcmproc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/kcm/kcmproc.c b/net/kcm/kcmproc.c
index 46b8b5f6c57f..b59b46822d9e 100644
--- a/net/kcm/kcmproc.c
+++ b/net/kcm/kcmproc.c
@@ -182,7 +182,7 @@ static void kcm_format_psock(struct kcm_psock *psock, 
struct seq_file *seq,
seq_printf(seq, "RxWait=%u ",
   psock->strp.rx_need_bytes);
else
-   seq_printf(seq, "RxWait ");
+   seq_puts(seq, "RxWait ");
}
} else  {
if (psock->strp.rx_paused)
-- 
2.12.2

Re: [RFC iproute2 0/8] RDMA tool

2017-05-06 Thread Bart Van Assche

On Sat, 2017-05-06 at 12:40 +0200, Jiri Pirko wrote:
> Thu, May 04, 2017 at 08:10:54PM CEST, bart.vanass...@sandisk.com wrote:
> > On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:
> > > Following our discussion both in mailing list [1] and at the LPC 2016 [2],
> > > we would like to propose this RDMA tool to be part of iproute2 package
> > > and finally improve this situation.
> > 
> > Although I really appreciate your work: can you clarify why you would
> > like to add *RDMA* functionality to an *IP routing* tool? I haven't found
> > any motivation for adding RDMA functionality to iproute2 in [1].
> 
> Bart, please realize that iproute2 is much more than "*IP routing* tool".
> I understand you got confused by the name. Please see sources. Your comment
> is totally pointless...

I asked for a clarification that should have been in the cover letter but
was missing from it. So I think asking was the right thing to do
rather than pointless. BTW, can you explain why you are using an e-mail address
that is hiding that you are a Mellanox employee?

Bart.

Re: [PATCH] net: dsa: loop: Check for memory allocation failure

2017-05-06 Thread Andrew Lunn

On Sat, May 06, 2017 at 07:29:45AM +0200, Christophe JAILLET wrote:
> If 'devm_kzalloc' fails, a NULL pointer will be dereferenced.
> Return -ENOMEM instead, as done for some other memory allocation just a
> few lines above.
> 
> Fixes: 98cd1552ea27 ("net: dsa: Mock-up driver")
> 
> Signed-off-by: Christophe JAILLET 

Reviewed-by: Andrew Lunn 

Andrew
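
The patch body is not quoted in this reply, so the following is only a
generic userspace sketch of the pattern the commit message describes: check
the allocation result and return -ENOMEM instead of dereferencing NULL. The
demo_ names and the simulate_failure switch are hypothetical; calloc() stands
in for devm_kzalloc().

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical private data of the mock-up driver. */
struct demo_priv {
	int id;
};

/* Stand-in allocator; simulate_failure forces the out-of-memory path. */
void *demo_zalloc(size_t size, int simulate_failure)
{
	return simulate_failure ? NULL : calloc(1, size);
}

/* Probe-style function: bail out before the pointer is ever used. */
int demo_probe(int simulate_failure, struct demo_priv **out)
{
	struct demo_priv *ps = demo_zalloc(sizeof(*ps), simulate_failure);

	if (!ps)
		return -ENOMEM;

	ps->id = 1;	/* safe: ps is known to be valid here */
	*out = ps;
	return 0;
}
```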

[PATCH] wlcore: use memdup_user

2017-05-06 Thread Geliang Tang

Use memdup_user() helper instead of open-coding to simplify the code.

Signed-off-by: Geliang Tang 
---
 drivers/net/wireless/ti/wlcore/debugfs.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/net/wireless/ti/wlcore/debugfs.c 
b/drivers/net/wireless/ti/wlcore/debugfs.c
index de7e2a5..a2cb408 100644
--- a/drivers/net/wireless/ti/wlcore/debugfs.c
+++ b/drivers/net/wireless/ti/wlcore/debugfs.c
@@ -1149,15 +1149,9 @@ static ssize_t dev_mem_write(struct file *file, const 
char __user *user_buf,
part.mem.start = *ppos;
part.mem.size = bytes;
 
-   buf = kmalloc(bytes, GFP_KERNEL);
-   if (!buf)
-   return -ENOMEM;
-
-   ret = copy_from_user(buf, user_buf, bytes);
-   if (ret) {
-   ret = -EFAULT;
-   goto err_out;
-   }
+   buf = memdup_user(user_buf, bytes);
+   if (IS_ERR(buf))
+   return PTR_ERR(buf);
 
mutex_lock(&wl->mutex);
 
@@ -1197,7 +1191,6 @@ static ssize_t dev_mem_write(struct file *file, const 
char __user *user_buf,
if (ret == 0)
*ppos += bytes;
 
-err_out:
kfree(buf);
 
return ((ret == 0) ? bytes : ret);
-- 
2.9.3
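
For context on the conversion, the sketch below mimics the memdup_user()
calling convention in userspace: a single call either returns a valid copy or
an errno value encoded in the pointer, which the caller tests with an
IS_ERR-style check. Every demo_ name is hypothetical, malloc()/memcpy() stand
in for the kernel allocation and copy_from_user(), and a NULL source is used
here to simulate a faulting user pointer.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_ERRNO 4095

/* Userspace imitations of ERR_PTR(), PTR_ERR() and IS_ERR(). */
void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Allocate and copy in one step; errors come back inside the pointer. */
void *demo_memdup(const void *src, size_t len)
{
	void *p;

	if (!src)			/* simulated copy fault */
		return demo_err_ptr(-EFAULT);

	p = malloc(len);
	if (!p)
		return demo_err_ptr(-ENOMEM);

	memcpy(p, src, len);
	return p;
}

/* Caller shaped like the converted write handlers: one check, no label. */
long demo_write(const void *user_buf, size_t bytes)
{
	void *buf = demo_memdup(user_buf, bytes);

	if (demo_is_err(buf))
		return demo_ptr_err(buf);

	/* ... the copied data would be consumed here ... */
	free(buf);
	return (long)bytes;
}
```

Because the early return happens before anything else is acquired, the
cleanup label that only existed for the copy failure path is no longer
needed, which is what the removed err_out label in the hunk reflects.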

[PATCH] yam: use memdup_user

2017-05-06 Thread Geliang Tang

Use memdup_user() helper instead of open-coding to simplify the code.

Signed-off-by: Geliang Tang 
---
 drivers/net/hamradio/yam.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hamradio/yam.c b/drivers/net/hamradio/yam.c
index b6891ad..7a7c522 100644
--- a/drivers/net/hamradio/yam.c
+++ b/drivers/net/hamradio/yam.c
@@ -976,12 +976,10 @@ static int yam_ioctl(struct net_device *dev, struct ifreq 
*ifr, int cmd)
case SIOCYAMSMCS:
if (netif_running(dev))
return -EINVAL; /* Cannot change this parameter 
when up */
-   if ((ym = kmalloc(sizeof(struct yamdrv_ioctl_mcs), GFP_KERNEL)) 
== NULL)
-   return -ENOBUFS;
-   if (copy_from_user(ym, ifr->ifr_data, sizeof(struct 
yamdrv_ioctl_mcs))) {
-   kfree(ym);
-   return -EFAULT;
-   }
+   ym = memdup_user(ifr->ifr_data,
+sizeof(struct yamdrv_ioctl_mcs));
+   if (IS_ERR(ym))
+   return PTR_ERR(ym);
if (ym->bitrate > YAM_MAXBITRATE) {
kfree(ym);
return -EINVAL;
-- 
2.9.3

[PATCH] xfrm: use memdup_user

2017-05-06 Thread Geliang Tang

Use memdup_user() helper instead of open-coding to simplify the code.

Signed-off-by: Geliang Tang 
---
 net/xfrm/xfrm_state.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index fc3c5aa..5780cda 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -2023,13 +2023,9 @@ int xfrm_user_policy(struct sock *sk, int optname, u8 
__user *optval, int optlen
if (optlen <= 0 || optlen > PAGE_SIZE)
return -EMSGSIZE;
 
-   data = kmalloc(optlen, GFP_KERNEL);
-   if (!data)
-   return -ENOMEM;
-
-   err = -EFAULT;
-   if (copy_from_user(data, optval, optlen))
-   goto out;
+   data = memdup_user(optval, optlen);
+   if (IS_ERR(data))
+   return PTR_ERR(data);
 
err = -EINVAL;
rcu_read_lock();
@@ -2047,7 +2043,6 @@ int xfrm_user_policy(struct sock *sk, int optname, u8 
__user *optval, int optlen
err = 0;
}
 
-out:
kfree(data);
return err;
 }
-- 
2.9.3

[PATCH] net/hippi/rrunner: use memdup_user

2017-05-06 Thread Geliang Tang

Use memdup_user() helper instead of open-coding to simplify the code.

Signed-off-by: Geliang Tang 
---
 drivers/net/hippi/rrunner.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/net/hippi/rrunner.c b/drivers/net/hippi/rrunner.c
index 9b0d614..1ce6239 100644
--- a/drivers/net/hippi/rrunner.c
+++ b/drivers/net/hippi/rrunner.c
@@ -1616,17 +1616,14 @@ static int rr_ioctl(struct net_device *dev, struct 
ifreq *rq, int cmd)
return -EPERM;
}
 
-   image = kmalloc(EEPROM_WORDS * sizeof(u32), GFP_KERNEL);
-   oldimage = kmalloc(EEPROM_WORDS * sizeof(u32), GFP_KERNEL);
-   if (!image || !oldimage) {
-   error = -ENOMEM;
-   goto wf_out;
-   }
+   image = memdup_user(rq->ifr_data, EEPROM_BYTES);
+   if (IS_ERR(image))
+   return PTR_ERR(image);
 
-   error = copy_from_user(image, rq->ifr_data, EEPROM_BYTES);
-   if (error) {
-   error = -EFAULT;
-   goto wf_out;
+   oldimage = kmalloc(EEPROM_BYTES, GFP_KERNEL);
+   if (!oldimage) {
+   kfree(image);
+   return -ENOMEM;
}
 
if (rrpriv->fw_running){
-- 
2.9.3

[PATCH] wil6210: use memdup_user

2017-05-06 Thread Geliang Tang

Use memdup_user() helper instead of open-coding to simplify the code.

Signed-off-by: Geliang Tang 
---
 drivers/net/wireless/ath/wil6210/debugfs.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/net/wireless/ath/wil6210/debugfs.c 
b/drivers/net/wireless/ath/wil6210/debugfs.c
index 5648ebb..5b0f9fc 100644
--- a/drivers/net/wireless/ath/wil6210/debugfs.c
+++ b/drivers/net/wireless/ath/wil6210/debugfs.c
@@ -795,15 +795,11 @@ static ssize_t wil_write_file_txmgmt(struct file *file, 
const char __user *buf,
struct wireless_dev *wdev = wil_to_wdev(wil);
struct cfg80211_mgmt_tx_params params;
int rc;
-   void *frame = kmalloc(len, GFP_KERNEL);
+   void *frame;
 
-   if (!frame)
-   return -ENOMEM;
-
-   if (copy_from_user(frame, buf, len)) {
-   kfree(frame);
-   return -EIO;
-   }
+   frame = memdup_user(buf, len);
+   if (IS_ERR(frame))
+   return PTR_ERR(frame);
 
params.buf = frame;
params.len = len;
-- 
2.9.3

[PATCH RFC net-next 0/6] net: reducing memory footprint of network devices

2017-05-06 Thread David Ahern

As I have mentioned many times[1], at ~43+kB per instance the use of
net_devices does not scale for deployments needing 10,000+ devices. At
netconf 1.2 there was a discussion about using a net_device_common for
the minimal set of common attributes with other structs built on top of
that one for "full" devices. It provided a means for the code to know
"non-standard" net_devices. Conceptually, that approach has its merits
but it is not practical given the sweeping changes required to the code
base. More importantly though struct net_device is not the problem; it
weighs in at less than 2kB so reorganizing the code base around a
refactored net_device is not going to solve the problem. The primary
issue is all of the initializations done *because* it is a struct
net_device -- kobject and sysfs and the protocols (e.g., ipv4, ipv6,
mpls, neighbors).

So, how do you keep the desired attributes of a net device -- network
addresses, xmit function, qdisc, netfilter rules, tcpdump -- while
lowering the overhead of a net_device instance and without sweeping
changes across net/ and drivers/net/?

This patch set introduces the concept of labeling net_devices as
"lightweight", first mentioned at netdev 1.1 [1]. Users have to opt
in to lightweight devices by passing a new attribute, IFLA_LWT_NETDEV,
in the new link request. This lightweight tag is meant for virtual
devices such as vlan, vrf, vti, and dummy where the user expects to
create a lot of them and does not want the duplication of resources.
Each device type can always opt out of a lightweight label if necessary
by failing device creation.

Labeling a virtual device as "lightweight" reduces the footprint for
device creation from ~43kB to ~6kB. That reduction in memory is obtained
by:
1. no entry in sysfs
   - kobject in net_device.device is not initialized

2. no entry in procfs
   - no sysctl option for these devices

3. deferred ipv4, ipv6, mpls initialization
   - network layer must be enabled before an address can be assigned
 or mpls labels can be processed
   - enables what Florian called L2 only devices [2]
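
To put the two per-device figures in perspective, a back-of-the-envelope
calculation for the 10,000-device case mentioned at the top of this letter
can be sketched as follows. The helper name is hypothetical, and the inputs
are the approximate ~43kB and ~6kB values quoted above, so the results are
rough estimates only.

```c
/* Total footprint in kB for `count` devices at `kb_each` kB apiece. */
unsigned long demo_footprint_kb(unsigned long count, unsigned long kb_each)
{
	return count * kb_each;
}
```

With these inputs, 10,000 regular devices come to about 430,000 kB (roughly
430 MB), while 10,000 lightweight devices come to about 60,000 kB (roughly
60 MB).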

Once the core premise of a lightweight device is accepted, follow on
patches can reduce the overhead of network initializations. e.g.,

1. remove devconf per device (ipv4 and ipv6)
   - lightweight devices use the default settings rather than replicate
 the same data for each device

2. reduce / remove / opt out of snmp mibs
   - snmp6_alloc_dev and icmpv6msg_mib_device specifically is a heavy
 hitter

Patches can also be found here:
https://github.com/dsahern/linux lwt-dev-rfc

And iproute2 here:
https://github.com/dsahern/iproute2 lwt-dev

Example:
ip li add foo lwd type vrf table 123

- creates VRF device 'foo' as a lightweight netdevice.


[1] 
http://www.netdevconf.org/1.1/proceedings/slides/ahern-aleksandrov-prabhu-scaling-network-cumulus.pdf
[2] https://www.spinics.net/lists/netdev/msg340808.html
David Ahern (6):
  net: Add accessor for kobject in a net_device
  net: Add flags argument to alloc_netdev_mqs
  net: Introduce IFF_LWT_NETDEV flag
  net: Do not initialize kobject for lightweight netdevs
  net: Delay initializations for lightweight devices
  net: add uapi for creating lightweight devices

 drivers/net/ethernet/mellanox/mlx5/core/ipoib.c |  2 +-
 drivers/net/ethernet/tile/tilegx.c  |  2 +-
 drivers/net/tun.c   |  2 +-
 drivers/net/wireless/marvell/mwifiex/cfg80211.c |  2 +-
 include/linux/netdevice.h   | 27 --
 include/uapi/linux/if_link.h|  1 +
 net/batman-adv/sysfs.c  | 13 -
 net/bridge/br_if.c  | 12 +++--
 net/bridge/br_sysfs_br.c| 17 +++---
 net/bridge/br_sysfs_if.c|  8 ++-
 net/core/dev.c  | 71 ++---
 net/core/neighbour.c|  3 ++
 net/core/net-sysfs.c| 25 ++---
 net/core/rtnetlink.c| 10 +++-
 net/ethernet/eth.c  |  2 +-
 net/ipv4/devinet.c  | 18 ++-
 net/ipv6/addrconf.c |  9 
 net/mac80211/iface.c|  2 +-
 net/mpls/af_mpls.c  |  6 +++
 net/wireless/core.c | 15 --
 20 files changed, 190 insertions(+), 57 deletions(-)

-- 
2.11.0 (Apple Git-81)
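
One way to picture how an accessor supports the "no entry in sysfs" point is
the userspace sketch below. It rests on an assumption: that the accessor
returns NULL for a lightweight device, which the NULL checks added to callers
in patch 1/6 suggest but which is not shown in this excerpt. All demo_ names
and the flag value are hypothetical.

```c
#include <stddef.h>

/* ASSUMPTION: placeholder value; the real IFF_LWT_NETDEV bit is not shown. */
#define DEMO_IFF_LWT_NETDEV 0x1u

struct demo_kobject {
	const char *name;
};

/* Hypothetical cut-down net_device with an embedded kobject. */
struct demo_net_device {
	unsigned int flags;
	struct demo_kobject kobj;
};

/* Accessor: hides whether the device has a usable kobject at all. */
struct demo_kobject *demo_netdev_kobject(struct demo_net_device *dev)
{
	if (dev->flags & DEMO_IFF_LWT_NETDEV)
		return NULL;	/* assumed behaviour for lightweight devices */
	return &dev->kobj;
}

/* Caller: no kobject means nothing to register, which is not an error. */
int demo_sysfs_add(struct demo_net_device *dev, int *entries_created)
{
	struct demo_kobject *kobj = demo_netdev_kobject(dev);

	if (!kobj)
		return 0;

	(*entries_created)++;
	return 0;
}
```

Routing every user through one accessor means the sysfs-related callers only
need a NULL check each, rather than knowing about the lightweight flag.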

[PATCH RFC net-next 1/6] net: Add accessor for kobject in a net_device

2017-05-06 Thread David Ahern

Signed-off-by: David Ahern 
---
 include/linux/netdevice.h |  5 +
 net/batman-adv/sysfs.c| 13 +--
 net/bridge/br_if.c| 12 ++
 net/bridge/br_sysfs_br.c  | 17 +-
 net/bridge/br_sysfs_if.c  |  8 +--
 net/core/dev.c| 57 ++-
 net/core/net-sysfs.c  | 11 +
 net/wireless/core.c   | 15 +
 8 files changed, 100 insertions(+), 38 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9c23bd2efb56..305d2d42b349 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4272,6 +4272,11 @@ static inline const char *netdev_reg_state(const struct 
net_device *dev)
return " (unknown)";
 }
 
+static inline struct kobject *netdev_kobject(struct net_device *dev)
+{
+   return &dev->dev.kobj;
+}
+
 __printf(3, 4)
 void netdev_printk(const char *level, const struct net_device *dev,
   const char *format, ...);
diff --git a/net/batman-adv/sysfs.c b/net/batman-adv/sysfs.c
index 0ae8b30e4eaa..a8a7294fc054 100644
--- a/net/batman-adv/sysfs.c
+++ b/net/batman-adv/sysfs.c
@@ -735,11 +735,14 @@ static struct batadv_attribute *batadv_vlan_attrs[] = {
 
 int batadv_sysfs_add_meshif(struct net_device *dev)
 {
-   struct kobject *batif_kobject = &dev->dev.kobj;
+   struct kobject *batif_kobject = netdev_kobject(dev);
struct batadv_priv *bat_priv = netdev_priv(dev);
struct batadv_attribute **bat_attr;
int err;
 
+   if (!batif_kobject)
+   return 0;
+
bat_priv->mesh_obj = kobject_create_and_add(BATADV_SYSFS_IF_MESH_SUBDIR,
batif_kobject);
if (!bat_priv->mesh_obj) {
@@ -778,6 +781,9 @@ void batadv_sysfs_del_meshif(struct net_device *dev)
struct batadv_priv *bat_priv = netdev_priv(dev);
struct batadv_attribute **bat_attr;
 
+   if (!bat_priv->mesh_obj)
+   return;
+
for (bat_attr = batadv_mesh_attrs; *bat_attr; ++bat_attr)
sysfs_remove_file(bat_priv->mesh_obj, &((*bat_attr)->attr));
 
@@ -1132,10 +1138,13 @@ static struct batadv_attribute *batadv_batman_attrs[] = 
{
 
 int batadv_sysfs_add_hardif(struct kobject **hardif_obj, struct net_device 
*dev)
 {
-   struct kobject *hardif_kobject = &dev->dev.kobj;
+   struct kobject *hardif_kobject = netdev_kobject(dev);
struct batadv_attribute **bat_attr;
int err;
 
+   if (!hardif_kobject)
+   return 0;
+
*hardif_obj = kobject_create_and_add(BATADV_SYSFS_IF_BAT_SUBDIR,
 hardif_kobject);
 
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 7f8d05cf9065..a5354436ada8 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -485,6 +485,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
struct net_bridge_port *p;
int err = 0;
unsigned br_hr, dev_hr;
+   struct kobject *kobj;
bool changed_addr;
 
/* Don't allow bridging non-ethernet like devices, or DSA-enabled
@@ -521,10 +522,13 @@ int br_add_if(struct net_bridge *br, struct net_device 
*dev)
if (err)
goto put_back;
 
-   err = kobject_init_and_add(&p->kobj, &brport_ktype, &(dev->dev.kobj),
-  SYSFS_BRIDGE_PORT_ATTR);
-   if (err)
-   goto err1;
+   kobj = netdev_kobject(dev);
+   if (kobj) {
+   err = kobject_init_and_add(&p->kobj, &brport_ktype, kobj,
+  SYSFS_BRIDGE_PORT_ATTR);
+   if (err)
+   goto err1;
+   }
 
err = br_sysfs_addif(p);
if (err)
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 0b5dd607444c..f6439664ffea 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -917,10 +917,13 @@ static struct bin_attribute bridge_forward = {
  */
 int br_sysfs_addbr(struct net_device *dev)
 {
-   struct kobject *brobj = &dev->dev.kobj;
+   struct kobject *brobj = netdev_kobject(dev);
struct net_bridge *br = netdev_priv(dev);
int err;
 
+   if (!brobj)
+   return 0;
+
err = sysfs_create_group(brobj, &bridge_group);
if (err) {
pr_info("%s: can't create group %s/%s\n",
@@ -944,9 +947,9 @@ int br_sysfs_addbr(struct net_device *dev)
}
return 0;
  out3:
-   sysfs_remove_bin_file(&dev->dev.kobj, &bridge_forward);
+   sysfs_remove_bin_file(brobj, &bridge_forward);
  out2:
-   sysfs_remove_group(&dev->dev.kobj, &bridge_group);
+   sysfs_remove_group(brobj, &bridge_group);
  out1:
return err;
 
@@ -954,10 +957,12 @@ int br_sysfs_addbr(struct net_device *dev)
 
 void br_sysfs_delbr(struct net_device *dev)
 {
-   struct kobject *kobj = &dev->dev.kobj;
+   struct kobject *kobj = netdev_kobject(

[PATCH RFC net-next 3/6] net: Introduce IFF_LWT_NETDEV flag

2017-05-06 Thread David Ahern

Add new flag to denote lightweight netdevices. Add helper to identify
such devices.

Signed-off-by: David Ahern 
---
 include/linux/netdevice.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f47c8712398a..08151fd34973 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1401,6 +1401,7 @@ enum netdev_priv_flags {
IFF_RXFH_CONFIGURED = 1<<25,
IFF_PHONY_HEADROOM  = 1<<26,
IFF_MACSEC  = 1<<27,
+   IFF_LWT_NETDEV  = 1<<28,
 };
 
 #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
@@ -1430,6 +1431,7 @@ enum netdev_priv_flags {
 #define IFF_TEAM   IFF_TEAM
 #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED
 #define IFF_MACSEC IFF_MACSEC
+#define IFF_LWT_NETDEV IFF_LWT_NETDEV
 
 /**
  * struct net_device - The DEVICE structure.
@@ -4137,6 +4139,11 @@ static inline void skb_gso_error_unwind(struct sk_buff 
*skb, __be16 protocol,
skb->mac_len = mac_len;
 }
 
+static inline bool netif_is_lwd(struct net_device *dev)
+{
+   return !!(dev->priv_flags & IFF_LWT_NETDEV);
+}
+
 static inline bool netif_is_macsec(const struct net_device *dev)
 {
return dev->priv_flags & IFF_MACSEC;
-- 
2.11.0 (Apple Git-81)

[PATCH RFC net-next 2/6] net: Add flags argument to alloc_netdev_mqs

2017-05-06 Thread David Ahern

Used in a later patch to pass in flags at create time.

Signed-off-by: David Ahern 
---
 drivers/net/ethernet/mellanox/mlx5/core/ipoib.c | 2 +-
 drivers/net/ethernet/tile/tilegx.c  | 2 +-
 drivers/net/tun.c   | 2 +-
 drivers/net/wireless/marvell/mwifiex/cfg80211.c | 2 +-
 include/linux/netdevice.h   | 7 ---
 net/core/dev.c  | 5 -
 net/core/rtnetlink.c| 2 +-
 net/ethernet/eth.c  | 2 +-
 net/mac80211/iface.c| 2 +-
 9 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c 
b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
index 3c84e36af018..f5aaa92726a2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib.c
@@ -446,7 +446,7 @@ static struct net_device *mlx5_rdma_netdev_alloc(struct 
mlx5_core_dev *mdev,
  name, NET_NAME_UNKNOWN,
  setup,
  nch * MLX5E_MAX_NUM_TC,
- nch);
+ nch, 0);
if (!netdev) {
mlx5_core_warn(mdev, "alloc_netdev_mqs failed\n");
goto free_mdev_resources;
diff --git a/drivers/net/ethernet/tile/tilegx.c 
b/drivers/net/ethernet/tile/tilegx.c
index 7c634bc75615..f38067e260bd 100644
--- a/drivers/net/ethernet/tile/tilegx.c
+++ b/drivers/net/ethernet/tile/tilegx.c
@@ -2198,7 +2198,7 @@ static void tile_net_dev_init(const char *name, const 
uint8_t *mac)
 * template, instantiated by register_netdev(), but not for us.
 */
dev = alloc_netdev_mqs(sizeof(*priv), name, NET_NAME_UNKNOWN,
-  tile_net_setup, NR_CPUS, 1);
+  tile_net_setup, NR_CPUS, 1, 0);
if (!dev) {
pr_err("alloc_netdev_mqs(%s) failed\n", name);
return;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bbd707b9ef7a..030621621ea8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1804,7 +1804,7 @@ static int tun_set_iff(struct net *net, struct file 
*file, struct ifreq *ifr)
 
dev = alloc_netdev_mqs(sizeof(struct tun_struct), name,
   NET_NAME_UNKNOWN, tun_setup, queues,
-  queues);
+  queues, 0);
 
if (!dev)
return -ENOMEM;
diff --git a/drivers/net/wireless/marvell/mwifiex/cfg80211.c 
b/drivers/net/wireless/marvell/mwifiex/cfg80211.c
index 7ec06bf13413..38b6570ff1cd 100644
--- a/drivers/net/wireless/marvell/mwifiex/cfg80211.c
+++ b/drivers/net/wireless/marvell/mwifiex/cfg80211.c
@@ -2960,7 +2960,7 @@ struct wireless_dev *mwifiex_add_virtual_intf(struct 
wiphy *wiphy,
 
dev = alloc_netdev_mqs(sizeof(struct mwifiex_private *), name,
   name_assign_type, ether_setup,
-  IEEE80211_NUM_ACS, 1);
+  IEEE80211_NUM_ACS, 1, 0);
if (!dev) {
mwifiex_dbg(adapter, ERROR,
"no memory available for netdevice\n");
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 305d2d42b349..f47c8712398a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3699,13 +3699,14 @@ void ether_setup(struct net_device *dev);
 struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
unsigned char name_assign_type,
void (*setup)(struct net_device *),
-   unsigned int txqs, unsigned int rxqs);
+   unsigned int txqs, unsigned int rxqs,
+   unsigned int flags);
 #define alloc_netdev(sizeof_priv, name, name_assign_type, setup) \
-   alloc_netdev_mqs(sizeof_priv, name, name_assign_type, setup, 1, 1)
+   alloc_netdev_mqs(sizeof_priv, name, name_assign_type, setup, 1, 1, 0)
 
 #define alloc_netdev_mq(sizeof_priv, name, name_assign_type, setup, count) \
alloc_netdev_mqs(sizeof_priv, name, name_assign_type, setup, count, \
-count)
+count, 0)
 
 int register_netdev(struct net_device *dev);
 void unregister_netdev(struct net_device *dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index f166b3bf1895..48a0252037d5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7829,6 +7829,7 @@ void netdev_freemem(struct net_device *dev)
  * @setup: callback to initialize device
  * @txqs: the number of TX subqueues to allocate
  * @rxqs: the number of RX subqueues to allocate
+ * @flags: flags to 'or' with priv_flags
  *
  * Allocates a struct net_device with private data area for driver us

[PATCH RFC net-next 4/6] net: Do not initialize kobject for lightweight netdevs

2017-05-06 Thread David Ahern

Lightweight netdevices are not added to sysfs; bypass kobject
initialization.

Signed-off-by: David Ahern 
---
 include/linux/netdevice.h |  3 +++
 net/core/dev.c|  9 ++---
 net/core/net-sysfs.c  | 14 +++---
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 08151fd34973..4ddd0ac7e1cb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4282,6 +4282,9 @@ static inline const char *netdev_reg_state(const struct 
net_device *dev)
 
 static inline struct kobject *netdev_kobject(struct net_device *dev)
 {
+   if (netif_is_lwd(dev))
+   return NULL;
+
return &dev->dev.kobj;
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 48a0252037d5..52bb01041d12 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7993,7 +7993,8 @@ void free_netdev(struct net_device *dev)
dev->reg_state = NETREG_RELEASED;
 
/* will free via device release */
-   put_device(&dev->dev);
+   if (!netif_is_lwd(dev))
+   put_device(&dev->dev);
 }
 EXPORT_SYMBOL(free_netdev);
 
@@ -8179,8 +8180,10 @@ int dev_change_net_namespace(struct net_device *dev, 
struct net *net, const char
netdev_adjacent_add_links(dev);
 
/* Fixup kobjects */
-   err = device_rename(&dev->dev, dev->name);
-   WARN_ON(err);
+   if (!netif_is_lwd(dev)) {
+   err = device_rename(&dev->dev, dev->name);
+   WARN_ON(err);
+   }
 
/* Add the device back in the hashes */
list_netdevice(dev);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 9df53b688f5b..725348cdeb3b 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1559,18 +1559,22 @@ EXPORT_SYMBOL(of_find_net_device_by_node);
  */
 void netdev_unregister_kobject(struct net_device *ndev)
 {
+   struct kobject *kobj = netdev_kobject(ndev);
struct device *dev = &(ndev->dev);
 
if (!atomic_read(&dev_net(ndev)->count))
dev_set_uevent_suppress(dev, 1);
 
-   kobject_get(&dev->kobj);
+   if (kobj)
+   kobject_get(kobj);
 
-   remove_queue_kobjects(ndev);
+   if (!netif_is_lwd(ndev))
+   remove_queue_kobjects(ndev);
 
pm_runtime_set_memalloc_noio(dev, false);
 
-   device_del(dev);
+   if (!netif_is_lwd(ndev))
+   device_del(dev);
 }
 
 /* Create sysfs entries for network device. */
@@ -1580,6 +1584,9 @@ int netdev_register_kobject(struct net_device *ndev)
const struct attribute_group **groups = ndev->sysfs_groups;
int error = 0;
 
+   if (netif_is_lwd(ndev))
+   goto pm;
+
device_initialize(dev);
dev->class = &net_class;
dev->platform_data = ndev;
@@ -1614,6 +1621,7 @@ int netdev_register_kobject(struct net_device *ndev)
return error;
}
 
+pm:
pm_runtime_set_memalloc_noio(dev, true);
 
return error;
-- 
2.11.0 (Apple Git-81)
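
For readers following patches 1/6 and 4/6 together, the accessor idea can be
sketched in plain userspace C. This is only a toy model under stated
assumptions, not kernel code: struct toy_kobject, struct toy_netdev,
TOY_IFF_LWT, toy_kobject() and toy_sysfs_add() are invented names standing in
for struct kobject, struct net_device, IFF_LWT_NETDEV, netdev_kobject() and a
typical sysfs caller.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kobject. */
struct toy_kobject { int registered; };

#define TOY_IFF_LWT (1u << 28)	/* mirrors the bit used for IFF_LWT_NETDEV */

/* Hypothetical stand-in for struct net_device. */
struct toy_netdev {
	unsigned int priv_flags;
	struct toy_kobject kobj;
};

static inline bool toy_is_lwd(const struct toy_netdev *dev)
{
	return dev->priv_flags & TOY_IFF_LWT;
}

/* Single place that decides whether a device has a kobject at all. */
static inline struct toy_kobject *toy_kobject(struct toy_netdev *dev)
{
	if (toy_is_lwd(dev))
		return NULL;
	return &dev->kobj;
}

/* Caller pattern: no kobject means nothing to add, which is not an error. */
static int toy_sysfs_add(struct toy_netdev *dev)
{
	struct toy_kobject *kobj = toy_kobject(dev);

	if (!kobj)
		return 0;
	kobj->registered = 1;
	return 0;
}
```

The point of routing every user through one accessor is that the
"lightweight" decision lives in a single function, and each caller only needs
a NULL check.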

[PATCH RFC net-next 5/6] net: Delay initializations for lightweight devices

2017-05-06 Thread David Ahern

Delay ipv4 and ipv6 initializations on lightweight netdevices until an
address is added to the device.

Skip sysctl initialization for neighbor path as well.

Signed-off-by: David Ahern 
---
 include/linux/netdevice.h |  5 +
 net/core/neighbour.c  |  3 +++
 net/ipv4/devinet.c| 18 --
 net/ipv6/addrconf.c   |  9 +
 net/mpls/af_mpls.c|  6 ++
 5 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4ddd0ac7e1cb..32d155be777a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4144,6 +4144,11 @@ static inline bool netif_is_lwd(struct net_device *dev)
return !!(dev->priv_flags & IFF_LWT_NETDEV);
 }
 
+static inline bool netif_has_sysctl(struct net_device *dev)
+{
+   return !netif_is_lwd(dev);
+}
+
 static inline bool netif_is_macsec(const struct net_device *dev)
 {
return dev->priv_flags & IFF_MACSEC;
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 58b0bcc125b5..10104a7135e2 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -3123,6 +3123,9 @@ int neigh_sysctl_register(struct net_device *dev, struct 
neigh_parms *p,
char neigh_path[ sizeof("net//neigh/") + IFNAMSIZ + IFNAMSIZ ];
char *p_name;
 
+   if (dev && !netif_has_sysctl(dev))
+   return 0;
+
t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL);
if (!t)
goto err;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index df14815a3b8c..c5ffd3ed4b2c 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -771,8 +771,15 @@ static struct in_ifaddr *rtm_to_ifaddr(struct net *net, 
struct nlmsghdr *nlh,
 
in_dev = __in_dev_get_rtnl(dev);
err = -ENOBUFS;
-   if (!in_dev)
-   goto errout;
+   if (!in_dev) {
+   if (netif_is_lwd(dev)) {
+   in_dev = inetdev_init(dev);
+   if (IS_ERR(in_dev))
+   in_dev = NULL;
+   }
+   if (!in_dev)
+   goto errout;
+   }
 
ifa = inet_alloc_ifa();
if (!ifa)
@@ -1417,6 +1424,10 @@ static int inetdev_event(struct notifier_block *this, 
unsigned long event,
 
if (!in_dev) {
if (event == NETDEV_REGISTER) {
+   /* inet init is deferred for lightweight devices */
+   if (netif_is_lwd(dev))
+   goto out;
+
in_dev = inetdev_init(dev);
if (IS_ERR(in_dev))
return notifier_from_errno(PTR_ERR(in_dev));
@@ -2303,6 +2314,9 @@ static int devinet_sysctl_register(struct in_device *idev)
 {
int err;
 
+   if (!netif_has_sysctl(idev->dev))
+   return 0;
+
if (!sysctl_dev_name_is_allowed(idev->dev->name))
return -EINVAL;
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 8d297a79b568..9814df6b7017 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3371,6 +3371,10 @@ static int addrconf_notify(struct notifier_block *this, 
unsigned long event,
 
switch (event) {
case NETDEV_REGISTER:
+   /* inet6 init is deferred for lightweight devices */
+   if (netif_is_lwd(dev))
+   return NOTIFY_OK;
+
if (!idev && dev->mtu >= IPV6_MIN_MTU) {
idev = ipv6_add_dev(dev);
if (IS_ERR(idev))
@@ -6368,6 +6372,11 @@ static int __addrconf_sysctl_register(struct net *net, 
char *dev_name,
struct ctl_table *table;
char path[sizeof("net/ipv6/conf/") + IFNAMSIZ];
 
+   if (idev && idev->dev && !netif_has_sysctl(idev->dev)) {
+   p->sysctl_header = NULL;
+   return 0;
+   }
+
table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL);
if (!table)
goto out;
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 088e2b459d0f..7503d68da2ea 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -1251,6 +1251,9 @@ static int mpls_dev_sysctl_register(struct net_device 
*dev,
struct ctl_table *table;
int i;
 
+   if (!netif_has_sysctl(dev))
+   return 0;
+
table = kmemdup(&mpls_dev_table, sizeof(mpls_dev_table), GFP_KERNEL);
if (!table)
goto out;
@@ -1285,6 +1288,9 @@ static void mpls_dev_sysctl_unregister(struct net_device 
*dev,
struct net *net = dev_net(dev);
struct ctl_table *table;
 
+   if (!mdev->sysctl)
+   return;
+
table = mdev->sysctl->ctl_table_arg;
unregister_net_sysctl_table(mdev->sysctl);
kfree(table);
-- 
2.11.0 (Apple Git-81)
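
The deferral described in patch 5/6 is essentially lazy initialization: skip
the per-protocol setup at register time and create it when the first address
is added. A minimal userspace sketch of that control flow, assuming invented
names throughout (struct lazy_netdev, lazy_inetdev_init(), lazy_register()
and lazy_add_addr() are hypothetical, and the embedded storage replaces the
allocation the kernel would do):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct in_device. */
struct lazy_in_device { int naddrs; };

struct lazy_netdev {
	bool lightweight;
	struct lazy_in_device *in_dev;	/* NULL until IPv4 state exists */
	struct lazy_in_device storage;	/* avoids malloc in this sketch */
};

static struct lazy_in_device *lazy_inetdev_init(struct lazy_netdev *dev)
{
	dev->storage.naddrs = 0;
	dev->in_dev = &dev->storage;
	return dev->in_dev;
}

/* Registration: regular devices get IPv4 state now, lightweight ones later. */
static void lazy_register(struct lazy_netdev *dev)
{
	if (dev->lightweight)
		return;
	lazy_inetdev_init(dev);
}

/* Address add: create the deferred state on first use. Returns 0 or -1. */
static int lazy_add_addr(struct lazy_netdev *dev)
{
	struct lazy_in_device *in_dev = dev->in_dev;

	if (!in_dev) {
		if (dev->lightweight)
			in_dev = lazy_inetdev_init(dev);
		if (!in_dev)
			return -1;
	}
	in_dev->naddrs++;
	return 0;
}
```

A lightweight device that never gets an address never pays for the IPv4
state, which appears to be the memory saving the series is after.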

[PATCH RFC net-next 6/6] net: add uapi for creating lightweight devices

2017-05-06 Thread David Ahern

Allow users to make new devices lightweight by setting IFLA_LWT_NETDEV
attribute in the newlink request.

Signed-off-by: David Ahern 
---
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c | 10 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 8e56ac70e0d1..f57a16e542b7 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -157,6 +157,7 @@ enum {
IFLA_GSO_MAX_SIZE,
IFLA_PAD,
IFLA_XDP,
+   IFLA_LWT_NETDEV,
__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a4db1cd91c4a..9c18e6dec379 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2378,6 +2378,7 @@ struct net_device *rtnl_create_link(struct net *net,
struct net_device *dev;
unsigned int num_tx_queues = 1;
unsigned int num_rx_queues = 1;
+   unsigned int flags = 0;
 
if (tb[IFLA_NUM_TX_QUEUES])
num_tx_queues = nla_get_u32(tb[IFLA_NUM_TX_QUEUES]);
@@ -2389,8 +2390,15 @@ struct net_device *rtnl_create_link(struct net *net,
else if (ops->get_num_rx_queues)
num_rx_queues = ops->get_num_rx_queues();
 
+   if (tb[IFLA_LWT_NETDEV]) {
+   u8 lwt_dev = !!nla_get_u8(tb[IFLA_LWT_NETDEV]);
+
+   if (lwt_dev)
+   flags |= IFF_LWT_NETDEV;
+   }
+
dev = alloc_netdev_mqs(ops->priv_size, ifname, name_assign_type,
-  ops->setup, num_tx_queues, num_rx_queues, 0);
+  ops->setup, num_tx_queues, num_rx_queues, flags);
if (!dev)
return ERR_PTR(-ENOMEM);
 
-- 
2.11.0 (Apple Git-81)

[PATCH 0/2] batman-adv: Fine-tuning for three function implementations

2017-05-06 Thread SF Markus Elfring

From: Markus Elfring 
Date: Sat, 6 May 2017 18:03:45 +0200

Two update suggestions were taken into account
from static source code analysis.

Markus Elfring (2):
  Replace a seq_puts() call by seq_putc() in two functions
  Combine two seq_puts() calls into one call in batadv_nc_nodes_seq_print_text()

 net/batman-adv/bat_iv_ogm.c | 2 +-
 net/batman-adv/bat_v.c  | 2 +-
 net/batman-adv/network-coding.c | 4 +---
 3 files changed, 3 insertions(+), 5 deletions(-)

-- 
2.12.2

[PATCH 1/2] batman-adv: Replace a seq_puts() call by seq_putc() in two functions

2017-05-06 Thread SF Markus Elfring

From: Markus Elfring 
Date: Sat, 6 May 2017 17:50:13 +0200

Two single characters (line breaks) should be put into a sequence.
Thus use the corresponding function "seq_putc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 net/batman-adv/bat_iv_ogm.c | 2 +-
 net/batman-adv/bat_v.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/batman-adv/bat_iv_ogm.c b/net/batman-adv/bat_iv_ogm.c
index 495ba7cdcb04..1f80392ab37c 100644
--- a/net/batman-adv/bat_iv_ogm.c
+++ b/net/batman-adv/bat_iv_ogm.c
@@ -1944,7 +1944,7 @@ static void batadv_iv_ogm_orig_print(struct batadv_priv 
*bat_priv,
 
batadv_iv_ogm_orig_print_neigh(orig_node, if_outgoing,
   seq);
-   seq_puts(seq, "\n");
+   seq_putc(seq, '\n');
batman_count++;
 
 next:
diff --git a/net/batman-adv/bat_v.c b/net/batman-adv/bat_v.c
index a36c8e7291d6..4e2724c5b33d 100644
--- a/net/batman-adv/bat_v.c
+++ b/net/batman-adv/bat_v.c
@@ -400,7 +400,7 @@ static void batadv_v_orig_print(struct batadv_priv 
*bat_priv,
   neigh_node->if_incoming->net_dev->name);
 
batadv_v_orig_print_neigh(orig_node, if_outgoing, seq);
-   seq_puts(seq, "\n");
+   seq_putc(seq, '\n');
batman_count++;
 
 next:
-- 
2.12.2

[PATCH 2/2] batman-adv: Combine two seq_puts() calls into one call in batadv_nc_nodes_seq_print_text()

2017-05-06 Thread SF Markus Elfring

From: Markus Elfring 
Date: Sat, 6 May 2017 17:57:36 +0200

A bit of text was put into a sequence by two separate function calls.
Print the same data by a single function call instead.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring 
---
 net/batman-adv/network-coding.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/batman-adv/network-coding.c b/net/batman-adv/network-coding.c
index e1f6fc72fe3e..3604d7899e2c 100644
--- a/net/batman-adv/network-coding.c
+++ b/net/batman-adv/network-coding.c
@@ -1935,9 +1935,7 @@ int batadv_nc_nodes_seq_print_text(struct seq_file *seq, 
void *offset)
list)
seq_printf(seq, "%pM ",
   nc_node->addr);
-   seq_puts(seq, "\n");
-
-   seq_puts(seq, " Outgoing: ");
+   seq_puts(seq, "\n Outgoing: ");
/* For out_nc_node to this orig_node */
list_for_each_entry_rcu(nc_node,
&orig_node->out_coding_list,
-- 
2.12.2

Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit

2017-05-06 Thread Shubham Bansal

Hi Daniel,

Thanks for the last reply about the testing of eBPF JIT.

I have one issue though: I am not able to find what the BPF_ABS and
BPF_IND instructions do exactly. They are not described at this link
either: https://www.kernel.org/doc/Documentation/networking/filter.txt
Can you please tell me where I could find a description of these
instructions?
Best,
Shubham Bansal

On Thu, Apr 6, 2017 at 6:21 PM, Daniel Borkmann  wrote:
> On 04/06/2017 01:05 PM, Shubham Bansal wrote:
>>
>> Gentle Reminder.
>
>
> Sorry for late reply.
>
>> Anybody can tell me how to test the JIT compiler ?
>
>
> There's lib/test_bpf.c, see Documentation/networking/filter.txt +1349
> for some more information. It basically contains various test cases that
> have the purpose to test the JIT with corner cases. If you see a useful
> test missing, please send a patch for it, so all other JITs can benefit
> from this as well. For extracting disassembly from a generated test case,
> check out bpf_jit_disasm (Documentation/networking/filter.txt +486).
>
> Thanks,
> Daniel

Re: Why do we need MSG_SENDPAGE_NOTLAST?

2017-05-06 Thread Eric Dumazet

Do not top-post on netdev, please.


On Sat, 2017-05-06 at 05:46 +, Ilya Lesokhin wrote:
> I don't follow.
> Why can't splice use MSG_MORE for the individual pages?
> Why does tcp_sendpage need to know if the MORE indicator is coming from the 
> user or from splice?
> 
> I also don't understand your comment about partial writes.
> 

Make sure that sendpage() won't end up with a stall on TCP, if the socket
does not have enough room to store the 16 pages provided by splice() or
sendpage().

Just use MSG_SENDPAGE_NOTLAST and be happy.


> Thanks,
> Ilya
> 
> > -Original Message-
> > From: Eric Dumazet [mailto:eric.duma...@gmail.com]
> > Sent: Thursday, May 4, 2017 9:33 PM
> > To: Ilya Lesokhin 
> > Cc: netdev@vger.kernel.org; tls-fpga-sw-dev  > d...@mellanox.com>; Dave Watson 
> > Subject: Re: Why do we need MSG_SENDPAGE_NOTLAST?
> > 
> > On Thu, 2017-05-04 at 17:03 +, Ilya Lesokhin wrote:
> > > I don't understand the need for MSG_SENDPAGE_NOTLAST and I'm hoping
> > > someone can enlighten me.
> > >
> > > According to commit 35f9c09 ('tcp: tcp_sendpages() should call
> > > tcp_push() once'):
> > > "We need to call tcp_flush() at the end of the last page processed in
> > > tcp_sendpages(), or else transmits can be deferred and future sends
> > > stall."
> > >
> > > I don't understand why we need to differentiate between the user
> > > setting MSG_MORE
> > > and splice indicating that more data is going to be sent.
> > > if the user passed MSG_MORE and didn't push any extra data, isn't it
> > > the users fault?
> > > Do we need it because poorly written applications were broken when
> > > MSG_MORE was added to tcp_sendpage? Or is there a deeper reason?
> > >
> > 
> > The answer lies in how splice() works.
> > 
> > User can issue one splice without MSG_MORE semantic, right ?
> > 
> > Still, we want an implicit MORE behavior for all individual pages, but
> > the last one.
> > 
> > 
> > > The reason I'm asking is that we are working on a kernel TLS
> > > implementation
> > > and I would like to know if we can coalesce multiple tls_sendpage
> > > calls with MSG_MORE into a single
> > > tls record or whether we must push out the record as soon as
> > > MSG_SENDPAGE_NOTLAST is cleared?
> > 
> > Make sure you handle partial writes (you want to coalesce 10 pages, but
> > stack will only take 5 of them)
> > 
> > 
>
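
To make the "implicit MORE for every page but the last" point concrete, here
is a small userspace model of the per-page flag decision. It is a simplified
sketch, not the splice or TCP code: toy_page_flags() and toy_should_push()
are hypothetical helpers, and the TOY_ constants are local illustrative
values rather than the definitions from include/linux/socket.h.

```c
#include <stddef.h>

/* Illustrative flag values, defined locally for this sketch. */
#define TOY_MSG_MORE             0x8000
#define TOY_MSG_SENDPAGE_NOTLAST 0x20000

/*
 * Flags for page number idx (0-based) out of npages in one splice call.
 * user_more is nonzero when the caller itself asked for "more data follows".
 */
static int toy_page_flags(size_t idx, size_t npages, int user_more)
{
	int flags = user_more ? TOY_MSG_MORE : 0;

	if (idx + 1 < npages)
		flags |= TOY_MSG_SENDPAGE_NOTLAST;	/* implicit "more" */
	return flags;
}

/* Simplified sender: flush only when neither hint is present. */
static int toy_should_push(int flags)
{
	return !(flags & (TOY_MSG_MORE | TOY_MSG_SENDPAGE_NOTLAST));
}
```

With two separate bits, the last page of a splice issued without MSG_MORE
carries no hint at all and gets pushed, while a user-supplied MSG_MORE still
holds data back. Folding both into MSG_MORE would lose that distinction.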

Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING

2017-05-06 Thread Alexander Duyck

On Fri, May 5, 2017 at 8:08 PM, Ding Tianhong  wrote:
>
>
> On 2017/5/5 22:04, Alexander Duyck wrote:
>> On Thu, May 4, 2017 at 2:01 PM, Casey Leedom  wrote:
>>> | From: Alexander Duyck 
>>> | Sent: Wednesday, May 3, 2017 9:02 AM
>>> | ...
>>> | It sounds like we are more or less in agreement. My only concern is
>>> | really what we default this to. On x86 I would say we could probably
>>> | default this to disabled for existing platforms since my understanding
>>> | is that relaxed ordering doesn't provide much benefit on what is out
>>> | there right now when performing DMA through the root complex. As far
>>> | as peer-to-peer I would say we should probably look at enabling the
>>> | ability to have Relaxed Ordering enabled for some channels but not
>>> | others. In those cases the hardware needs to be smart enough to allow
>>> | for you to indicate you want it disabled by default for most of your
>>> | DMA channels, and then enabled for the select channels that are
>>> | handling the peer-to-peer traffic.
>>>
>>>   Yes, I think that we are mostly in agreement.  I had just wanted to make
>>> sure that whatever scheme was developed would allow for simultaneously
>>> supporting non-Relaxed Ordering for some PCIe End Points and Relaxed
>>> Ordering for others within the same system.  I.e. not simply
>>> enabling/disabling/etc.  based solely on System Platform Architecture.
>>>
>>>   By the way, I've started our QA folks off looking at what things look like
>>> in Linux Virtual Machines under different Hypervisors to see what
>>> information they may provide to the VM in the way of what Root Complex Port
>>> is being used, etc.  So far they've got Windows HyperV done and there
>>> there's no PCIe Fabric exposed in any way: just the attached device.  I'll
>>> have to see what pci_find_pcie_root_port() returns in that environment.
>>> Maybe NULL?
>>
>> I believe NULL is one of the options. It all depends on what qemu is
>> emulating. Most likely you won't find a pcie root port on KVM because
>> the default is to emulate an older system that only supports PCI.
>>
>>>   With your reservations (which I also share), I think that it probably
>>> makes sense to have a per-architecture definition of the "Can I Use Relaxed
>>> Ordering With TLPs Directed At This End Point" predicate, with the default
>>> being "No" for any architecture which doesn't implement the predicate.  And
>>> if the specified (struct pci_dev *) End Node is NULL, it ought to return
>>> False for that as well.  I can't see any reason to pass in the Source End
>>> Node but I may be missing something.
>>>
>>>   At this point, this is pretty far outside my level of expertise.  I'm
>>> happy to give it a go, but I'd be even happier if someone with a lot more
>>> experience in the PCIe Infrastructure were to want to carry the ball
>>> forward.  I'm not super familiar with the Linux Kernel "Rules Of
>>> Engagement", so let me know what my next step should be.  Thanks.
>>>
>>> Casey
>>
>> For now we can probably keep this on the linux-pci mailing list. Going
>> that route is the most straightforward for now since step one is
>> probably just making sure we are setting the relaxed ordering bit in
>> the setups that make sense. I would say we could probably keep it
>> simple. We just need to enable relaxed ordering by default for SPARC
>> architectures, on most others we can probably default it to off.
>>
>
> Casey, Alexander:
>
> Thanks for the wonderful discussion, it is clearer now what to do next.
> I agree that enabling relaxed ordering by default only for SPARC and ARM64
> is safer for all the other platforms, as no one wants to break anything.
>
>> I believe this all had started as Ding Tianhong was hoping to enable
>> this for the ARM architecture. That is the only one I can think of
>> where it might be difficult to figure out which way to default as we
>> were attempting to follow the same code that was enabled for SPARC and
>> that is what started this tug-of-war about how this should be done.
>> What we might do is take care of this in two phases. The first one
>> enables the infrastructure generically but leaves it defaulted to off
>> for everyone but SPARC. Then we can go through and start enabling it
>> for other platforms such as some of those on ARM in the platforms that
>> Ding Tianhong was working with.
>>
>
> According to the suggestion, I could only think of this code:
>
> @@ -3979,6 +3979,15 @@ static void quirk_tw686x_class(struct pci_dev *pdev)
>  DECLARE_PCI_FIXUP_CLASS_EARLY(0x1797, 0x6869, PCI_CLASS_NOT_DEFINED, 8,
>   quirk_tw686x_class);
>
> +static void quirk_relaxedordering_disable(struct pci_dev *dev)
> +{
> + if (dev->vendor != PCI_VENDOR_ID_HUAWEI &&
> + dev->vendor != PCI_VENDOR_ID_SUN)
> + dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
> +}
> +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_INTEL_ID, PCI_ANY_ID, 
> PCI_CLASS_NOT_DEFINED, 8,
> +   quirk_relaxedorde

Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit

2017-05-06 Thread David Miller

From: Shubham Bansal 
Date: Sat, 6 May 2017 22:18:16 +0530

> Hi Daniel,
> 
> Thanks for the last reply about the testing of eBPF JIT.
> 
> I have one issue though: I am not able to find what the BPF_ABS and
> BPF_IND instructions do exactly.

They are not instructions, they are modifiers for the BPF_LD
instruction which indicate an SKB load is to be performed.

You never need to ask what a BPF instruction does, it is clearly
defined in the BPF interpreter found in kernel/bpf/core.c.

Look for the case statement LD_ABS_W and friends in __bpf_prog_run().
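
As a rough illustration of what the two modifiers select, the load can be
modelled in userspace as "A = packet[k]" for BPF_ABS and "A = packet[X + k]"
for BPF_IND. This is a sketch under assumptions, not the interpreter itself:
toy_ld() is a hypothetical helper, the TOY_ constants are local copies made
for illustration, and the special negative offsets handled by the kernel are
left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Local illustrative values for the size and mode fields. */
#define TOY_BPF_H   0x08	/* 16-bit load */
#define TOY_BPF_B   0x10	/* 8-bit load */
#define TOY_BPF_ABS 0x20	/* offset = k */
#define TOY_BPF_IND 0x40	/* offset = X + k */

/*
 * Model of a packet load with the ABS or IND modifier.
 * Returns 0 and stores the loaded value in *a, or -1 if the access
 * falls outside the packet.
 */
static int toy_ld(int size, int mode, uint32_t k, uint32_t x,
		  const uint8_t *pkt, size_t len, uint32_t *a)
{
	uint64_t off = (mode == TOY_BPF_IND) ? (uint64_t)x + k : k;
	size_t width = (size == TOY_BPF_B) ? 1 : 2;

	if (off > len || width > len - off)
		return -1;
	if (size == TOY_BPF_B)
		*a = pkt[off];
	else	/* multi-byte loads read network (big-endian) byte order */
		*a = ((uint32_t)pkt[off] << 8) | pkt[off + 1];
	return 0;
}
```

For example, a half-word ABS load at offset 2 of the bytes
45 00 08 00 11 yields 0x0800, and a byte IND load with X = 3 and k = 1
yields 0x11.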

[PATCH iproute2 v2 1/1] vxlan: Add support for modifying vxlan device attributes

2017-05-06 Thread Girish Moodalbail

The ability to change vxlan device attributes was added to the kernel through
commit 8bcdc4f3a20b ("vxlan: add changelink support"), however one
cannot do the same through the ip(8) command.  Changing the allowed vxlan
device attributes using 'ip link set dev  type vxlan
' currently fails with 'operation not supported'
error.  This failure is due to the incorrect rtnetlink message
construction for the 'ip link set' operation.

The vxlan_parse_opt() callback function is called for parsing options
for both 'ip link add' and 'ip link set'. For the 'add' case, we pass
down default values for those attributes that were not provided as CLI
options. However, for the 'set' case we should be only passing down the
explicitly provided attributes and not any other (default) attributes.

Signed-off-by: Girish Moodalbail 
---
 ip/iplink_vxlan.c | 251 +++---
 1 file changed, 143 insertions(+), 108 deletions(-)

diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index b4ebb13..2bd619d 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -21,6 +21,8 @@
 #include "utils.h"
 #include "ip_common.h"
 
+#define VXLAN_ATTRSET(attrs, type) (((attrs) & (1L << (type))) != 0)
+
 static void print_explain(FILE *f)
 {
fprintf(f,
@@ -59,54 +61,50 @@ static void explain(void)
print_explain(stderr);
 }
 
+static void check_duparg(__u64 *attrs, int type, const char *key,
+const char *argv)
+{
+   if (!VXLAN_ATTRSET(*attrs, type)) {
+   *attrs |= (1L << type);
+   return;
+   }
+   duparg2(key, argv);
+}
+
 static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
  struct nlmsghdr *n)
 {
__u32 vni = 0;
-   int vni_set = 0;
-   __u32 saddr = 0;
__u32 gaddr = 0;
__u32 daddr = 0;
-   struct in6_addr saddr6 = IN6ADDR_ANY_INIT;
struct in6_addr gaddr6 = IN6ADDR_ANY_INIT;
struct in6_addr daddr6 = IN6ADDR_ANY_INIT;
-   unsigned int link = 0;
-   __u8 tos = 0;
-   __u8 ttl = 0;
-   __u32 label = 0;
__u8 learning = 1;
-   __u8 proxy = 0;
-   __u8 rsc = 0;
-   __u8 l2miss = 0;
-   __u8 l3miss = 0;
-   __u8 noage = 0;
-   __u32 age = 0;
-   __u32 maxaddr = 0;
__u16 dstport = 0;
-   __u8 udpcsum = 0;
-   bool udpcsum_set = false;
-   __u8 udp6zerocsumtx = 0;
-   bool udp6zerocsumtx_set = false;
-   __u8 udp6zerocsumrx = 0;
-   bool udp6zerocsumrx_set = false;
-   __u8 remcsumtx = 0;
-   __u8 remcsumrx = 0;
__u8 metadata = 0;
-   __u8 gbp = 0;
-   __u8 gpe = 0;
-   int dst_port_set = 0;
-   struct ifla_vxlan_port_range range = { 0, 0 };
+   __u64 attrs = 0;
+   bool set_op = (n->nlmsg_type == RTM_NEWLINK &&
+  !(n->nlmsg_flags & NLM_F_CREATE));
 
while (argc > 0) {
if (!matches(*argv, "id") ||
!matches(*argv, "vni")) {
+   /* We will add ID attribute outside of the loop since we
+* need to consider metadata information as well.
+*/
NEXT_ARG();
+   check_duparg(&attrs, IFLA_VXLAN_ID, "id", *argv);
if (get_u32(&vni, *argv, 0) ||
vni >= 1u << 24)
invarg("invalid id", *argv);
-   vni_set = 1;
} else if (!matches(*argv, "group")) {
+   if (daddr || !IN6_IS_ADDR_UNSPECIFIED(&daddr6)) {
+   fprintf(stderr, "vxlan: both group and remote");
+   fprintf(stderr, " cannot be specified\n");
+   return -1;
+   }
NEXT_ARG();
+   check_duparg(&attrs, IFLA_VXLAN_GROUP, "group", *argv);
if (!inet_get_addr(*argv, &gaddr, &gaddr6)) {
fprintf(stderr, "Invalid address \"%s\"\n", *argv);
return -1;
@@ -114,7 +112,13 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
if (!IN6_IS_ADDR_MULTICAST(&gaddr6) && !IN_MULTICAST(ntohl(gaddr)))
invarg("invalid group address", *argv);
} else if (!matches(*argv, "remote")) {
+   if (gaddr || !IN6_IS_ADDR_UNSPECIFIED(&gaddr6)) {
+   fprintf(stderr, "vxlan: both group and remote");
+   fprintf(stderr, " cannot be specified\n");
+   return -1;
+   }
NEXT_ARG();
+   check_duparg(&attrs, IFLA_VXLAN_GROUP, "remote", *argv);
if (!inet_get_addr(*argv, &daddr, &daddr6)) {

[PATCH iproute2 v2 0/1] vxlan: support for modifying vxlan device

2017-05-06 Thread Girish Moodalbail

Hello all,

This patch adds support for modifying VXLAN device attributes. I have
refactored the vxlan_parse_opt() function to be more readable and to
avoid using a lot of bool variables.

I have tested my changes by running Linux Test Project's VXLAN
testcases, and I didn't see any regression.
---
v1->v2
- refactored vxlan_parse_opt() to not use a bunch of
  foo_set variables

Girish Moodalbail (1):
  vxlan: Add support for modifying vxlan device attributes

 ip/iplink_vxlan.c | 251 +++---
 1 file changed, 143 insertions(+), 108 deletions(-)

-- 
1.8.3.1
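
As a rough illustration of the refactoring described in this cover letter,
the sketch below tracks which options were given on the command line in a
single 64-bit mask instead of one foo_set flag per option. The attribute
numbers and helper names (ATTR_ID, attr_mark(), should_send()) are
hypothetical stand-ins, not iproute2 identifiers; the patch itself uses the
IFLA_VXLAN_* values and reports duplicates through duparg2().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical attribute numbers; the patch uses the IFLA_VXLAN_* values. */
enum { ATTR_ID = 1, ATTR_GROUP = 2, ATTR_TTL = 5, ATTR_MAX = 63 };

/* True if the bit for 'type' is already set in the mask. */
static bool attr_is_set(uint64_t attrs, int type)
{
	assert(type >= 0 && type <= ATTR_MAX);
	return (attrs & (UINT64_C(1) << type)) != 0;
}

/* Record that 'type' was seen on the command line.
 * Returns false if it had already been seen (a duplicate argument).
 */
static bool attr_mark(uint64_t *attrs, int type)
{
	if (attr_is_set(*attrs, type))
		return false;
	*attrs |= UINT64_C(1) << type;
	return true;
}

/* Assumed policy, following the commit message: on "add" an attribute is
 * sent even when only its default is known; on "set" it is sent only if
 * the user gave it explicitly.
 */
static bool should_send(uint64_t attrs, int type, bool set_op)
{
	return attr_is_set(attrs, type) || !set_op;
}
```

With this layout, adding a new option costs one bit in the mask rather than
a new local variable plus its companion flag.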

Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit

2017-05-06 Thread Shubham Bansal

Thanks David.

Hi all,

I have two questions about the code at arch/arm64/net/bpf_jit_comp.c.

1. At line 708, " const u8 r1 = bpf2a64[BPF_REG_1]; /* r1: struct
sk_buff *skb */ ".
Why is this code using BPF_REG_1 before saving it? As far as I
know, BPF_REG_1 holds the pointer to the BPF program context, and this
code clearly overwrites that pointer, which makes it useless
for later use. It looks like a bug.

2. At line 256, " emit(A64_LDR64(prg, tmp, r3), ctx); ".
This line of code loads an element of an array of pointers,
where r3 is the index into that array. Shouldn't the index be
shifted left by 3 (that is, multiplied by 8) to get the right
address in that array of pointers ?
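
To make the second question concrete, here is a small sketch of the address
arithmetic involved, written in plain C rather than as JIT-emitted arm64
code. It assumes 8-byte pointers, as on arm64; ptr_slot_addr() is a
hypothetical helper and not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Byte address of element 'index' in an array of 8-byte pointers that
 * starts at 'base'. Adding the raw index would give base + index, which
 * is only correct for index 0; scaling by 8 (a left shift by 3) gives
 * the address of the element itself.
 */
static uintptr_t ptr_slot_addr(uintptr_t base, uint32_t index)
{
	return base + ((uintptr_t)index << 3);
}
```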

Apologies if any of the above questions are stupid to ask.

Best,
Shubham Bansal

On Sun, May 7, 2017 at 12:08 AM, David Miller  wrote:
> From: Shubham Bansal 
> Date: Sat, 6 May 2017 22:18:16 +0530
>
>> Hi Daniel,
>>
>> Thanks for the last reply about the testing of eBPF JIT.
>>
>> I have one issue though, I am not able to find what the BPF_ABS and
>> BPF_IND instructions do exactly.
>
> They are not instructions, they are modifiers for the BPF_LD
> instruction which indicate an SKB load is to be performed.
>
> You never need to ask what a BPF instruction does, it is clearly
> defined in the BPF interpreter found in kernel/bpf/core.c
>
> Look for the case statement LD_ABS_W and friends in __bpf_prog_run().
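
As a small illustration of the point made in the quoted reply, the sketch
below shows how a classic BPF load opcode is composed from a class, a size
and a mode, and how the mode bits select an absolute or indirect packet
load. The numeric values are the ones defined in the uapi header
linux/bpf_common.h; they are redefined here with an EX_ prefix so the sketch
builds without kernel headers, and skb_load_kind() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as BPF_LD, BPF_W, ... in linux/bpf_common.h. */
#define EX_BPF_LD   0x00	/* instruction class: load */
#define EX_BPF_W    0x00	/* size: 32-bit word */
#define EX_BPF_H    0x08	/* size: 16-bit half word */
#define EX_BPF_B    0x10	/* size: byte */
#define EX_BPF_ABS  0x20	/* mode: packet load at a fixed offset */
#define EX_BPF_IND  0x40	/* mode: packet load at register + offset */

#define EX_BPF_CLASS(code) ((code) & 0x07)
#define EX_BPF_SIZE(code)  ((code) & 0x18)
#define EX_BPF_MODE(code)  ((code) & 0xe0)

/* Returns 1 for an absolute packet load, 2 for an indirect packet load,
 * and 0 for any other opcode.
 */
static int skb_load_kind(uint8_t code)
{
	if (EX_BPF_CLASS(code) != EX_BPF_LD)
		return 0;

	switch (EX_BPF_MODE(code)) {
	case EX_BPF_ABS:
		return 1;
	case EX_BPF_IND:
		return 2;
	default:
		return 0;
	}
}
```

For example, EX_BPF_LD | EX_BPF_W | EX_BPF_ABS is the opcode byte 0x20, which
corresponds to the LD_ABS_W case mentioned above.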

[PATCH iproute2] tc: bpf: add ppc64 and sparc64 to list of archs with eBPF support

2017-05-06 Thread Alexander Alemayhu

sparc64 support was added in 7a12b5031c6b (sparc64: Add eBPF JIT., 2017-04-17)[0]
and ppc64 in 156d0e290e96 (powerpc/ebpf/jit: Implement JIT compiler for extended BPF, 2016-06-22)[1].

[0]: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=7a12b5031c6b
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=156d0e290e96
Signed-off-by: Alexander Alemayhu 
---
 man/man8/tc-bpf.8 | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/man/man8/tc-bpf.8 b/man/man8/tc-bpf.8
index e371964d06ab..2e9812ede028 100644
--- a/man/man8/tc-bpf.8
+++ b/man/man8/tc-bpf.8
@@ -75,9 +75,9 @@ In Linux, it's generally considered that eBPF is the successor of cBPF.
 The kernel internally transforms cBPF expressions into eBPF expressions and
 executes the latter. Execution of them can be performed in an interpreter
 or at setup time, they can be just-in-time compiled (JIT'ed) to run as
-native machine code. Currently, x86_64, ARM64 and s390 architectures have
-eBPF JIT support, whereas PPC, SPARC, ARM and MIPS have cBPF, but did not
-(yet) switch to eBPF JIT support.
+native machine code. Currently, x86_64, ARM64, s390, ppc64 and sparc64
+architectures have eBPF JIT support, whereas PPC, SPARC, ARM and MIPS have
+cBPF, but did not (yet) switch to eBPF JIT support.
 
 eBPF's instruction set has similar underlying principles as the cBPF
 instruction set, it however is modelled closer to the underlying
-- 
2.7.4

[RFC PATCH 0/3] udp: scalability improvements

2017-05-06 Thread Paolo Abeni

This patch series implements an idea suggested by Eric Dumazet to
reduce the contention of the udp sk_receive_queue lock when the socket is
under flood.

An ancillary queue is added to the udp socket, and the socket always
tries first to read packets from that queue. If it's empty, we splice
the content from sk_receive_queue into the ancillary queue.
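
The scheme described above can be sketched in user-space C as follows. This
is a single-threaded model under stated assumptions, not the kernel code:
the spinlock is replaced by a counter that records how often the lock would
be taken, the reader queue's own lock is left out, and all names (struct
two_queue_sock, sock_enqueue(), sock_recv()) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct pkt {
	struct pkt *next;
	int id;
};

struct pkt_queue {
	struct pkt *head, *tail;
	unsigned long lock_acquisitions; /* stand-in for a spinlock */
};

static void queue_push(struct pkt_queue *q, struct pkt *p)
{
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
}

static struct pkt *queue_pop(struct pkt_queue *q)
{
	struct pkt *p = q->head;

	if (!p)
		return NULL;
	q->head = p->next;
	if (!q->head)
		q->tail = NULL;
	p->next = NULL;
	return p;
}

/* Move every packet from 'from' to the tail of 'to'; 'from' ends up empty. */
static void queue_splice_tail_init(struct pkt_queue *from, struct pkt_queue *to)
{
	if (!from->head)
		return;
	if (to->tail)
		to->tail->next = from->head;
	else
		to->head = from->head;
	to->tail = from->tail;
	from->head = from->tail = NULL;
}

struct two_queue_sock {
	struct pkt_queue rx;     /* shared with producers, contended */
	struct pkt_queue reader; /* private to the receiving side */
};

/* Producer side: every enqueue takes the shared rx lock. */
static void sock_enqueue(struct two_queue_sock *s, struct pkt *p)
{
	s->rx.lock_acquisitions++;
	queue_push(&s->rx, p);
}

/* Consumer side: the shared lock is taken only when the private queue
 * is empty, and then the whole backlog is moved over in one go.
 */
static struct pkt *sock_recv(struct two_queue_sock *s)
{
	struct pkt *p = queue_pop(&s->reader);

	if (p)
		return p;
	s->rx.lock_acquisitions++;
	queue_splice_tail_init(&s->rx, &s->reader);
	return queue_pop(&s->reader);
}
```

In this model a reader that finds N packets queued takes the shared lock once
for all of them instead of once per packet, which is where the reduced
contention is expected to come from.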

The first patch introduces some helpers to keep the udp code small, and the
following two implement the ancillary queue strategy. The code is split
to hopefully help the reviewing process.

The measured overall gain under UDP flood is in the 20-35% range, depending on
the NUMA layout and the number of ingress queues used by the relevant NIC.

On a single NUMA node host, peak throughput is now reached when the traffic
targeting the UDP socket uses multiple NIC rx queues, while on current net-next
throughput always decreases when moving from a single rx queue to multiple ones.


Paolo Abeni (3):
  net/sock: factor out dequeue/peek with offset code
  udp: use a separate rx queue for packet reception
  udp: keep the sk_receive_queue held when splicing

 include/linux/skbuff.h |   7 +++
 include/linux/udp.h|   3 +
 include/net/sock.h |   4 +-
 include/net/udp.h  |   9 +--
 include/net/udplite.h  |   2 +-
 net/core/datagram.c|  90 +++
 net/ipv4/udp.c | 162 +++--
 net/ipv6/udp.c |   3 +-
 8 files changed, 211 insertions(+), 69 deletions(-)

-- 
2.9.3

[RFC PATCH 3/3] udp: keep the sk_receive_queue held when splicing

2017-05-06 Thread Paolo Abeni

On packet reception, when we are forced to splice the
sk_receive_queue, we can keep the related lock held, so
that we can avoid re-acquiring it, if fwd memory
scheduling is required.
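
The pattern this patch relies on, a helper that takes a lock only when its
caller does not already hold it, can be sketched as below. The lock is a
bookkeeping stand-in rather than a real spinlock, and the names (struct
fake_lock, account_release()) are hypothetical; the sketch only models the
control flow added to udp_rmem_release().

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a spinlock: not thread safe, it only records state. */
struct fake_lock {
	bool held;
	unsigned int acquisitions;
};

static void fake_lock_acquire(struct fake_lock *l)
{
	assert(!l->held); /* a second acquire would deadlock a real spinlock */
	l->held = true;
	l->acquisitions++;
}

static void fake_lock_release(struct fake_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Update accounting that must be protected by 'l'. When 'lock_held' is
 * true the caller already owns the lock, so it is neither taken nor
 * released here.
 */
static void account_release(struct fake_lock *l, int *forward_alloc,
			    int size, bool lock_held)
{
	if (!lock_held)
		fake_lock_acquire(l);

	assert(l->held);
	*forward_alloc += size;

	if (!lock_held)
		fake_lock_release(l);
}
```

The cost of this design is that the flag passed by each caller must match
the real lock state; a wrong value either deadlocks or updates the counter
unprotected.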

Signed-off-by: Paolo Abeni 
---
 net/ipv4/udp.c | 36 ++--
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 492c76b..d698973 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1164,7 +1164,8 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 }
 
 /* fully reclaim rmem/fwd memory allocated for skb */
-static void udp_rmem_release(struct sock *sk, int size, int partial)
+static void udp_rmem_release(struct sock *sk, int size, int partial,
+int rx_queue_lock_held)
 {
struct udp_sock *up = udp_sk(sk);
struct sk_buff_head *sk_queue;
@@ -1181,9 +1182,13 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
}
up->forward_deficit = 0;
 
-   /* acquire the sk_receive_queue for fwd allocated memory scheduling */
+   /* acquire the sk_receive_queue for fwd allocated memory scheduling,
+* if the caller doesn't hold it already
+*/
sk_queue = &sk->sk_receive_queue;
-   spin_lock(&sk_queue->lock);
+   if (!rx_queue_lock_held)
+   spin_lock(&sk_queue->lock);
+
 
sk->sk_forward_alloc += size;
amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
@@ -1197,7 +1202,8 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
/* this can save us from acquiring the rx queue lock on next receive */
skb_queue_splice_tail_init(sk_queue, &up->reader_queue);
 
-   spin_unlock(&sk_queue->lock);
+   if (!rx_queue_lock_held)
+   spin_unlock(&sk_queue->lock);
 }
 
 /* Note: called with reader_queue.lock held.
@@ -1207,10 +1213,16 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
  */
 void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
 {
-   udp_rmem_release(sk, skb->dev_scratch, 1);
+   udp_rmem_release(sk, skb->dev_scratch, 1, 0);
 }
 EXPORT_SYMBOL(udp_skb_destructor);
 
+/* as above, but the caller held the rx queue lock, too */
+void udp_skb_dtor_locked(struct sock *sk, struct sk_buff *skb)
+{
+   udp_rmem_release(sk, skb->dev_scratch, 1, 1);
+}
+
 /* Idea of busylocks is to let producers grab an extra spinlock
  * to relieve pressure on the receive_queue spinlock shared by consumer.
  * Under flood, this means that only one producer can be in line
@@ -1325,7 +1337,7 @@ void udp_destruct_sock(struct sock *sk)
total += skb->truesize;
kfree_skb(skb);
}
-   udp_rmem_release(sk, total, 0);
+   udp_rmem_release(sk, total, 0, 1);
 
inet_sock_destruct(sk);
 }
@@ -1397,7 +1409,7 @@ static int first_packet_length(struct sock *sk)
}
res = skb ? skb->len : -1;
if (total)
-   udp_rmem_release(sk, total, 1);
+   udp_rmem_release(sk, total, 1, 0);
spin_unlock_bh(&rcvq->lock);
return res;
 }
@@ -1471,16 +1483,20 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
goto busy_check;
}
 
-   /* refill the reader queue and walk it again */
+   /* refill the reader queue and walk it again
+* keep both queues locked to avoid re-acquiring
+* the sk_receive_queue lock if fwd memory scheduling
+* is needed.
+*/
_off = *off;
spin_lock(&sk_queue->lock);
skb_queue_splice_tail_init(sk_queue, queue);
-   spin_unlock(&sk_queue->lock);
 
skb = __skb_try_recv_from_queue(sk, queue, flags,
-   udp_skb_destructor,
+   udp_skb_dtor_locked,
peeked, &_off, err,
&last);
+   spin_unlock(&sk_queue->lock);
spin_unlock_bh(&queue->lock);
if (skb) {
*off = _off;
-- 
2.9.3

[RFC PATCH 1/3] net/sock: factor out dequeue/peek with offset code

2017-05-06 Thread Paolo Abeni

And update __sk_queue_drop_skb() to work on the specified queue.
This will help the udp protocol to use an additional private
rx queue in a later patch.
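
The peek-with-offset walk that this patch factors out can be illustrated
with the simplified model below. It applies the same skip condition as the
MSG_PEEK branch of __skb_try_recv_from_queue() in the diff, but over a plain
array; struct fake_skb and peek_at_offset() are hypothetical names, and
reference counting, error handling and the non-peek dequeue path are left
out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_skb {
	int len;     /* payload length */
	bool peeked; /* already returned by an earlier peek */
};

/* Returns the index of the packet that a peek at byte offset *off lands
 * in, or -1 if the offset is past the queued data. On success *off is
 * the remaining offset inside that packet.
 */
static int peek_at_offset(const struct fake_skb *q, size_t n, int *off)
{
	for (size_t i = 0; i < n; i++) {
		/* Skip packets that lie entirely before the offset. A
		 * zero-length packet is skipped only if it was already
		 * peeked or the offset is non-zero.
		 */
		if (*off >= q[i].len &&
		    (q[i].len || *off || q[i].peeked)) {
			*off -= q[i].len;
			continue;
		}
		return (int)i;
	}
	return -1;
}
```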

Signed-off-by: Paolo Abeni 
---
 include/linux/skbuff.h |  7 
 include/net/sock.h |  4 +--
 net/core/datagram.c| 90 --
 3 files changed, 60 insertions(+), 41 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a098d95..bfc7892 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3056,6 +3056,13 @@ static inline void skb_frag_list_init(struct sk_buff *skb)
 
 int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
const struct sk_buff *skb);
+struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
+ struct sk_buff_head *queue,
+ unsigned int flags,
+ void (*destructor)(struct sock *sk,
+  struct sk_buff *skb),
+ int *peeked, int *off, int *err,
+ struct sk_buff **last);
 struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
void (*destructor)(struct sock *sk,
   struct sk_buff *skb),
diff --git a/include/net/sock.h b/include/net/sock.h
index 66349e4..49d226f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2035,8 +2035,8 @@ void sk_reset_timer(struct sock *sk, struct timer_list *timer,
 
 void sk_stop_timer(struct sock *sk, struct timer_list *timer);
 
-int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
-   unsigned int flags,
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
+   struct sk_buff *skb, unsigned int flags,
void (*destructor)(struct sock *sk,
   struct sk_buff *skb));
 int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index db1866f2..a4592b4 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -161,6 +161,43 @@ static struct sk_buff *skb_set_peeked(struct sk_buff *skb)
return skb;
 }
 
+struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
+ struct sk_buff_head *queue,
+ unsigned int flags,
+ void (*destructor)(struct sock *sk,
+  struct sk_buff *skb),
+ int *peeked, int *off, int *err,
+ struct sk_buff **last)
+{
+   struct sk_buff *skb;
+
+   *last = queue->prev;
+   skb_queue_walk(queue, skb) {
+   if (flags & MSG_PEEK) {
+   if (*off >= skb->len && (skb->len || *off ||
+skb->peeked)) {
+   *off -= skb->len;
+   continue;
+   }
+   if (!skb->len) {
+   skb = skb_set_peeked(skb);
+   if (unlikely(IS_ERR(skb))) {
+   *err = PTR_ERR(skb);
+   return skb;
+   }
+   }
+   *peeked = 1;
+   atomic_inc(&skb->users);
+   } else {
+   __skb_unlink(skb, queue);
+   if (destructor)
+   destructor(sk, skb);
+   }
+   return skb;
+   }
+   return NULL;
+}
+
 /**
  * __skb_try_recv_datagram - Receive a datagram skbuff
  * @sk: socket
@@ -216,46 +253,20 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
 
*peeked = 0;
do {
+   int _off = *off;
+
/* Again only user level code calls this function, so nothing
 * interrupt level will suddenly eat the receive_queue.
 *
 * Look at current nfs client by the way...
 * However, this function was correct in any case. 8)
 */
-   int _off = *off;
-
-   *last = (struct sk_buff *)queue;
spin_lock_irqsave(&queue->lock, cpu_flags);
-   skb_queue_walk(queue, skb) {
-   *last = skb;
-   if (flags & MSG_PEEK) {
-   if (_off >= skb->len && (skb->len || _off ||
-skb->peeked)) {
-

[RFC PATCH 2/3] udp: use a separate rx queue for packet reception

2017-05-06 Thread Paolo Abeni

Under UDP flood the sk_receive_queue spinlock is heavily contended.
This patch tries to reduce the contention on that lock by adding a
second receive queue to UDP sockets; recvmsg() looks first
in that queue and, only if it is empty, tries to fetch the data from
sk_receive_queue. The latter is spliced into the newly added
queue every time the receive path has to acquire the
sk_receive_queue lock.

The accounting of forward allocated memory is still protected with
the sk_receive_queue lock, so udp_rmem_release() needs to acquire
both locks when the forward deficit is flushed.

In specific scenarios we can end up acquiring and releasing the
sk_receive_queue lock multiple times; that will be covered by
the next patch.

Suggested-by: Eric Dumazet 
Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h   |   3 ++
 include/net/udp.h |   9 +---
 include/net/udplite.h |   2 +-
 net/ipv4/udp.c| 138 --
 net/ipv6/udp.c|   3 +-
 5 files changed, 131 insertions(+), 24 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 6cb4061..eaea63b 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -80,6 +80,9 @@ struct udp_sock {
struct sk_buff *skb,
int nhoff);
 
+   /* udp_recvmsg() tries to use this before splicing sk_receive_queue */
+   struct sk_buff_head reader_queue cacheline_aligned_in_smp;
+
/* This field is dirtied by udp_recvmsg() */
int forward_deficit;
 };
diff --git a/include/net/udp.h b/include/net/udp.h
index 3391dbd..1468dbd 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -249,13 +249,8 @@ void udp_destruct_sock(struct sock *sk);
 void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
 int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb);
 void udp_skb_destructor(struct sock *sk, struct sk_buff *skb);
-static inline struct sk_buff *
-__skb_recv_udp(struct sock *sk, unsigned int flags, int noblock, int *peeked,
-  int *off, int *err)
-{
-   return __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0),
-  udp_skb_destructor, peeked, off, err);
-}
+struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
+  int noblock, int *peeked, int *off, int *err);
 static inline struct sk_buff *skb_recv_udp(struct sock *sk, unsigned int flags,
   int noblock, int *err)
 {
diff --git a/include/net/udplite.h b/include/net/udplite.h
index ea34052..b7a18f6 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -26,8 +26,8 @@ static __inline__ int udplite_getfrag(void *from, char *to, int  offset,
 /* Designate sk as UDP-Lite socket */
 static inline int udplite_sk_init(struct sock *sk)
 {
+   udp_init_sock(sk);
udp_sk(sk)->pcflag = UDPLITE_BIT;
-   sk->sk_destruct = udp_destruct_sock;
return 0;
 }
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ea6e4cf..492c76b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1167,19 +1167,24 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 static void udp_rmem_release(struct sock *sk, int size, int partial)
 {
struct udp_sock *up = udp_sk(sk);
+   struct sk_buff_head *sk_queue;
int amt;
 
if (likely(partial)) {
up->forward_deficit += size;
size = up->forward_deficit;
if (size < (sk->sk_rcvbuf >> 2) &&
-   !skb_queue_empty(&sk->sk_receive_queue))
+   !skb_queue_empty(&up->reader_queue))
return;
} else {
size += up->forward_deficit;
}
up->forward_deficit = 0;
 
+   /* acquire the sk_receive_queue for fwd allocated memory scheduling */
+   sk_queue = &sk->sk_receive_queue;
+   spin_lock(&sk_queue->lock);
+
sk->sk_forward_alloc += size;
amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
sk->sk_forward_alloc -= amt;
@@ -1188,9 +1193,14 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
 
atomic_sub(size, &sk->sk_rmem_alloc);
+
+   /* this can save us from acquiring the rx queue lock on next receive */
+   skb_queue_splice_tail_init(sk_queue, &up->reader_queue);
+
+   spin_unlock(&sk_queue->lock);
 }
 
-/* Note: called with sk_receive_queue.lock held.
+/* Note: called with reader_queue.lock held.
  * Instead of using skb->truesize here, find a copy of it in skb->dev_scratch
  * This avoids a cache line miss while receive_queue lock is held.
  * Look at __udp_enqueue_schedule_skb() to find where this copy is done.
@@ -1306,10 +1316,12 @@ EXPORT_SYMBOL_GPL(__udp_enqueue_schedule_skb);
 void udp_destruct_sock(stru

Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit

2017-05-06 Thread Shubham Bansal

Okay. My mistake. I just checked the verify function.

Apologies.
Best,
Shubham Bansal


On Sun, May 7, 2017 at 1:57 AM, Shubham Bansal
 wrote:
> Thanks David.
>
> Hi all,
>
> I have two questions about the code at arch/arm64/net/bpf_jit_comp.c.
>
> 1. At line 708, " const u8 r1 = bpf2a64[BPF_REG_1]; /* r1: struct
> sk_buff *skb */ ".
> Why is this code using BPF_REG_1 before saving it? As far as I
> know, BPF_REG_1 holds the pointer to the BPF program context, and this
> code clearly overwrites that pointer, which makes it useless
> for later use. It looks like a bug.
>
> 2. At line 256, " emit(A64_LDR64(prg, tmp, r3), ctx); ".
> This line of code loads an element of an array of pointers,
> where r3 is the index into that array. Shouldn't the index be
> shifted left by 3 (that is, multiplied by 8) to get the right
> address in that array of pointers ?
>
> Apologies if any of the above questions are stupid to ask.
>
> Best,
> Shubham
> Best,
> Shubham Bansal
>
>
> On Sun, May 7, 2017 at 12:08 AM, David Miller  wrote:
>> From: Shubham Bansal 
>> Date: Sat, 6 May 2017 22:18:16 +0530
>>
>>> Hi Daniel,
>>>
>>> Thanks for the last reply about the testing of eBPF JIT.
>>>
>>> I have one issue though, I am not able to find what the BPF_ABS and
>>> BPF_IND instructions do exactly.
>>
>> They are not instructions, they are modifiers for the BPF_LD
>> instruction which indicate an SKB load is to be performed.
>>
>> You never need to ask what a BPF instruction does, it is clearly
>> defined in the BPF interpreter found in kernel/bpf/core.c
>>
>> Look for the case statement LD_ABS_W and friends in __bpf_prog_run().

Re: [RFC PATCH 0/3] udp: scalability improvements

2017-05-06 Thread Tom Herbert

On Sat, May 6, 2017 at 1:42 PM, Paolo Abeni  wrote:
> This patch series implements an idea suggested by Eric Dumazet to
> reduce the contention of the udp sk_receive_queue lock when the socket is
> under flood.
>
> An ancillary queue is added to the udp socket, and the socket always
> tries first to read packets from that queue. If it's empty, we splice
> the content from sk_receive_queue into the ancillary queue.
>
> The first patch introduces some helpers to keep the udp code small, and the
> following two implement the ancillary queue strategy. The code is split
> to hopefully help the reviewing process.
>
> The measured overall gain under UDP flood is in the 20-35% range, depending on
> the NUMA layout and the number of ingress queues used by the relevant NIC.
>
Certainly sounds good, but can you give real reproducible performance
numbers including the test that was run?

Tom

> On a single numa node host, the peak tput is now reached when the traffic
> targeting the udp socket uses multiple nic rx queues, while on current net-next
> the tput always decreases when moving from a single rx queue to multiple ones.
>
>
> Paolo Abeni (3):
>   net/sock: factor out dequeue/peek with offset code
>   udp: use a separate rx queue for packet reception
>   udp: keep the sk_receive_queue held when splicing
>
>  include/linux/skbuff.h |   7 +++
>  include/linux/udp.h|   3 +
>  include/net/sock.h |   4 +-
>  include/net/udp.h  |   9 +--
>  include/net/udplite.h  |   2 +-
>  net/core/datagram.c|  90 +++
>  net/ipv4/udp.c | 162 +++--
>  net/ipv6/udp.c |   3 +-
>  8 files changed, 211 insertions(+), 69 deletions(-)
>
> --
> 2.9.3
>
