Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal
Harald Welte  wrote:
> I believe _if_ one wants to use the approach of "hiding" eBPF behind
> iptables, then either
[..]
> b) you must introduce new 'tables', like an 'xdp' table which then has
>the notion of processing very early in processing, way before the
>normal filter table INPUT processing happens.

In nftables. the netdev ingress hook location could be used for this,
but right, iptables has no equivalent.

netdev ingress is interesting from an hw-offload point of view,
unlike all other netfilter hooks its tied to a specific network interface
rather than owned by the network namespace.

A rule like (yes i am making this up)
limit 1 byte/s

cannot be offloaded because it affects all packets going through
the system, i.e. you'd need to share state among all nics which i think
won't work :-)

Same goes for any other match/target that somehow contains (global)
state and was added to the 'classic' iptables hook points.
(exception: rule restricts interface via '-i foo').


Note well: "offloaded != ebpf" in this case.

I see no reasons why ebpf cannot be used in either iptables or
nftables.  How to get there is obviously a different beast.

For iptables, I think we should put it in maintenance mode and
focus on nftables, for many reasons outlined in other replies.

And how to best make use of ebpf+nftables

In ideal world, nftables would have used (e)bpf from the start.
But, well, its not an ideal world (iirc nft origins are just a bit
too old).

That doesn't mean that we can't leverage ebpf from nftables.
Its just a question of where it makes sense and where it doesn't,
f.e. i see no reason to replace c code with ebpf just 'because you can'.

Speedup?  Good argument.
Feature enhancements that could use ebpf programs? Another good
argument.

I guess there are a lot more.

So I'd like to second Haralds question.

What is the main goal?

For nftables, I believe most important ones are:
- make kernel keeper/owner of all rules
- allow userspace to learn of rule addition/deletion
- provide fast matching (no linear evaluation of rules,
native sets with jump and verdict maps)
- provide a single tool instead of ip/ip6/arp/ebtables
- unified ipv4/ipv6 matching
- backwards compat and/or translation infrastructure


But once these are reached, we will hopefully have more:
- offloading (hardware)
- speedup via JIT compilation
- feature enhancements such as matching arbitrary packet
contents

I suspect you see that ebpf might be a fit and/or help us with
all of these things.

So, once I understand what your goals are I might be better able
to see how nftables could fit into the picture, as you can see
I did a lot of guesswork :-)


Re: [PATCH 0/3] Allow 'ip rule' command to use protocol

2018-02-17 Thread Donald Sharp
Got it.  I'll send an update.

donald

On Sat, Feb 17, 2018 at 6:35 PM, David Ahern  wrote:
> On 2/17/18 5:47 AM, Donald Sharp wrote:
>> Fix iprule.c to use the actual `struct fib_rule_hdr` and to
>> allow the end user to see and use the protocol keyword
>> for rule manipulations.
>>
>> Donald Sharp (3):
>>   ip: Use the `struct fib_rule_hdr` for rules
>>   ip: Display ip rule protocol used
>>   ip: Allow rules to accept a specified protocol
>>
>>  include/linux/fib_rules.h |   2 +-
>>  ip/iprule.c   | 114 
>> ++
>>  2 files changed, 65 insertions(+), 51 deletions(-)
>>
>
> you are missing a patch to add protocol to iprule_list_flush_or_save so
> 'ip ru flush proto NAME' flushes all rules with that protocol.


[PATCH net-next4/5] ibmvnic: Make napi usage dynamic

2018-02-17 Thread Nathan Fontenot
In order to handle the number of rx sub crqs changing during a driver
reset, the ibmvnic driver also needs to update the number of napi.
To do this the code to init and free napi's is moved to their own
routines so they can be called during the reset process.

Signed-off-by: Nathan Fontenot 
---
 drivers/net/ethernet/ibm/ibmvnic.c |   67 
 1 file changed, 45 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index a3079d5c072c..f4261f03a522 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -730,6 +730,43 @@ static void ibmvnic_napi_disable(struct ibmvnic_adapter 
*adapter)
adapter->napi_enabled = false;
 }
 
+static int init_napi(struct ibmvnic_adapter *adapter)
+{
+   int i;
+
+   adapter->napi = kcalloc(adapter->req_rx_queues,
+   sizeof(struct napi_struct), GFP_KERNEL);
+   if (!adapter->napi)
+   return -ENOMEM;
+
+   for (i = 0; i < adapter->req_rx_queues; i++) {
+   netdev_dbg(adapter->netdev, "Adding napi[%d]\n", i);
+   netif_napi_add(adapter->netdev, >napi[i],
+  ibmvnic_poll, NAPI_POLL_WEIGHT);
+   }
+
+   return 0;
+}
+
+static void release_napi(struct ibmvnic_adapter *adapter)
+{
+   int i;
+
+   if (!adapter->napi)
+   return;
+
+   for (i = 0; i < adapter->num_active_rx_scrqs; i++) {
+   if (>napi[i]) {
+   netdev_dbg(adapter->netdev,
+  "Releasing napi[%d]\n", i);
+   netif_napi_del(>napi[i]);
+   }
+   }
+
+   kfree(adapter->napi);
+   adapter->napi = NULL;
+}
+
 static int ibmvnic_login(struct net_device *netdev)
 {
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
@@ -783,8 +820,6 @@ static int ibmvnic_login(struct net_device *netdev)
 
 static void release_resources(struct ibmvnic_adapter *adapter)
 {
-   int i;
-
release_vpd_data(adapter);
 
release_tx_pools(adapter);
@@ -793,16 +828,7 @@ static void release_resources(struct ibmvnic_adapter 
*adapter)
release_stats_token(adapter);
release_stats_buffers(adapter);
release_error_buffers(adapter);
-
-   if (adapter->napi) {
-   for (i = 0; i < adapter->req_rx_queues; i++) {
-   if (>napi[i]) {
-   netdev_dbg(adapter->netdev,
-  "Releasing napi[%d]\n", i);
-   netif_napi_del(>napi[i]);
-   }
-   }
-   }
+   release_napi(adapter);
 }
 
 static int set_link_state(struct ibmvnic_adapter *adapter, u8 link_state)
@@ -921,7 +947,7 @@ static int ibmvnic_get_vpd(struct ibmvnic_adapter *adapter)
 static int init_resources(struct ibmvnic_adapter *adapter)
 {
struct net_device *netdev = adapter->netdev;
-   int i, rc;
+   int rc;
 
rc = set_real_num_queues(netdev);
if (rc)
@@ -947,16 +973,10 @@ static int init_resources(struct ibmvnic_adapter *adapter)
}
 
adapter->map_id = 1;
-   adapter->napi = kcalloc(adapter->req_rx_queues,
-   sizeof(struct napi_struct), GFP_KERNEL);
-   if (!adapter->napi)
-   return -ENOMEM;
 
-   for (i = 0; i < adapter->req_rx_queues; i++) {
-   netdev_dbg(netdev, "Adding napi[%d]\n", i);
-   netif_napi_add(netdev, >napi[i], ibmvnic_poll,
-  NAPI_POLL_WEIGHT);
-   }
+   rc = init_napi(adapter);
+   if (rc)
+   return rc;
 
send_map_query(adapter);
 
@@ -1641,6 +1661,9 @@ static int do_reset(struct ibmvnic_adapter *adapter,
init_rx_pools(netdev);
init_tx_pools(netdev);
 
+   release_napi(adapter);
+   init_napi(adapter);
+
adapter->num_active_tx_scrqs = adapter->req_tx_queues;
adapter->num_active_rx_scrqs = adapter->req_rx_queues;
} else {



[PATCH net-next3/5] ibmvnic: Free and re-allocate scrqs when tx/rx scrqs change

2018-02-17 Thread Nathan Fontenot
When the driver resets it is possible that the number of tx/rx
sub-crqs can change. This patch handles this so that the driver does
not try to access non-existent sub-crqs.

Additionally, a parameter is added to release_sub_crqs() so that
we know if the h_call to free the sub-crq needs to be made. In
the reset path we have to do a reset of the main crq, which is
a free followed by a register of the main crq. The free of main
crq results in all of the sub crq's being free'ed. When updating
sub-crq count in the reset path we do not want to h_free the
sub-crqs, they are already free'ed.

Signed-off-by: Nathan Fontenot 
---
 drivers/net/ethernet/ibm/ibmvnic.c |   69 ++--
 1 file changed, 43 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 9cfbb20b5ac1..a3079d5c072c 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -90,7 +90,7 @@ MODULE_VERSION(IBMVNIC_DRIVER_VERSION);
 
 static int ibmvnic_version = IBMVNIC_INITIAL_VERSION;
 static int ibmvnic_remove(struct vio_dev *);
-static void release_sub_crqs(struct ibmvnic_adapter *);
+static void release_sub_crqs(struct ibmvnic_adapter *, int);
 static int ibmvnic_reset_crq(struct ibmvnic_adapter *);
 static int ibmvnic_send_crq_init(struct ibmvnic_adapter *);
 static int ibmvnic_reenable_crq_queue(struct ibmvnic_adapter *);
@@ -740,7 +740,7 @@ static int ibmvnic_login(struct net_device *netdev)
do {
if (adapter->renegotiate) {
adapter->renegotiate = false;
-   release_sub_crqs(adapter);
+   release_sub_crqs(adapter, 1);
 
reinit_completion(>init_done);
send_cap_queries(adapter);
@@ -1602,7 +1602,7 @@ static int do_reset(struct ibmvnic_adapter *adapter,
if (adapter->reset_reason == VNIC_RESET_CHANGE_PARAM ||
adapter->wait_for_reset) {
release_resources(adapter);
-   release_sub_crqs(adapter);
+   release_sub_crqs(adapter, 1);
release_crq_queue(adapter);
}
 
@@ -2241,24 +2241,27 @@ static int reset_sub_crq_queues(struct ibmvnic_adapter 
*adapter)
 }
 
 static void release_sub_crq_queue(struct ibmvnic_adapter *adapter,
- struct ibmvnic_sub_crq_queue *scrq)
+ struct ibmvnic_sub_crq_queue *scrq,
+ int do_h_free)
 {
struct device *dev = >vdev->dev;
long rc;
 
netdev_dbg(adapter->netdev, "Releasing sub-CRQ\n");
 
-   /* Close the sub-crqs */
-   do {
-   rc = plpar_hcall_norets(H_FREE_SUB_CRQ,
-   adapter->vdev->unit_address,
-   scrq->crq_num);
-   } while (rc == H_BUSY || H_IS_LONG_BUSY(rc));
+   if (do_h_free) {
+   /* Close the sub-crqs */
+   do {
+   rc = plpar_hcall_norets(H_FREE_SUB_CRQ,
+   adapter->vdev->unit_address,
+   scrq->crq_num);
+   } while (rc == H_BUSY || H_IS_LONG_BUSY(rc));
 
-   if (rc) {
-   netdev_err(adapter->netdev,
-  "Failed to release sub-CRQ %16lx, rc = %ld\n",
-  scrq->crq_num, rc);
+   if (rc) {
+   netdev_err(adapter->netdev,
+  "Failed to release sub-CRQ %16lx, rc = 
%ld\n",
+  scrq->crq_num, rc);
+   }
}
 
dma_unmap_single(dev, scrq->msg_token, 4 * PAGE_SIZE,
@@ -2326,12 +2329,12 @@ static struct ibmvnic_sub_crq_queue 
*init_sub_crq_queue(struct ibmvnic_adapter
return NULL;
 }
 
-static void release_sub_crqs(struct ibmvnic_adapter *adapter)
+static void release_sub_crqs(struct ibmvnic_adapter *adapter, int do_h_free)
 {
int i;
 
if (adapter->tx_scrq) {
-   for (i = 0; i < adapter->req_tx_queues; i++) {
+   for (i = 0; i < adapter->num_active_tx_scrqs; i++) {
if (!adapter->tx_scrq[i])
continue;
 
@@ -2344,7 +2347,8 @@ static void release_sub_crqs(struct ibmvnic_adapter 
*adapter)
adapter->tx_scrq[i]->irq = 0;
}
 
-   release_sub_crq_queue(adapter, adapter->tx_scrq[i]);
+   release_sub_crq_queue(adapter, adapter->tx_scrq[i],
+ do_h_free);
}
 
kfree(adapter->tx_scrq);
@@ -2352,7 +2356,7 @@ static void release_sub_crqs(struct ibmvnic_adapter 
*adapter)
}
 
if (adapter->rx_scrq) {
-   for (i = 0; i < 

[PATCH net-next5/5] ibmvnic: Allocate max queues stats buffers

2018-02-17 Thread Nathan Fontenot
To avoid losing any stats when the number of sub-crqs change, allocate
the max number of stats buffers so a stats buffer exists all possible
sub-crqs.

Signed-off-by: Nathan Fontenot 
---
 drivers/net/ethernet/ibm/ibmvnic.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index f4261f03a522..79a0b9bc5edb 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -361,14 +361,14 @@ static void release_stats_buffers(struct ibmvnic_adapter 
*adapter)
 static int init_stats_buffers(struct ibmvnic_adapter *adapter)
 {
adapter->tx_stats_buffers =
-   kcalloc(adapter->req_tx_queues,
+   kcalloc(IBMVNIC_MAX_QUEUES,
sizeof(struct ibmvnic_tx_queue_stats),
GFP_KERNEL);
if (!adapter->tx_stats_buffers)
return -ENOMEM;
 
adapter->rx_stats_buffers =
-   kcalloc(adapter->req_rx_queues,
+   kcalloc(IBMVNIC_MAX_QUEUES,
sizeof(struct ibmvnic_rx_queue_stats),
GFP_KERNEL);
if (!adapter->rx_stats_buffers)



[PATCH net-next2/5] ibmvnic: Move active sub-crq count settings

2018-02-17 Thread Nathan Fontenot
Inpreparation for using the active scrq count to track more active
resources, move the setting of the active count to after initialization
occurs in initial driver init and during driver reset.

Signed-off-by: Nathan Fontenot 
---
 drivers/net/ethernet/ibm/ibmvnic.c |   17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index ca2e3fbfd848..9cfbb20b5ac1 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -484,7 +484,6 @@ static void release_rx_pools(struct ibmvnic_adapter 
*adapter)
 
kfree(adapter->rx_pool);
adapter->rx_pool = NULL;
-   adapter->num_active_rx_scrqs = 0;
 }
 
 static int init_rx_pools(struct net_device *netdev)
@@ -509,8 +508,6 @@ static int init_rx_pools(struct net_device *netdev)
return -1;
}
 
-   adapter->num_active_rx_scrqs = 0;
-
for (i = 0; i < rxadd_subcrqs; i++) {
rx_pool = >rx_pool[i];
 
@@ -554,8 +551,6 @@ static int init_rx_pools(struct net_device *netdev)
rx_pool->next_free = 0;
}
 
-   adapter->num_active_rx_scrqs = rxadd_subcrqs;
-
return 0;
 }
 
@@ -624,7 +619,6 @@ static void release_tx_pools(struct ibmvnic_adapter 
*adapter)
 
kfree(adapter->tx_pool);
adapter->tx_pool = NULL;
-   adapter->num_active_tx_scrqs = 0;
 }
 
 static int init_tx_pools(struct net_device *netdev)
@@ -641,8 +635,6 @@ static int init_tx_pools(struct net_device *netdev)
if (!adapter->tx_pool)
return -1;
 
-   adapter->num_active_tx_scrqs = 0;
-
for (i = 0; i < tx_subcrqs; i++) {
tx_pool = >tx_pool[i];
 
@@ -690,8 +682,6 @@ static int init_tx_pools(struct net_device *netdev)
tx_pool->producer_index = 0;
}
 
-   adapter->num_active_tx_scrqs = tx_subcrqs;
-
return 0;
 }
 
@@ -975,6 +965,10 @@ static int init_resources(struct ibmvnic_adapter *adapter)
return rc;
 
rc = init_tx_pools(netdev);
+
+   adapter->num_active_tx_scrqs = adapter->req_tx_queues;
+   adapter->num_active_rx_scrqs = adapter->req_rx_queues;
+
return rc;
 }
 
@@ -1646,6 +1640,9 @@ static int do_reset(struct ibmvnic_adapter *adapter,
release_tx_pools(adapter);
init_rx_pools(netdev);
init_tx_pools(netdev);
+
+   adapter->num_active_tx_scrqs = adapter->req_tx_queues;
+   adapter->num_active_rx_scrqs = adapter->req_rx_queues;
} else {
rc = reset_tx_pools(adapter);
if (rc)



[PATCH net-next1/5] ibmvnic: Rename active queue count variables

2018-02-17 Thread Nathan Fontenot
Rename the tx/rx active pool variables to be tx/rx active scrq
counts. The tx/rx pools are per sub-crq so this is a more appropriate
name. This also is a preparatory step for using thiese variables
for handling updates to sub-crqs and napi based on the active
count.

Signed-off-by: Nathan Fontenot 
---
 drivers/net/ethernet/ibm/ibmvnic.c |   16 
 drivers/net/ethernet/ibm/ibmvnic.h |4 ++--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 27447260215d..ca2e3fbfd848 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -461,7 +461,7 @@ static void release_rx_pools(struct ibmvnic_adapter 
*adapter)
if (!adapter->rx_pool)
return;
 
-   for (i = 0; i < adapter->num_active_rx_pools; i++) {
+   for (i = 0; i < adapter->num_active_rx_scrqs; i++) {
rx_pool = >rx_pool[i];
 
netdev_dbg(adapter->netdev, "Releasing rx_pool[%d]\n", i);
@@ -484,7 +484,7 @@ static void release_rx_pools(struct ibmvnic_adapter 
*adapter)
 
kfree(adapter->rx_pool);
adapter->rx_pool = NULL;
-   adapter->num_active_rx_pools = 0;
+   adapter->num_active_rx_scrqs = 0;
 }
 
 static int init_rx_pools(struct net_device *netdev)
@@ -509,7 +509,7 @@ static int init_rx_pools(struct net_device *netdev)
return -1;
}
 
-   adapter->num_active_rx_pools = 0;
+   adapter->num_active_rx_scrqs = 0;
 
for (i = 0; i < rxadd_subcrqs; i++) {
rx_pool = >rx_pool[i];
@@ -554,7 +554,7 @@ static int init_rx_pools(struct net_device *netdev)
rx_pool->next_free = 0;
}
 
-   adapter->num_active_rx_pools = rxadd_subcrqs;
+   adapter->num_active_rx_scrqs = rxadd_subcrqs;
 
return 0;
 }
@@ -613,7 +613,7 @@ static void release_tx_pools(struct ibmvnic_adapter 
*adapter)
if (!adapter->tx_pool)
return;
 
-   for (i = 0; i < adapter->num_active_tx_pools; i++) {
+   for (i = 0; i < adapter->num_active_tx_scrqs; i++) {
netdev_dbg(adapter->netdev, "Releasing tx_pool[%d]\n", i);
tx_pool = >tx_pool[i];
kfree(tx_pool->tx_buff);
@@ -624,7 +624,7 @@ static void release_tx_pools(struct ibmvnic_adapter 
*adapter)
 
kfree(adapter->tx_pool);
adapter->tx_pool = NULL;
-   adapter->num_active_tx_pools = 0;
+   adapter->num_active_tx_scrqs = 0;
 }
 
 static int init_tx_pools(struct net_device *netdev)
@@ -641,7 +641,7 @@ static int init_tx_pools(struct net_device *netdev)
if (!adapter->tx_pool)
return -1;
 
-   adapter->num_active_tx_pools = 0;
+   adapter->num_active_tx_scrqs = 0;
 
for (i = 0; i < tx_subcrqs; i++) {
tx_pool = >tx_pool[i];
@@ -690,7 +690,7 @@ static int init_tx_pools(struct net_device *netdev)
tx_pool->producer_index = 0;
}
 
-   adapter->num_active_tx_pools = tx_subcrqs;
+   adapter->num_active_tx_scrqs = tx_subcrqs;
 
return 0;
 }
diff --git a/drivers/net/ethernet/ibm/ibmvnic.h 
b/drivers/net/ethernet/ibm/ibmvnic.h
index fe21a6e2ddae..c6d0b4afe899 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.h
+++ b/drivers/net/ethernet/ibm/ibmvnic.h
@@ -1091,8 +1091,8 @@ struct ibmvnic_adapter {
u64 opt_rxba_entries_per_subcrq;
__be64 tx_rx_desc_req;
u8 map_id;
-   u64 num_active_rx_pools;
-   u64 num_active_tx_pools;
+   u64 num_active_rx_scrqs;
+   u64 num_active_tx_scrqs;
 
struct tasklet_struct tasklet;
enum vnic_state state;



[PATCH net-next 0/5] ibmvnic: Make driver resources dynamic

2018-02-17 Thread Nathan Fontenot
The ibmvnic driver needs to be able to handle the number of tx/rx
sub-crqs changing during a reset of the driver. To do this several
changes need to be made. First the num_active_[tx|rx]_pools
counters need to be re-named to num_active_[tc|rx]_scrqs, and
updated after resource initialization.

With this change we can now release and init the sub crqs and napi
(for rx sub crqs) when the number of sub crqs change.

Lastly, the stats buffer allocation is updated to always allocate
the maximum number of sub-crqs count of stats buffers.

-Nathan
---

Nathan Fontenot (5):
  ibmvnic: Rename active queue count variables
  ibmvnic: Move active sub-crq count settings
  ibmvnic: Free and re-allocate scrqs when tx/rx scrqs change
  ibmvnic: Make napi usage dynamic
  ibmvnic: Allocate max queues stats buffers


 drivers/net/ethernet/ibm/ibmvnic.c |  161 ++--
 drivers/net/ethernet/ibm/ibmvnic.h |4 -
 2 files changed, 101 insertions(+), 64 deletions(-)



Re: [PATCH 0/3] Allow 'ip rule' command to use protocol

2018-02-17 Thread David Ahern
On 2/17/18 5:47 AM, Donald Sharp wrote:
> Fix iprule.c to use the actual `struct fib_rule_hdr` and to
> allow the end user to see and use the protocol keyword
> for rule manipulations.
> 
> Donald Sharp (3):
>   ip: Use the `struct fib_rule_hdr` for rules
>   ip: Display ip rule protocol used
>   ip: Allow rules to accept a specified protocol
> 
>  include/linux/fib_rules.h |   2 +-
>  ip/iprule.c   | 114 
> ++
>  2 files changed, 65 insertions(+), 51 deletions(-)
> 

you are missing a patch to add protocol to iprule_list_flush_or_save so
'ip ru flush proto NAME' flushes all rules with that protocol.


Re: [PATCH v5 02/14] net: pch_gbe: Pull PHY GPIO handling out of Minnow code

2018-02-17 Thread Andrew Lunn
> Note that this is a driver which is already in mainline, and I didn't
> write it. Claiming that *I* am doing this all wrong is a bit of a
> stretch - all this patch does is make small changes to some existing
> code, which only tangentially relates to a PHY driver, such that it
> ceases to be specific to a single platform.

Hi Paul

I would so you are doing it all wrong for the reset GPIO.

> Even if that is true, rewriting the driver's PHY handling would be a
> very separate change to the changes this series make which allow this
> driver to work on a platform besides the Minnowboard. The *only* thing
> this series does relating to the PHY is allow the reset GPIO to be
> handled properly - rewriting the existing PHY handling is beyond it's
> scope.

Well, you are adding a device tree binding, which needs to be
supported forever. This is going to make things messy in the future
when you do such a cleanup that you follow the PHY binding, in that
you have to handle both what you add here, and the official PHY
binding.

I would prefer that for the moment, you drop the PHY binding patches
in this series. That is what i object to the most. Adding an MDIO
driver and using the standard PHY driver for this PHY is all
internal. You can change that anytime. But adding a binding means an
ABI.

Andrew


Re: [PATCH v5 02/14] net: pch_gbe: Pull PHY GPIO handling out of Minnow code

2018-02-17 Thread Paul Burton
Hi Andrew,

On Sat, Feb 17, 2018 at 11:29:33PM +0100, Andrew Lunn wrote:
> On Sat, Feb 17, 2018 at 12:10:25PM -0800, Paul Burton wrote:
> > The MIPS Boston development board uses the Intel EG20T Platform
> > Controller Hub, including its gigabit ethernet controller, and requires
> > that its RTL8211E PHY be reset much like the Minnow platform. Pull the
> > PHY reset GPIO handling out of Minnow-specific code such that it can be
> > shared by later patches.
> 
> Hi Paul
> 
> I'm i right in saying the driver currently supports the Atheros AT8031
> PHY? The same phy which is supported in drivers/net/phy/at803x.c?

It looks like the driver does contain some code relating to that PHY,
but it's not the one I'm using with the MIPS Boston board - there we
have a Realtek RTL8211E (as mentioned in the commit message) which is
working fine alongside this pch_gbe driver too.

> If so, i think you are doing this all wrong.

Note that this is a driver which is already in mainline, and I didn't
write it. Claiming that *I* am doing this all wrong is a bit of a
stretch - all this patch does is make small changes to some existing
code, which only tangentially relates to a PHY driver, such that it
ceases to be specific to a single platform.

> You would be much better off throwing away pch_gbe_phy.c and write a
> proper MDIO driver. You then get the PHY driver for free, and the MDIO
> code could will handle your GPIO for you, in the standardised way.

Even if that is true, rewriting the driver's PHY handling would be a
very separate change to the changes this series make which allow this
driver to work on a platform besides the Minnowboard. The *only* thing
this series does relating to the PHY is allow the reset GPIO to be
handled properly - rewriting the existing PHY handling is beyond it's
scope.

Note that I do have various cleanups to the driver beyond this series
which I intend to submit after it is functional for my system[1], so I
am not saying that I don't care about improving the driver. But please,
let's do one thing at a time.

Thanks,
Paul

[1] 
https://git.linux-mips.org/cgit/paul/linux.git/log/?h=up417-boston-eth-cleanup


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal
Florian Westphal  wrote:
> David Miller  wrote:
> > From: Florian Westphal 
> > Date: Fri, 16 Feb 2018 17:14:08 +0100
> > 
> > > Any particular reason why translating iptables rather than nftables
> > > (it should be possible to monitor the nftables changes that are
> > >  announced by kernel and act on those)?
> > 
> > As Daniel said, iptables is by far the most deployed of the two
> > technologies.  Therefore it provides the largest environment for
> > testing and coverage.
> 
> Right, but the approach of hooking old blob format comes with
> lots of limitations that were meant to be resolved with a netlink based
> interface which places kernel in a position to mediate all transactions
> to the rule database (which isn't fixable with old setsockopt format).
> 
> As all programs call iptables(-restore) or variants translation can
> be done in userspace to nftables so api spoken is nfnetlink.
> Such a translator already exists and can handle some cases already:
> 
> nft flush ruleset
> nft list ruleset | wc -l
> 0
> xtables-compat-multi iptables -A INPUT -i eth0 -m conntrack --ctstate 
> ESTABLISHED,RELATED -j ACCEPT
> xtables-compat-multi iptables -A REJECT_LOG -i eth0 -p tcp --tcp-flags 
> SYN,ACK SYN --dport 22:80 -m limit --limit 1/sec -j LOG --log-prefix 
> "RejectTCPConnectReq"

to be fair, for these two I had to use
$(xtables-compat-multi iptables-translate -A INPUT -i eth0 -m conntrack 
--ctstate ESTABLISHED,RELATED -j ACCEPT)

Reason is that the 'iptables-translate' part nowadays has way more
translations available (nft gained many features since the
iptables-compat layer was added).

If given appropriate prioriy however it should be pretty
trivial to make the 'translate' descriptions available in
the 'direct' version, we already have function in libnftables
to execute/run a command directly from a buffer so this would
not even need fork/execve overhead (although I don't think
its a big concern).

> (f.e. nftables misses some selinux matches/targets for netlabel so we 
> obviously
> can't translate this, same for ipsec sa/policy matching -- but this isn't
> impossible to resolve).

I am working on some poc code for the sa/policy thing now.


Re: [PATCH v5 03/14] dt-bindings: net: Document Intel pch_gbe binding

2018-02-17 Thread Andrew Lunn
> @@ -0,0 +1,25 @@
> +Intel Platform Controller Hub (PCH) GigaBit Ethernet (GBE)
> +
> +Required properties:
> +- compatible:Should be the PCI vendor & device ID, eg. 
> "pci8086,8802".
> +- reg:   Should be a PCI device number as specified by 
> the PCI bus
> + binding to IEEE Std 1275-1994.
> +- phy-reset-gpios:   Should be a GPIO list containing a single GPIO that
> + resets the attached PHY when active.
> +

Hi Paul

Please see Documentation/devicetree/bindings/net/phy.txt. In
particular:

reset-gpios: The GPIO phandle and specifier for the PHY reset signal.

You should be conforming to the existing binding.

   Andrew


Re: [PATCH v5 02/14] net: pch_gbe: Pull PHY GPIO handling out of Minnow code

2018-02-17 Thread Andrew Lunn
On Sat, Feb 17, 2018 at 12:10:25PM -0800, Paul Burton wrote:
> The MIPS Boston development board uses the Intel EG20T Platform
> Controller Hub, including its gigabit ethernet controller, and requires
> that its RTL8211E PHY be reset much like the Minnow platform. Pull the
> PHY reset GPIO handling out of Minnow-specific code such that it can be
> shared by later patches.

Hi Paul

I'm i right in saying the driver currently supports the Atheros AT8031
PHY? The same phy which is supported in drivers/net/phy/at803x.c?

If so, i think you are doing this all wrong. You would be much better
off throwing away pch_gbe_phy.c and write a proper MDIO driver. You
then get the PHY driver for free, and the MDIO code could will handle
your GPIO for you, in the standardised way.

 Andrew


Re: BUG: sleeping function called from invalid context at net/core/sock.c:LINE (3)

2018-02-17 Thread Kirill Tkhai
On 17.02.2018 11:15, Dmitry Vyukov wrote:
> On Sat, Feb 17, 2018 at 4:00 AM, syzbot
>  wrote:
>> Hello,
>>
>> syzbot hit the following crash on net-next commit
>> 65bd449c32c2745df61913ab54087e77f9d9b70d (Fri Feb 16 20:26:35 2018 +)
>> Merge branch 'tipc-de-generealize-topology-server'
> 
> +tipc maintainers

This looks to be caused by commit 0ef897be12b8
"tipc: separate topology server listener socket from subcsriber sockets"

Thanks,
Kirill


Re: [PATCH v5 01/14] net: pch_gbe: Mark Minnow PHY reset GPIO active low

2018-02-17 Thread Andrew Lunn
> @@ -2700,10 +2701,10 @@ static int pch_gbe_minnow_platform_init(struct 
> pci_dev *pdev)
>   return ret;
>   }
>  
> - gpio_set_value(gpio, 0);
> - usleep_range(1250, 1500);
>   gpio_set_value(gpio, 1);
>   usleep_range(1250, 1500);
> + gpio_set_value(gpio, 0);
> + usleep_range(1250, 1500);

Hi Paul

It would be better to rewrite and use the gpiod_ API. The GPIO core
would then handle active low/active high.

  Andrew


[PATCH v5 13/14] ptp: pch: Allow build on MIPS platforms

2018-02-17 Thread Paul Burton
Allow the ptp_pch driver to be built on MIPS platforms in preparation
for use on the MIPS Boston board.

Signed-off-by: Paul Burton 
Acked-by: Richard Cochran 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org

---

Changes in v5:
- Newly included in this series to satisfy Kconfig.

Changes in v4: None
Changes in v3: None
Changes in v2: None

 drivers/ptp/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index a21ad10d613c..8618982ab96a 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -90,7 +90,7 @@ config DP83640_PHY
 
 config PTP_1588_CLOCK_PCH
tristate "Intel PCH EG20T as PTP clock"
-   depends on X86_32 || COMPILE_TEST
+   depends on X86_32 || MIPS || COMPILE_TEST
depends on HAS_IOMEM && NET
imply PTP_1588_CLOCK
help
-- 
2.16.1



[PATCH v5 11/14] net: pch_gbe: Ensure DMA is ordered with descriptor writes

2018-02-17 Thread Paul Burton
On weakly ordered systems writes to the RX or TX descriptors may be
reordered with the write to the DMA control register that enables DMA.
If this happens then the device may see descriptors in an intermediate
& invalid state, leading to incorrect behaviour. Add barriers to ensure
that DMA is enabled only after all writes to the descriptors.

Signed-off-by: Paul Burton 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org

---

Changes in v5:
- New patch.

Changes in v4: None
Changes in v3: None
Changes in v2: None

 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 4354842b9b7e..8e3ad7dcef0b 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -1260,6 +1260,9 @@ static void pch_gbe_tx_queue(struct pch_gbe_adapter 
*adapter,
tx_desc->tx_frame_ctrl = (frame_ctrl);
tx_desc->gbec_status = (DSC_INIT16);
 
+   /* Ensure writes to descriptors complete before DMA begins */
+   mmiowb();
+
if (unlikely(++ring_num == tx_ring->count))
ring_num = 0;
 
@@ -1961,6 +1964,9 @@ int pch_gbe_up(struct pch_gbe_adapter *adapter)
pch_gbe_alloc_rx_buffers(adapter, rx_ring, rx_ring->count);
adapter->tx_queue_len = netdev->tx_queue_len;
 
+   /* Ensure writes to descriptors complete before DMA begins */
+   mmiowb();
+
pch_gbe_enable_dma_tx(>hw);
pch_gbe_enable_dma_rx(>hw);
pch_gbe_enable_mac_rx(>hw);
-- 
2.16.1



[PATCH v5 07/14] net: pch_gbe: Fix handling of TX padding

2018-02-17 Thread Paul Burton
The ethernet controller found in the Intel EG20T Platform Controller
Hub requires that we place 2 bytes of padding between the ethernet
header & the packet payload. Our pch_gbe driver handles this by copying
packets to be transmitted to a temporary struct skb with the padding
bytes inserted, however it sets the length of this temporary skb to
equal that of the original, without the 2 padding bytes, and then uses
this length as that of the memory to map for DMA.

This is problematic on systems that don't have cache-coherent DMA, since
if the length of the original buffer is either a multiple of the
system's cache line size or one less than such a multiple then the size
we provide to dma_map_single() will not cover the last byte or two of
the data when rounded up to a cache line boundary. This may result in us
transmitting corrupt data in the last one or two bytes of the packet,
depending upon its length.

Fix this by setting the length of tmp_skb to include the 2 padding
bytes, which is actually the length of the data it holds. This is then
assigned to buffer_info->length & provided to dma_map_single() which
will operate on all of the data as desired. The EG20T datasheet
specifies that the padding bytes should not be included in the length
stored in the TX descriptor, so we switch that to use the length of the
original skb.

Whilst modifying this code we switch to using PCH_GBE_DMA_PADDING rather
than the magic number 2 to specify the size of the padding, making it
clearer what the code is doing, and fix a typo in the comment indicating
that padding is inserted.

Signed-off-by: Paul Burton 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org
---

Changes in v5: None
Changes in v4: None
Changes in v3: None
Changes in v2: None

 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 77f7fbd98e8f..60e91c0fc98b 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -1223,13 +1223,12 @@ static void pch_gbe_tx_queue(struct pch_gbe_adapter 
*adapter,
buffer_info = _ring->buffer_info[ring_num];
tmp_skb = buffer_info->skb;
 
-   /* [Header:14][payload] ---> [Header:14][paddong:2][payload]*/
+   /* [Header:14][payload] ---> [Header:14][padding:2][payload] */
memcpy(tmp_skb->data, skb->data, ETH_HLEN);
-   tmp_skb->data[ETH_HLEN] = 0x00;
-   tmp_skb->data[ETH_HLEN + 1] = 0x00;
-   tmp_skb->len = skb->len;
-   memcpy(_skb->data[ETH_HLEN + 2], >data[ETH_HLEN],
-  (skb->len - ETH_HLEN));
+   memset(_skb->data[ETH_HLEN], 0, PCH_GBE_DMA_PADDING);
+   memcpy(_skb->data[ETH_HLEN + PCH_GBE_DMA_PADDING],
+  >data[ETH_HLEN], skb->len - ETH_HLEN);
+   tmp_skb->len = skb->len + PCH_GBE_DMA_PADDING;
/*-- Set Buffer information --*/
buffer_info->length = tmp_skb->len;
buffer_info->dma = dma_map_single(>pdev->dev, tmp_skb->data,
@@ -1248,8 +1247,8 @@ static void pch_gbe_tx_queue(struct pch_gbe_adapter 
*adapter,
/*-- Set Tx descriptor --*/
tx_desc = PCH_GBE_TX_DESC(*tx_ring, ring_num);
tx_desc->buffer_addr = (buffer_info->dma);
-   tx_desc->length = (tmp_skb->len);
-   tx_desc->tx_words_eob = ((tmp_skb->len + 3));
+   tx_desc->length = (skb->len);
+   tx_desc->tx_words_eob = ((skb->len + 3));
tx_desc->tx_frame_ctrl = (frame_ctrl);
tx_desc->gbec_status = (DSC_INIT16);
 
-- 
2.16.1



[PATCH v5 05/14] net: pch_gbe: Always reset PHY along with MAC

2018-02-17 Thread Paul Burton
On the MIPS Boston development board, the EG20T MAC does not report
receiving the RX clock from the (RGMII) RTL8211E PHY unless the PHY is
reset at the same time as the MAC. Since the pch_gbe driver resets the
MAC a number of times - twice during probe, and when taking down the
network interface - we need to reset the PHY at all the same times. Do
that from pch_gbe_mac_reset_hw which is used to reset the MAC in all
cases.

Signed-off-by: Paul Burton 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org
---

Changes in v5: None
Changes in v4: None
Changes in v3: None
Changes in v2: None

 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 11e8ced4a0f4..90e795d5cc1c 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -380,10 +380,13 @@ static void pch_gbe_mac_reset_hw(struct pch_gbe_hw *hw)
 {
/* Read the MAC address. and store to the private data */
pch_gbe_mac_read_mac_addr(hw);
+   pch_gbe_phy_set_reset(hw, 1);
iowrite32(PCH_GBE_ALL_RST, >reg->RESET);
 #ifdef PCH_GBE_MAC_IFOP_RGMII
iowrite32(PCH_GBE_MODE_GMII_ETHER, >reg->MODE);
 #endif
+   pch_gbe_phy_set_reset(hw, 0);
+   usleep_range(1250, 1500);
pch_gbe_wait_clr_bit(>reg->RESET, PCH_GBE_ALL_RST);
/* Setup the receive addresses */
pch_gbe_mac_mar_set(hw, hw->mac.addr, 0);
-- 
2.16.1



[PATCH v5 03/14] dt-bindings: net: Document Intel pch_gbe binding

2018-02-17 Thread Paul Burton
Introduce documentation for a device tree binding for the Intel Platform
Controller Hub (PCH) GigaBit Ethernet (GBE) device. Although this is a
PCIe device & thus largely auto-detectable, this binding will be used to
provide the driver with the PHY reset GPIO.

Signed-off-by: Paul Burton 
Cc: David S. Miller 
Cc: Mark Rutland 
Cc: Rob Herring 
Cc: devicet...@vger.kernel.org
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org

---

Changes in v5:
- Use standard gpio & ethernet node names in example.
- Remove bus number from example unit addresses.

Changes in v4: None
Changes in v3:
- New patch.

Changes in v2: None

 Documentation/devicetree/bindings/net/pch_gbe.txt | 25 +++
 1 file changed, 25 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/pch_gbe.txt

diff --git a/Documentation/devicetree/bindings/net/pch_gbe.txt 
b/Documentation/devicetree/bindings/net/pch_gbe.txt
new file mode 100644
index ..cff2687e6e75
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/pch_gbe.txt
@@ -0,0 +1,25 @@
+Intel Platform Controller Hub (PCH) GigaBit Ethernet (GBE)
+
+Required properties:
+- compatible:  Should be the PCI vendor & device ID, eg. 
"pci8086,8802".
+- reg: Should be a PCI device number as specified by the PCI 
bus
+   binding to IEEE Std 1275-1994.
+- phy-reset-gpios: Should be a GPIO list containing a single GPIO that
+   resets the attached PHY when active.
+
+Example:
+
+   ethernet@0,1 {
+   compatible = "pci8086,8802";
+   reg = <0x00020100 0 0 0 0>;
+   phy-reset-gpios = <_gpio 6
+  GPIO_ACTIVE_LOW>;
+   };
+
+   eg20t_gpio: gpio@0,2 {
+   compatible = "pci8086,8803";
+   reg = <0x00020200 0 0 0 0>;
+
+   gpio-controller;
+   #gpio-cells = <2>;
+   };
-- 
2.16.1



[PATCH v5 08/14] net: pch_gbe: Fold pch_gbe_setup_[rt]ctl into pch_gbe_configure_[rt]x

2018-02-17 Thread Paul Burton
The pch_gbe driver splits configuration of the receive path between
pch_gbe_setup_rctl() & pch_gbe_configure_rx(), which are always called
together and in that order. The split between the two functions seems
somewhat arbitrary, as both are configuring registers for the receive
path. Fold pch_gbe_setup_rctl() into pch_gbe_configure_rx() such that
callers only need to call one function to configure the receive path
registers.

Similarly configuration of transmit path registers is split between
pch_gbe_setup_tctl() & pch_gbe_configure_tx(), and we fold the former
into the latter in the same way.

Signed-off-by: Paul Burton 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org

---

Changes in v5:
- New patch.

Changes in v4: None
Changes in v3: None
Changes in v2: None

 .../net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c   | 52 ++
 1 file changed, 13 insertions(+), 39 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 60e91c0fc98b..2d6980603ee4 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -831,39 +831,26 @@ static void pch_gbe_irq_enable(struct pch_gbe_adapter 
*adapter)
   ioread32(>reg->INT_EN));
 }
 
-
-
 /**
- * pch_gbe_setup_tctl - configure the Transmit control registers
+ * pch_gbe_configure_tx - Configure Transmit Unit after Reset
  * @adapter:  Board private structure
  */
-static void pch_gbe_setup_tctl(struct pch_gbe_adapter *adapter)
+static void pch_gbe_configure_tx(struct pch_gbe_adapter *adapter)
 {
struct pch_gbe_hw *hw = >hw;
-   u32 tx_mode, tcpip;
+   u32 tdba, tdlen, dctrl, tx_mode, tcpip;
 
tx_mode = PCH_GBE_TM_LONG_PKT |
PCH_GBE_TM_ST_AND_FD |
PCH_GBE_TM_SHORT_PKT |
PCH_GBE_TM_TH_TX_STRT_8 |
-   PCH_GBE_TM_TH_ALM_EMP_4 | PCH_GBE_TM_TH_ALM_FULL_8;
-
+   PCH_GBE_TM_TH_ALM_EMP_4 |
+   PCH_GBE_TM_TH_ALM_FULL_8;
iowrite32(tx_mode, >reg->TX_MODE);
 
tcpip = ioread32(>reg->TCPIP_ACC);
tcpip |= PCH_GBE_TX_TCPIPACC_EN;
iowrite32(tcpip, >reg->TCPIP_ACC);
-   return;
-}
-
-/**
- * pch_gbe_configure_tx - Configure Transmit Unit after Reset
- * @adapter:  Board private structure
- */
-static void pch_gbe_configure_tx(struct pch_gbe_adapter *adapter)
-{
-   struct pch_gbe_hw *hw = >hw;
-   u32 tdba, tdlen, dctrl;
 
netdev_dbg(adapter->netdev, "dma addr = 0x%08llx  size = 0x%08x\n",
   (unsigned long long)adapter->tx_ring->dma,
@@ -883,35 +870,25 @@ static void pch_gbe_configure_tx(struct pch_gbe_adapter 
*adapter)
 }
 
 /**
- * pch_gbe_setup_rctl - Configure the receive control registers
+ * pch_gbe_configure_rx - Configure Receive Unit after Reset
  * @adapter:  Board private structure
  */
-static void pch_gbe_setup_rctl(struct pch_gbe_adapter *adapter)
+static void pch_gbe_configure_rx(struct pch_gbe_adapter *adapter)
 {
struct pch_gbe_hw *hw = >hw;
-   u32 rx_mode, tcpip;
-
-   rx_mode = PCH_GBE_ADD_FIL_EN | PCH_GBE_MLT_FIL_EN |
-   PCH_GBE_RH_ALM_EMP_4 | PCH_GBE_RH_ALM_FULL_4 | PCH_GBE_RH_RD_TRG_8;
+   u32 rdba, rdlen, rxdma, rx_mode, tcpip;
 
+   rx_mode = PCH_GBE_ADD_FIL_EN |
+ PCH_GBE_MLT_FIL_EN |
+ PCH_GBE_RH_ALM_EMP_4 |
+ PCH_GBE_RH_ALM_FULL_4 |
+ PCH_GBE_RH_RD_TRG_8;
iowrite32(rx_mode, >reg->RX_MODE);
 
tcpip = ioread32(>reg->TCPIP_ACC);
-
tcpip |= PCH_GBE_RX_TCPIPACC_OFF;
tcpip &= ~PCH_GBE_RX_TCPIPACC_EN;
iowrite32(tcpip, >reg->TCPIP_ACC);
-   return;
-}
-
-/**
- * pch_gbe_configure_rx - Configure Receive Unit after Reset
- * @adapter:  Board private structure
- */
-static void pch_gbe_configure_rx(struct pch_gbe_adapter *adapter)
-{
-   struct pch_gbe_hw *hw = >hw;
-   u32 rdba, rdlen, rxdma;
 
netdev_dbg(adapter->netdev, "dma adr = 0x%08llx  size = 0x%08x\n",
   (unsigned long long)adapter->rx_ring->dma,
@@ -1954,9 +1931,7 @@ int pch_gbe_up(struct pch_gbe_adapter *adapter)
/* hardware has been reset, we need to reload some things */
pch_gbe_set_multi(netdev);
 
-   pch_gbe_setup_tctl(adapter);
pch_gbe_configure_tx(adapter);
-   pch_gbe_setup_rctl(adapter);
pch_gbe_configure_rx(adapter);
 
err = pch_gbe_request_irq(adapter);
@@ -2486,7 +2461,6 @@ static int __pch_gbe_suspend(struct pci_dev *pdev)
pch_gbe_down(adapter);
if (wufc) {
pch_gbe_set_multi(netdev);
-   pch_gbe_setup_rctl(adapter);
pch_gbe_configure_rx(adapter);
pch_gbe_set_rgmii_ctrl(adapter, hw->mac.link_speed,
hw->mac.link_duplex);
-- 
2.16.1



[PATCH v5 10/14] net: pch_gbe: Disable TX DMA whilst configuring descriptors

2018-02-17 Thread Paul Burton
The pch_gbe driver enables TX DMA the first time we call
pch_gbe_configure_tx() and never disables it again, even if we
reconfigure the device & modify the transmit descriptor ring. This seems
unsafe, since the device may continue accessing descriptors whilst they
are in an unpredictable & possibly invalid state - especially on systems
where the CPUs writes to the descriptors is not coherent with DMA.

In the RX path pch_gbe_configure_rx() disables DMA before configuring
the descriptor pointers & before we set up the descriptors, then
pch_gbe_up() calls pch_gbe_enable_dma_rx() to enable DMA again after the
descriptors have been configured. Here we copy that same scheme for the
TX path - pch_gbe_configure_tx() calls pch_gbe_disable_dma_tx() to
disable DMA, and then after the descriptors have been configured
pch_gbe_up() calls pch_gbe_enable_dma_tx() to enable DMA. This should
ensure that the device doesn't begin reading descriptors before we have
configured them.

Signed-off-by: Paul Burton 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org

---

Changes in v5:
- New patch.

Changes in v4: None
Changes in v3: None
Changes in v2: None

 .../net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c   | 29 +-
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index b6cc4a34ed89..4354842b9b7e 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -851,6 +851,24 @@ static void pch_gbe_enable_dma_rx(struct pch_gbe_hw *hw)
iowrite32(rxdma, >reg->DMA_CTRL);
 }
 
+static void pch_gbe_disable_dma_tx(struct pch_gbe_hw *hw)
+{
+   u32 rxdma;
+
+   rxdma = ioread32(>reg->DMA_CTRL);
+   rxdma &= ~PCH_GBE_TX_DMA_EN;
+   iowrite32(rxdma, >reg->DMA_CTRL);
+}
+
+static void pch_gbe_enable_dma_tx(struct pch_gbe_hw *hw)
+{
+   u32 rxdma;
+
+   rxdma = ioread32(>reg->DMA_CTRL);
+   rxdma |= PCH_GBE_TX_DMA_EN;
+   iowrite32(rxdma, >reg->DMA_CTRL);
+}
+
 /**
  * pch_gbe_configure_tx - Configure Transmit Unit after Reset
  * @adapter:  Board private structure
@@ -858,7 +876,7 @@ static void pch_gbe_enable_dma_rx(struct pch_gbe_hw *hw)
 static void pch_gbe_configure_tx(struct pch_gbe_adapter *adapter)
 {
struct pch_gbe_hw *hw = >hw;
-   u32 tdba, tdlen, dctrl, tx_mode, tcpip;
+   u32 tdba, tdlen, tx_mode, tcpip;
 
tx_mode = PCH_GBE_TM_LONG_PKT |
PCH_GBE_TM_ST_AND_FD |
@@ -876,17 +894,14 @@ static void pch_gbe_configure_tx(struct pch_gbe_adapter 
*adapter)
   (unsigned long long)adapter->tx_ring->dma,
   adapter->tx_ring->size);
 
+   pch_gbe_disable_dma_tx(hw);
+
/* Setup the HW Tx Head and Tail descriptor pointers */
tdba = adapter->tx_ring->dma;
tdlen = adapter->tx_ring->size - 0x10;
iowrite32(tdba, >reg->TX_DSC_BASE);
iowrite32(tdlen, >reg->TX_DSC_SIZE);
iowrite32(tdba, >reg->TX_DSC_SW_P);
-
-   /* Enables Transmission DMA */
-   dctrl = ioread32(>reg->DMA_CTRL);
-   dctrl |= PCH_GBE_TX_DMA_EN;
-   iowrite32(dctrl, >reg->DMA_CTRL);
 }
 
 /**
@@ -1945,6 +1960,8 @@ int pch_gbe_up(struct pch_gbe_adapter *adapter)
pch_gbe_alloc_tx_buffers(adapter, tx_ring);
pch_gbe_alloc_rx_buffers(adapter, rx_ring, rx_ring->count);
adapter->tx_queue_len = netdev->tx_queue_len;
+
+   pch_gbe_enable_dma_tx(>hw);
pch_gbe_enable_dma_rx(>hw);
pch_gbe_enable_mac_rx(>hw);
 
-- 
2.16.1



[PATCH v5 04/14] net: pch_gbe: Add device tree support

2018-02-17 Thread Paul Burton
Introduce support for retrieving the PHY reset GPIO from device tree,
which will be used on the MIPS Boston development board. This requires
support for probe deferral in order to work correctly, since the order
of device probe is not guaranteed & typically the EG20T GPIO controller
device will be probed after the ethernet MAC.

Signed-off-by: Paul Burton 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org
---

Changes in v5: None
Changes in v4:
- Use ERR_CAST(), thanks kbuild test robot/Fengguang!

Changes in v3: None
Changes in v2:
- Tidy up handling of parsing private data, drop err_out.

 .../net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c   | 31 +-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 712ac2f7bb2c..11e8ced4a0f4 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #define DRV_VERSION "1.01"
 const char pch_driver_version[] = DRV_VERSION;
@@ -2556,13 +2558,40 @@ static void pch_gbe_remove(struct pci_dev *pdev)
free_netdev(netdev);
 }
 
+static struct pch_gbe_privdata *
+pch_gbe_get_priv(struct pci_dev *pdev, const struct pci_device_id *pci_id)
+{
+   struct pch_gbe_privdata *pdata;
+   struct gpio_desc *gpio;
+
+   if (!IS_ENABLED(CONFIG_OF))
+   return (struct pch_gbe_privdata *)pci_id->driver_data;
+
+   pdata = devm_kzalloc(>dev, sizeof(*pdata), GFP_KERNEL);
+   if (!pdata)
+   return ERR_PTR(-ENOMEM);
+
+   gpio = devm_gpiod_get(>dev, "phy-reset", GPIOD_ASIS);
+   if (!IS_ERR(gpio))
+   pdata->phy_reset_gpio = gpio;
+   else if (PTR_ERR(gpio) != -ENOENT)
+   return ERR_CAST(gpio);
+
+   return pdata;
+}
+
 static int pch_gbe_probe(struct pci_dev *pdev,
  const struct pci_device_id *pci_id)
 {
struct net_device *netdev;
struct pch_gbe_adapter *adapter;
+   struct pch_gbe_privdata *pdata;
int ret;
 
+   pdata = pch_gbe_get_priv(pdev, pci_id);
+   if (IS_ERR(pdata))
+   return PTR_ERR(pdata);
+
ret = pcim_enable_device(pdev);
if (ret)
return ret;
@@ -2600,7 +2629,7 @@ static int pch_gbe_probe(struct pci_dev *pdev,
adapter->pdev = pdev;
adapter->hw.back = adapter;
adapter->hw.reg = pcim_iomap_table(pdev)[PCH_GBE_PCI_BAR];
-   adapter->pdata = (struct pch_gbe_privdata *)pci_id->driver_data;
+   adapter->pdata = pdata;
if (adapter->pdata && adapter->pdata->platform_init)
adapter->pdata->platform_init(pdev, adapter->pdata);
 
-- 
2.16.1



[PATCH v5 09/14] net: pch_gbe: Use pch_gbe_disable_dma_rx() in pch_gbe_configure_rx()

2018-02-17 Thread Paul Burton
The pch_gbe_configure_rx() function open-codes the equivalent of
pch_gbe_disable_dma_rx(). Remove the duplication by moving
pch_gbe_disable_dma_rx(), and pch_gbe_enable_dma_rx() for consistency,
to be defined earlier than pch_gbe_configure_rx() and have
pch_gbe_configure_rx() call pch_gbe_disable_dma_rx() rather than
duplicate its functionality.

Signed-off-by: Paul Burton 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org

---

Changes in v5:
- New patch.

Changes in v4: None
Changes in v3: None
Changes in v2: None

 .../net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c   | 48 ++
 1 file changed, 22 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 2d6980603ee4..b6cc4a34ed89 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -831,6 +831,26 @@ static void pch_gbe_irq_enable(struct pch_gbe_adapter 
*adapter)
   ioread32(>reg->INT_EN));
 }
 
+static void pch_gbe_disable_dma_rx(struct pch_gbe_hw *hw)
+{
+   u32 rxdma;
+
+   /* Disable Receive DMA */
+   rxdma = ioread32(>reg->DMA_CTRL);
+   rxdma &= ~PCH_GBE_RX_DMA_EN;
+   iowrite32(rxdma, >reg->DMA_CTRL);
+}
+
+static void pch_gbe_enable_dma_rx(struct pch_gbe_hw *hw)
+{
+   u32 rxdma;
+
+   /* Enables Receive DMA */
+   rxdma = ioread32(>reg->DMA_CTRL);
+   rxdma |= PCH_GBE_RX_DMA_EN;
+   iowrite32(rxdma, >reg->DMA_CTRL);
+}
+
 /**
  * pch_gbe_configure_tx - Configure Transmit Unit after Reset
  * @adapter:  Board private structure
@@ -876,7 +896,7 @@ static void pch_gbe_configure_tx(struct pch_gbe_adapter 
*adapter)
 static void pch_gbe_configure_rx(struct pch_gbe_adapter *adapter)
 {
struct pch_gbe_hw *hw = >hw;
-   u32 rdba, rdlen, rxdma, rx_mode, tcpip;
+   u32 rdba, rdlen, rx_mode, tcpip;
 
rx_mode = PCH_GBE_ADD_FIL_EN |
  PCH_GBE_MLT_FIL_EN |
@@ -897,11 +917,7 @@ static void pch_gbe_configure_rx(struct pch_gbe_adapter 
*adapter)
pch_gbe_mac_force_mac_fc(hw);
 
pch_gbe_disable_mac_rx(hw);
-
-   /* Disables Receive DMA */
-   rxdma = ioread32(>reg->DMA_CTRL);
-   rxdma &= ~PCH_GBE_RX_DMA_EN;
-   iowrite32(rxdma, >reg->DMA_CTRL);
+   pch_gbe_disable_dma_rx(hw);
 
netdev_dbg(adapter->netdev,
   "MAC_RX_EN reg = 0x%08x  DMA_CTRL reg = 0x%08x\n",
@@ -1290,26 +1306,6 @@ void pch_gbe_update_stats(struct pch_gbe_adapter 
*adapter)
spin_unlock_irqrestore(>stats_lock, flags);
 }
 
-static void pch_gbe_disable_dma_rx(struct pch_gbe_hw *hw)
-{
-   u32 rxdma;
-
-   /* Disable Receive DMA */
-   rxdma = ioread32(>reg->DMA_CTRL);
-   rxdma &= ~PCH_GBE_RX_DMA_EN;
-   iowrite32(rxdma, >reg->DMA_CTRL);
-}
-
-static void pch_gbe_enable_dma_rx(struct pch_gbe_hw *hw)
-{
-   u32 rxdma;
-
-   /* Enables Receive DMA */
-   rxdma = ioread32(>reg->DMA_CTRL);
-   rxdma |= PCH_GBE_RX_DMA_EN;
-   iowrite32(rxdma, >reg->DMA_CTRL);
-}
-
 /**
  * pch_gbe_intr - Interrupt Handler
  * @irq:   Interrupt number
-- 
2.16.1



[PATCH v5 14/14] net: pch_gbe: Allow build on MIPS platforms

2018-02-17 Thread Paul Burton
Allow the pch_gbe driver to be built on MIPS platforms, allowing its use
on the MIPS Boston development board.

Signed-off-by: Paul Burton 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org

---

Changes in v5: None
Changes in v4: None
Changes in v3: None
Changes in v2: None

 drivers/net/ethernet/oki-semi/pch_gbe/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/Kconfig 
b/drivers/net/ethernet/oki-semi/pch_gbe/Kconfig
index 5f7a35212796..4d3809ae75e1 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/Kconfig
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/Kconfig
@@ -4,7 +4,7 @@
 
 config PCH_GBE
tristate "OKI SEMICONDUCTOR IOH(ML7223/ML7831) GbE"
-   depends on PCI && (X86_32 || COMPILE_TEST)
+   depends on PCI && (X86_32 || MIPS || COMPILE_TEST)
select MII
select PTP_1588_CLOCK_PCH
select NET_PTP_CLASSIFY
-- 
2.16.1



[PATCH v5 12/14] net: pch_gbe: Fix TX RX descriptor accesses for big endian systems

2018-02-17 Thread Paul Burton
From: Hassan Naveed 

Fix pch_gbe driver for ethernet operations for a big endian CPU.
Values written to and read from transmit and receive descriptors
in the pch_gbe driver are byte swapped from the perspective of a
big endian CPU, since the ethernet controller always operates in
little endian mode. Rectify this by appropriately byte swapping
these descriptor field values in the driver software.

Signed-off-by: Hassan Naveed 
Signed-off-by: Paul Burton 
Reviewed-by: Paul Burton 
Reviewed-by: Matt Redfearn 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org

---

Changes in v5:
- Newly included in this series.

Changes in v4: None
Changes in v3: None
Changes in v2:
- Use __le{16,32} for field types, checked with sparse.

 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h| 22 
 .../net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c   | 66 --
 2 files changed, 46 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
index 8ba9ced2d1fd..7159e39b4685 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
@@ -431,13 +431,13 @@ struct pch_gbe_hw {
  * @reserved2: Reserved
  */
 struct pch_gbe_rx_desc {
-   u32 buffer_addr;
-   u32 tcp_ip_status;
-   u16 rx_words_eob;
-   u16 gbec_status;
+   __le32 buffer_addr;
+   __le32 tcp_ip_status;
+   __le16 rx_words_eob;
+   __le16 gbec_status;
u8 dma_status;
u8 reserved1;
-   u16 reserved2;
+   __le16 reserved2;
 };
 
 /**
@@ -452,14 +452,14 @@ struct pch_gbe_rx_desc {
  * @gbec_status:   GMAC Status
  */
 struct pch_gbe_tx_desc {
-   u32 buffer_addr;
-   u16 length;
-   u16 reserved1;
-   u16 tx_words_eob;
-   u16 tx_frame_ctrl;
+   __le32 buffer_addr;
+   __le16 length;
+   __le16 reserved1;
+   __le16 tx_words_eob;
+   __le16 tx_frame_ctrl;
u8 dma_status;
u8 reserved2;
-   u16 gbec_status;
+   __le16 gbec_status;
 };
 
 
diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 8e3ad7dcef0b..a0b8c8f4b4c9 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -1254,11 +1254,11 @@ static void pch_gbe_tx_queue(struct pch_gbe_adapter 
*adapter,
 
/*-- Set Tx descriptor --*/
tx_desc = PCH_GBE_TX_DESC(*tx_ring, ring_num);
-   tx_desc->buffer_addr = (buffer_info->dma);
-   tx_desc->length = (skb->len);
-   tx_desc->tx_words_eob = ((skb->len + 3));
-   tx_desc->tx_frame_ctrl = (frame_ctrl);
-   tx_desc->gbec_status = (DSC_INIT16);
+   tx_desc->buffer_addr = cpu_to_le32(buffer_info->dma);
+   tx_desc->length = cpu_to_le16(skb->len);
+   tx_desc->tx_words_eob = cpu_to_le16(skb->len + 3);
+   tx_desc->tx_frame_ctrl = cpu_to_le16(frame_ctrl);
+   tx_desc->gbec_status = cpu_to_le16(DSC_INIT16);
 
/* Ensure writes to descriptors complete before DMA begins */
mmiowb();
@@ -1447,8 +1447,8 @@ pch_gbe_alloc_rx_buffers(struct pch_gbe_adapter *adapter,
}
buffer_info->mapped = true;
rx_desc = PCH_GBE_RX_DESC(*rx_ring, i);
-   rx_desc->buffer_addr = (buffer_info->dma);
-   rx_desc->gbec_status = DSC_INIT16;
+   rx_desc->buffer_addr = cpu_to_le32(buffer_info->dma);
+   rx_desc->gbec_status = cpu_to_le16(DSC_INIT16);
 
netdev_dbg(netdev,
   "i = %d  buffer_info->dma = 0x08%llx  
buffer_info->length = 0x%x\n",
@@ -1520,7 +1520,7 @@ static void pch_gbe_alloc_tx_buffers(struct 
pch_gbe_adapter *adapter,
skb_reserve(skb, PCH_GBE_DMA_ALIGN);
buffer_info->skb = skb;
tx_desc = PCH_GBE_TX_DESC(*tx_ring, i);
-   tx_desc->gbec_status = (DSC_INIT16);
+   tx_desc->gbec_status = cpu_to_le16(DSC_INIT16);
}
return;
 }
@@ -1551,11 +1551,12 @@ pch_gbe_clean_tx(struct pch_gbe_adapter *adapter,
i = tx_ring->next_to_clean;
tx_desc = PCH_GBE_TX_DESC(*tx_ring, i);
netdev_dbg(adapter->netdev, "gbec_status:0x%04x  dma_status:0x%04x\n",
-  tx_desc->gbec_status, tx_desc->dma_status);
+  le16_to_cpu(tx_desc->gbec_status), tx_desc->dma_status);
 
unused = PCH_GBE_DESC_UNUSED(tx_ring);
thresh = tx_ring->count - PCH_GBE_TX_WEIGHT;
-   if ((tx_desc->gbec_status == DSC_INIT16) && (unused < thresh))
+   if ((le16_to_cpu(tx_desc->gbec_status) == DSC_INIT16) &&
+   (unused < thresh))
{  /* current marked clean, tx queue filling up, do 

[PATCH v5 00/14] net: pch_gbe: Fixes & MIPS support

2018-02-17 Thread Paul Burton
The Intel EG20T Platform Controller Hub is used on the MIPS Boston
development board to provide various peripherals including ethernet.
This series fixes some issues with the pch_gbe driver discovered whilst
in use on the Boston board, and implements support for device tree which
we use to provide the PHY reset GPIO.

Applies atop v4.16-rc1.

Hassan Naveed (1):
  net: pch_gbe: Fix TX RX descriptor accesses for big endian systems

Paul Burton (13):
  net: pch_gbe: Mark Minnow PHY reset GPIO active low
  net: pch_gbe: Pull PHY GPIO handling out of Minnow code
  dt-bindings: net: Document Intel pch_gbe binding
  net: pch_gbe: Add device tree support
  net: pch_gbe: Always reset PHY along with MAC
  net: pch_gbe: Allow longer for resets
  net: pch_gbe: Fix handling of TX padding
  net: pch_gbe: Fold pch_gbe_setup_[rt]ctl into pch_gbe_configure_[rt]x
  net: pch_gbe: Use pch_gbe_disable_dma_rx() in pch_gbe_configure_rx()
  net: pch_gbe: Disable TX DMA whilst configuring descriptors
  net: pch_gbe: Ensure DMA is ordered with descriptor writes
  ptp: pch: Allow build on MIPS platforms
  net: pch_gbe: Allow build on MIPS platforms

 Documentation/devicetree/bindings/net/pch_gbe.txt  |  25 ++
 drivers/net/ethernet/oki-semi/pch_gbe/Kconfig  |   2 +-
 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h|  27 +-
 .../net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c   | 283 -
 drivers/ptp/Kconfig|   2 +-
 5 files changed, 204 insertions(+), 135 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/pch_gbe.txt

-- 
2.16.1



[PATCH v5 06/14] net: pch_gbe: Allow longer for resets

2018-02-17 Thread Paul Burton
Resets of the EG20T MAC on the MIPS Boston development board take longer
than the 1000 loops that pch_gbe_wait_clr_bit was performing. Rather
than simply increasing the number of loops, switch to using
readl_poll_timeout_atomic() from linux/iopoll.h in order to provide some
independence from the speed of the CPU.

Signed-off-by: Paul Burton 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org

---

Changes in v5:
- Bump up the timeout based on feedback from Marcin.

Changes in v4: None
Changes in v3:
- Switch to using readl_poll_timeout_atomic().

Changes in v2: None

 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 90e795d5cc1c..77f7fbd98e8f 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define DRV_VERSION "1.01"
@@ -318,13 +319,11 @@ s32 pch_gbe_mac_read_mac_addr(struct pch_gbe_hw *hw)
  */
 static void pch_gbe_wait_clr_bit(void *reg, u32 bit)
 {
+   int err;
u32 tmp;
 
-   /* wait busy */
-   tmp = 1000;
-   while ((ioread32(reg) & bit) && --tmp)
-   cpu_relax();
-   if (!tmp)
+   err = readl_poll_timeout_atomic(reg, tmp, !(tmp & bit), 10, 25000);
+   if (err)
pr_err("Error: busy bit is not cleared\n");
 }
 
-- 
2.16.1



[PATCH v5 01/14] net: pch_gbe: Mark Minnow PHY reset GPIO active low

2018-02-17 Thread Paul Burton
The Minnow PHY reset GPIO is set to 0 to enter reset & 1 to leave reset
- that is, it is an active low GPIO. In order to allow for the code to
be made more generic by further patches, indicate to the GPIO subsystem
that the GPIO is active low & invert the values it is set to such that
they reflect logically whether the device is being reset or not.

Signed-off-by: Paul Burton 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org
---

Changes in v5: None
Changes in v4: None
Changes in v3: None
Changes in v2: None

 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 7cd494611a74..d5c6f2e2d3a2 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -2688,7 +2688,8 @@ static int pch_gbe_probe(struct pci_dev *pdev,
  */
 static int pch_gbe_minnow_platform_init(struct pci_dev *pdev)
 {
-   unsigned long flags = GPIOF_DIR_OUT | GPIOF_INIT_HIGH | GPIOF_EXPORT;
+   unsigned long flags = GPIOF_DIR_OUT | GPIOF_INIT_LOW |
+   GPIOF_EXPORT | GPIOF_ACTIVE_LOW;
unsigned gpio = MINNOW_PHY_RESET_GPIO;
int ret;
 
@@ -2700,10 +2701,10 @@ static int pch_gbe_minnow_platform_init(struct pci_dev 
*pdev)
return ret;
}
 
-   gpio_set_value(gpio, 0);
-   usleep_range(1250, 1500);
gpio_set_value(gpio, 1);
usleep_range(1250, 1500);
+   gpio_set_value(gpio, 0);
+   usleep_range(1250, 1500);
 
return ret;
 }
-- 
2.16.1



[PATCH v5 02/14] net: pch_gbe: Pull PHY GPIO handling out of Minnow code

2018-02-17 Thread Paul Burton
The MIPS Boston development board uses the Intel EG20T Platform
Controller Hub, including its gigabit ethernet controller, and requires
that its RTL8211E PHY be reset much like the Minnow platform. Pull the
PHY reset GPIO handling out of Minnow-specific code such that it can be
shared by later patches.

Signed-off-by: Paul Burton 
Cc: David S. Miller 
Cc: linux-m...@linux-mips.org
Cc: netdev@vger.kernel.org
---

Changes in v5:
- Name struct pch_gbe_privdata's platform_init pdata arg, per checkpatch.

Changes in v4: None
Changes in v3:
- Use adapter->pdata as arg to platform_init, to fix bisectability.

Changes in v2: None

 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h|  5 +++-
 .../net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c   | 33 +++---
 2 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
index 697e29dd4bd3..8ba9ced2d1fd 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
@@ -580,15 +580,18 @@ struct pch_gbe_hw_stats {
 
 /**
  * struct pch_gbe_privdata - PCI Device ID driver data
+ * @phy_reset_gpio:PHY reset GPIO descriptor.
  * @phy_tx_clk_delay:  Bool, configure the PHY TX delay in software
  * @phy_disable_hibernate: Bool, disable PHY hibernation
  * @platform_init: Platform initialization callback, called from
  * probe, prior to PHY initialization.
  */
 struct pch_gbe_privdata {
+   struct gpio_desc *phy_reset_gpio;
bool phy_tx_clk_delay;
bool phy_disable_hibernate;
-   int (*platform_init)(struct pci_dev *pdev);
+   int (*platform_init)(struct pci_dev *pdev,
+struct pch_gbe_privdata *pdata);
 };
 
 /**
diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c 
b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index d5c6f2e2d3a2..712ac2f7bb2c 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -360,6 +360,16 @@ static void pch_gbe_mac_mar_set(struct pch_gbe_hw *hw, u8 
* addr, u32 index)
pch_gbe_wait_clr_bit(>reg->ADDR_MASK, PCH_GBE_BUSY);
 }
 
+static void pch_gbe_phy_set_reset(struct pch_gbe_hw *hw, int value)
+{
+   struct pch_gbe_adapter *adapter = pch_gbe_hw_to_adapter(hw);
+
+   if (!adapter->pdata || !adapter->pdata->phy_reset_gpio)
+   return;
+
+   gpiod_set_value(adapter->pdata->phy_reset_gpio, value);
+}
+
 /**
  * pch_gbe_mac_reset_hw - Reset hardware
  * @hw:Pointer to the HW structure
@@ -2592,7 +2602,14 @@ static int pch_gbe_probe(struct pci_dev *pdev,
adapter->hw.reg = pcim_iomap_table(pdev)[PCH_GBE_PCI_BAR];
adapter->pdata = (struct pch_gbe_privdata *)pci_id->driver_data;
if (adapter->pdata && adapter->pdata->platform_init)
-   adapter->pdata->platform_init(pdev);
+   adapter->pdata->platform_init(pdev, adapter->pdata);
+
+   if (adapter->pdata && adapter->pdata->phy_reset_gpio) {
+   pch_gbe_phy_set_reset(>hw, 1);
+   usleep_range(1250, 1500);
+   pch_gbe_phy_set_reset(>hw, 0);
+   usleep_range(1250, 1500);
+   }
 
adapter->ptp_pdev =
pci_get_domain_bus_and_slot(pci_domain_nr(adapter->pdev->bus),
@@ -2686,7 +2703,8 @@ static int pch_gbe_probe(struct pci_dev *pdev,
 /* The AR803X PHY on the MinnowBoard requires a physical pin to be toggled to
  * ensure it is awake for probe and init. Request the line and reset the PHY.
  */
-static int pch_gbe_minnow_platform_init(struct pci_dev *pdev)
+static int pch_gbe_minnow_platform_init(struct pci_dev *pdev,
+   struct pch_gbe_privdata *pdata)
 {
unsigned long flags = GPIOF_DIR_OUT | GPIOF_INIT_LOW |
GPIOF_EXPORT | GPIOF_ACTIVE_LOW;
@@ -2695,16 +2713,11 @@ static int pch_gbe_minnow_platform_init(struct pci_dev 
*pdev)
 
ret = devm_gpio_request_one(>dev, gpio, flags,
"minnow_phy_reset");
-   if (ret) {
+   if (!ret)
+   pdata->phy_reset_gpio = gpio_to_desc(gpio);
+   else
dev_err(>dev,
"ERR: Can't request PHY reset GPIO line '%d'\n", gpio);
-   return ret;
-   }
-
-   gpio_set_value(gpio, 1);
-   usleep_range(1250, 1500);
-   gpio_set_value(gpio, 0);
-   usleep_range(1250, 1500);
 
return ret;
 }
-- 
2.16.1



RE: [net-next v2 1/1] tipc: avoid unnecessary copying of bundled messages

2018-02-17 Thread Jon Maloy


> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Friday, February 16, 2018 21:33
> To: Jon Maloy 
> Cc: netdev@vger.kernel.org; Mohan Krishna Ghanta Krishnamurthy
> ; Tung Quang Nguyen
> ; Hoang Huu Le
> ; Canh Duc Luu
> ; Ying Xue ; tipc-
> discuss...@lists.sourceforge.net
> Subject: Re: [net-next v2 1/1] tipc: avoid unnecessary copying of bundled
> messages
> 
> From: Jon Maloy 
> Date: Thu, 15 Feb 2018 14:14:37 +0100
> 
> > A received sk buffer may contain dozens of smaller 'bundled' messages
> > which after extraction go each in their own direction.
> >
> > Unfortunately, when we extract those messages using skb_clone() each
> > of the extracted buffers inherit the truesize value of the original
> > buffer. Apart from causing massive overaccounting of the base buffer's
> > memory, this often causes tipc_msg_validate() to come to the false
> > conclusion that the ratio truesize/datasize > 4, and perform an
> > unnecessary copying of the extracted buffer.
> >
> > We now fix this problem by explicitly correcting the truesize value of
> > the buffer clones to be the truesize of the clone itself plus a
> > calculated fraction of the base buffer's overhead. This change
> > eliminates the overaccounting and at least mitigates the occurrence of
> > unnecessary buffer copying.
> >
> > Reported-by: Hoang Le 
> > Acked-by: Ying Xue 
> > Signed-off-by: Jon Maloy 
> 
> As I explained in my previous two emails, I don't think this method of
> accounting is appropriate.
> 
> All of your clones must use the same skb->truesize as the original SKB
> because each and every one of them keeps the full buffer from being
> liberated until they are released.

I understand what you are saying, although I am not happy with its consequences 
in this case. I guess I will just leave it the way it is until I can come up 
with something  smarter.
///jon


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal
David Miller  wrote:
> From: Florian Westphal 
> Date: Fri, 16 Feb 2018 17:14:08 +0100
> 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> As Daniel said, iptables is by far the most deployed of the two
> technologies.  Therefore it provides the largest environment for
> testing and coverage.

Right, but the approach of hooking old blob format comes with
lots of limitations that were meant to be resolved with a netlink based
interface which places kernel in a position to mediate all transactions
to the rule database (which isn't fixable with old setsockopt format).

As all programs call iptables(-restore) or variants translation can
be done in userspace to nftables so api spoken is nfnetlink.
Such a translator already exists and can handle some cases already:

nft flush ruleset
nft list ruleset | wc -l
0
xtables-compat-multi iptables -A INPUT -s 192.168.0.24 -j ACCEPT
xtables-compat-multi iptables -A INPUT -s 192.168.0.0/16 -p tcp --dport 22 -j 
ACCEPT
xtables-compat-multi iptables -A INPUT -i eth0 -m conntrack --ctstate 
ESTABLISHED,RELATED -j ACCEPT
xtables-compat-multi iptables -A INPUT -p icmp -j ACCEPT
xtables-compat-multi iptables -N REJECT_LOG
xtables-compat-multi iptables -A REJECT_LOG -i eth0 -p tcp --tcp-flags SYN,ACK 
SYN --dport 22:80 -m limit --limit 1/sec -j LOG --log-prefix 
"RejectTCPConnectReq"
xtables-compat-multi iptables -A REJECT_LOG -j DROP
xtables-compat-multi iptables -A INPUT -j REJECT_LOG

nft list ruleset
table ip filter {
chain INPUT {
type filter hook input priority 0; policy accept;
ip saddr 192.168.0.24 counter packets 0 bytes 0 accept
ip saddr 192.168.0.0/16 tcp dport 22 counter accept
iifname "eth0" ct state related,established counter accept
ip protocol icmp counter packets 0 bytes 0 accept
counter packets 0 bytes 0 jump REJECT_LOG
}

chain FORWARD {
type filter hook forward priority 0; policy accept;
}

chain OUTPUT {
type filter hook output priority 0; policy accept;
}

chain REJECT_LOG {
iifname "eth0" tcp dport 22-80 tcp flags & (syn | ack) == syn 
limit rate 1/second burst 5 packets counter packets 0 bytes 0 log prefix 
"RejectTCPConnectReq"
counter packets 0 bytes 0 drop
}
}

and, while 'iptables' rules were added, nft monitor in different terminal:
nft monitor
add table ip filter
add chain ip filter INPUT { type filter hook input priority 0; policy accept; }
add chain ip filter FORWARD { type filter hook forward priority 0; policy 
accept; }
add chain ip filter OUTPUT { type filter hook output priority 0; policy accept; 
}
add rule ip filter INPUT ip saddr 192.168.0.24 counter packets 0 bytes 0 accept
# new generation 9893 by process 7471 (xtables-compat-)
add rule ip filter INPUT ip saddr 192.168.0.0/16 tcp dport 22 counter accept
# new generation 9894 by process 7504 (xtables-compat-)
add rule ip filter INPUT iifname "eth0" ct state related,established counter 
accept
# new generation 9895 by process 7528 (xtables-compat-)
add rule ip filter INPUT ip protocol icmp counter packets 0 bytes 0 accept
# new generation 9896 by process 7542 (xtables-compat-)
add chain ip filter REJECT_LOG
# new generation 9897 by process 7595 (xtables-compat-)
add rule ip filter REJECT_LOG iifname "eth0" tcp dport 22-80 tcp flags & (syn | 
ack) == syn limit rate 1/second burst 5 packets counter packets 0 bytes 0 log 
prefix "RejectTCPConnectReq"
# new generation 9898 by process 7639 (xtables-compat-)
add rule ip filter REJECT_LOG counter packets 0 bytes 0 drop
# new generation 9899 by process 7657 (xtables-compat-)
add rule ip filter INPUT counter packets 0 bytes 0 jump REJECT_LOG
# new generation 9900 by process 7663 (xtables-compat-)

Now, does this work in all cases?

Unfortunately not -- this is still work-in-progress, so I would
not rm /sbin/iptables and replace it with a link to xtables-compat-multi just 
yet.

(f.e. nftables misses some selinux matches/targets for netlabel so we obviously
can't translate this, same for ipsec sa/policy matching -- but this isn't
impossible to resolve).

Hopefully this does show that at least some commonly used features work
and that we've come a long way to make seamless nftables transition happen.


Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Florian Westphal
Daniel Borkmann  wrote:
> Hi Florian,
> 
> On 02/16/2018 05:14 PM, Florian Westphal wrote:
> > Florian Westphal  wrote:
> >> Daniel Borkmann  wrote:
> >> Several questions spinning at the moment, I will probably come up with
> >> more:
> > 
> > ... and here there are some more ...
> > 
> > One of the many pain points of xtables design is the assumption of 'used
> > only by sysadmin'.
> > 
> > This has not been true for a very long time, so by now iptables has
> > this userspace lock (yes, its fugly workaround) to serialize concurrent
> > iptables invocations in userspace.
> > 
> > AFAIU the translate-in-userspace design now brings back the old problem
> > of different tools overwriting each others iptables rules.
> 
> Right, so the behavior would need to be adapted to be exactly the same,
> given all the requests go into kernel space first via the usual uapis,
> I don't think there would be anything in the way of keeping that as is.

Uff.  This isn't solveable.  At least thats what I tried to say here.
This is a limitation of the xtables setsockopt interface design.

If $docker (or anything else) adds a new rule using plain iptables other
daemons are not aware of it.

If some deletes a rule added by $software it won't learn that either.

The "solutions" in place now (periodic reloads/'is my rule still in
place' etc. are not desirable long-term.

You'll also need 4 decoders for arp/ip/ip6/ebtables plus translations
for all matches and targets xtables currently has. (almost 100 i would
guess from quick glance).

Some of the more crazy ones also have external user visible interfaces
outside setsockopt (proc files, ipset).

> > One of the nftables advantages is that (since rule representation in
> > kernel is black-box from userspace point of view) is that the kernel
> > can announce add/delete of rules or elements from nftables sets.
> > 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> Yeah, correct, this should be possible as well. We started out with the
> iptables part in the demo as the majority of bigger infrastructure projects
> all still rely heavily on it (e.g. docker, k8s to just name two big ones).

Yes, which is why we have translation tools in place.

Just for the fun of it I tried to delete ip/ip6tables binaries on my
fedora27 laptop and replaced them with symlinks to
'xtables-compat-multi'.

Aside from two issues (SELinux denying 'iptables' to use netlink) and
one translation issue (-m rpfilter, which can be translated in current
upstream version) this works out of the box, the translator uses
nftables api to kernel (so kernel doesn't even know which program is
talking...), 'nft monitor' displays the rules being added, and
'nft list ruleset' shows the default firewalld ruleset.

Obviously there are a few limitations, for instance ip6tables-save will
stop working once you add nft-based rules that use features that cannot
be expressed in xtables syntax (it will throw an error message similar
to 'you are using nftables featues not available in xtables, please use
nft'), for intance verdict maps, sets and the like.

> Usually they have their requests to iptables baked into their code directly
> which probably won't change any time soon, so thought was that they could
> benefit initially from it once there would be sufficient coverage.

See above, the translator covers most basic use cases nowadays.
The more extreme cases are not covered because we were reluctant to
provide equivalent in nftables (-m time comes to mind which was always a
PITA because kernel has no notion of timezone or DST transitions,
leading to 'magic' mismatches when timezone changes...

I could explain on more problem cases but none of them are too
important I think.

If you'd like to have more ebpf users in the kernel, then there is at
least one use case where ebpf could be very attractive for nftables
(matching dynamic headers and the like).  This would be a new
feature and would need changes on nftables userspace side
as well (we don't have syntax/grammar to represent this in either
nft or iptables).

In most basic form, it would be nftables replacement for '-m string'
(and perhaps also -m bpf to some degree, depends on how it would be
 realized).

We can discuss more if there is interest, but I think it
would be more suitable for conference/face to face discussion.


Re: [PATCH v2 net-next 0/3] Remove IPVlan module dependencies on IPv6 and L3 Master dev

2018-02-17 Thread Matteo Croce
On Sat, Feb 17, 2018 at 8:11 PM, Matteo Croce  wrote:
> The IPVlan module currently depends on IPv6 and L3 Master dev.
> Refactor the code to allow building IPVlan module regardless of the value of
> CONFIG_IPV6 and CONFIG_NETFILTER, and change the CONFIG_NET_L3_MASTER_DEV
> dependency into a select, as compiling L3 Master device alone has no sense.
>
> $ grep -wE 'CONFIG_(IPV6|IPVLAN)' .config
> CONFIG_IPV6=y
> CONFIG_IPVLAN=m
> $ ll drivers/net/ipvlan/ipvlan.ko
> 48K drivers/net/ipvlan/ipvlan.ko
>
> $ grep -wE 'CONFIG_(IPV6|IPVLAN)' .config
> # CONFIG_IPV6 is not set
> CONFIG_IPVLAN=m
> $ ll drivers/net/ipvlan/ipvlan.ko
> 44K drivers/net/ipvlan/ipvlan.ko
>
> Matteo Croce (2):
>   ipvlan: drop ipv6 dependency
>   ipvlan: selects master_l3 device instead of depending on it
>
>  drivers/net/Kconfig  |  3 +-
>  drivers/net/ipvlan/ipvlan_core.c | 71 
> ++--
>  drivers/net/ipvlan/ipvlan_main.c | 48 +--
>  3 files changed, 85 insertions(+), 37 deletions(-)
>
> --
> 2.14.3
>

Just noticed the wrong subject, really it's 0/2

Regards,
-- 
Matteo Croce
per aspera ad upstream


[PATCH v2 net-next 1/2] ipvlan: drop ipv6 dependency

2018-02-17 Thread Matteo Croce
IPVlan has an hard dependency on IPv6.
Refactor the ipvlan code to allow compiling it with IPv6 disabled.

Signed-off-by: Matteo Croce 
---
 drivers/net/Kconfig  |  1 -
 drivers/net/ipvlan/ipvlan_core.c | 71 ++--
 drivers/net/ipvlan/ipvlan_main.c | 48 +--
 3 files changed, 84 insertions(+), 36 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 944ec3c9282c..3234c6618d75 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -149,7 +149,6 @@ config MACVTAP
 config IPVLAN
 tristate "IP-VLAN support"
 depends on INET
-depends on IPV6
 depends on NETFILTER
 depends on NET_L3_MASTER_DEV
 ---help---
diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index c1f008fe4e1d..653b00738616 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -35,6 +35,7 @@ void ipvlan_count_rx(const struct ipvl_dev *ipvlan,
 }
 EXPORT_SYMBOL_GPL(ipvlan_count_rx);
 
+#if IS_ENABLED(CONFIG_IPV6)
 static u8 ipvlan_get_v6_hash(const void *iaddr)
 {
const struct in6_addr *ip6_addr = iaddr;
@@ -42,6 +43,12 @@ static u8 ipvlan_get_v6_hash(const void *iaddr)
return __ipv6_addr_jhash(ip6_addr, ipvlan_jhash_secret) &
   IPVLAN_HASH_MASK;
 }
+#else
+static u8 ipvlan_get_v6_hash(const void *iaddr)
+{
+   return 0;
+}
+#endif
 
 static u8 ipvlan_get_v4_hash(const void *iaddr)
 {
@@ -51,6 +58,22 @@ static u8 ipvlan_get_v4_hash(const void *iaddr)
   IPVLAN_HASH_MASK;
 }
 
+static bool addr_equal(bool is_v6, struct ipvl_addr *addr, const void *iaddr) {
+   if (!is_v6 && addr->atype == IPVL_IPV4) {
+   struct in_addr *i4addr = (struct in_addr *)iaddr;
+
+   return addr->ip4addr.s_addr == i4addr->s_addr;
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (is_v6 && addr->atype == IPVL_IPV6) {
+   struct in6_addr *i6addr = (struct in6_addr *)iaddr;
+
+   return ipv6_addr_equal(>ip6addr, i6addr);
+#endif
+   }
+
+   return false;
+}
+
 static struct ipvl_addr *ipvlan_ht_addr_lookup(const struct ipvl_port *port,
   const void *iaddr, bool is_v6)
 {
@@ -59,15 +82,9 @@ static struct ipvl_addr *ipvlan_ht_addr_lookup(const struct 
ipvl_port *port,
 
hash = is_v6 ? ipvlan_get_v6_hash(iaddr) :
   ipvlan_get_v4_hash(iaddr);
-   hlist_for_each_entry_rcu(addr, >hlhead[hash], hlnode) {
-   if (is_v6 && addr->atype == IPVL_IPV6 &&
-   ipv6_addr_equal(>ip6addr, iaddr))
+   hlist_for_each_entry_rcu(addr, >hlhead[hash], hlnode)
+   if (addr_equal(is_v6, addr, iaddr))
return addr;
-   else if (!is_v6 && addr->atype == IPVL_IPV4 &&
-addr->ip4addr.s_addr ==
-   ((struct in_addr *)iaddr)->s_addr)
-   return addr;
-   }
return NULL;
 }
 
@@ -93,13 +110,9 @@ struct ipvl_addr *ipvlan_find_addr(const struct ipvl_dev 
*ipvlan,
 {
struct ipvl_addr *addr;
 
-   list_for_each_entry(addr, >addrs, anode) {
-   if ((is_v6 && addr->atype == IPVL_IPV6 &&
-   ipv6_addr_equal(>ip6addr, iaddr)) ||
-   (!is_v6 && addr->atype == IPVL_IPV4 &&
-   addr->ip4addr.s_addr == ((struct in_addr *)iaddr)->s_addr))
+   list_for_each_entry(addr, >addrs, anode)
+   if (addr_equal(is_v6, addr, iaddr))
return addr;
-   }
return NULL;
 }
 
@@ -150,6 +163,7 @@ static void *ipvlan_get_L3_hdr(struct ipvl_port *port, 
struct sk_buff *skb, int
lyr3h = ip4h;
break;
}
+#if IS_ENABLED(CONFIG_IPV6)
case htons(ETH_P_IPV6): {
struct ipv6hdr *ip6h;
 
@@ -188,6 +202,7 @@ static void *ipvlan_get_L3_hdr(struct ipvl_port *port, 
struct sk_buff *skb, int
}
break;
}
+#endif
default:
return NULL;
}
@@ -337,14 +352,18 @@ static struct ipvl_addr *ipvlan_addr_lookup(struct 
ipvl_port *port,
 {
struct ipvl_addr *addr = NULL;
 
-   if (addr_type == IPVL_IPV6) {
+   switch (addr_type) {
+#if IS_ENABLED(CONFIG_IPV6)
+   case IPVL_IPV6: {
struct ipv6hdr *ip6h;
struct in6_addr *i6addr;
 
ip6h = (struct ipv6hdr *)lyr3h;
i6addr = use_dest ? >daddr : >saddr;
addr = ipvlan_ht_addr_lookup(port, i6addr, true);
-   } else if (addr_type == IPVL_ICMPV6) {
+   break;
+   }
+   case IPVL_ICMPV6: {
struct nd_msg *ndmh;
struct in6_addr *i6addr;
 
@@ -356,14 +375,19 @@ static struct ipvl_addr *ipvlan_addr_lookup(struct 
ipvl_port *port,
i6addr = >target;
addr = 

[PATCH v2 net-next 0/3] Remove IPVlan module dependencies on IPv6 and L3 Master dev

2018-02-17 Thread Matteo Croce
The IPVlan module currently depends on IPv6 and L3 Master dev.
Refactor the code to allow building IPVlan module regardless of the value of
CONFIG_IPV6 and CONFIG_NETFILTER, and change the CONFIG_NET_L3_MASTER_DEV
dependency into a select, as compiling L3 Master device alone has no sense.

$ grep -wE 'CONFIG_(IPV6|IPVLAN)' .config
CONFIG_IPV6=y
CONFIG_IPVLAN=m
$ ll drivers/net/ipvlan/ipvlan.ko
48K drivers/net/ipvlan/ipvlan.ko

$ grep -wE 'CONFIG_(IPV6|IPVLAN)' .config
# CONFIG_IPV6 is not set
CONFIG_IPVLAN=m
$ ll drivers/net/ipvlan/ipvlan.ko
44K drivers/net/ipvlan/ipvlan.ko

Matteo Croce (2):
  ipvlan: drop ipv6 dependency
  ipvlan: selects master_l3 device instead of depending on it

 drivers/net/Kconfig  |  3 +-
 drivers/net/ipvlan/ipvlan_core.c | 71 ++--
 drivers/net/ipvlan/ipvlan_main.c | 48 +--
 3 files changed, 85 insertions(+), 37 deletions(-)

-- 
2.14.3



[PATCH v2 net-next 2/2] ipvlan: selects master_l3 device instead of depending on it

2018-02-17 Thread Matteo Croce
The L3 Master device is just a glue between the core networking code and
device drivers, so it should be selected automatically rather than
requiring to be enabled explicitly.

Signed-off-by: Matteo Croce 
---
 drivers/net/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 3234c6618d75..d88b78a17440 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -150,7 +150,7 @@ config IPVLAN
 tristate "IP-VLAN support"
 depends on INET
 depends on NETFILTER
-depends on NET_L3_MASTER_DEV
+select NET_L3_MASTER_DEV
 ---help---
   This allows one to create virtual devices off of a main interface
   and packets will be delivered based on the dest L3 (IPv6/IPv4 addr)
-- 
2.14.3



Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-17 Thread Eric Dumazet
On Sat, 2018-02-17 at 11:01 +0100, Oleksandr Natalenko wrote:
> Hi.
> 
> On pátek 16. února 2018 23:59:52 CET Eric Dumazet wrote:
> > Well, no effect  here on e1000e (1 Gbit) at least
> > 
> > # ethtool -K eth3 sg off
> > Actual changes:
> > scatter-gather: off
> > tx-scatter-gather: off
> > tcp-segmentation-offload: off
> > tx-tcp-segmentation: off [requested on]
> > tx-tcp6-segmentation: off [requested on]
> > generic-segmentation-offload: off [requested on]
> > 
> > # tc qd replace dev eth3 root pfifo_fast
> > # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> > 941
> > # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> > 941
> > # tc qd replace dev eth3 root fq
> > # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> > 941
> > # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> > 941
> > # tc qd replace dev eth3 root fq_codel
> > # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> > 941
> > # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> > 941
> > #
> 
> That really looks strange to me. I'm able to reproduce the effect caused by 
> disabling scatter-gather even on the VM (using iperf3, as usual):

This must be some race condition in the code I added in TCP for self-
pacing, when a sort timeout is programmed.

Disabling SG means TCP cooks 1-MSS packets.

I will take a look, probably after the (long) week-end : Tuesday.

Thanks !



Re: [net-next v2 2/2] bpf: Add eBPF seccomp sample programs

2018-02-17 Thread Randy Dunlap
On 02/16/2018 11:36 PM, Sargun Dhillon wrote:
> + close(111);
> + assert(errno == EBADF);
> + close(999);
> + assert(errno = EPERM);

should that be   == ?

> +
> + return 0;
> +}


-- 
~Randy


Re: [net-next PATCH v1 3/3] net: stmmac: dwmac-meson8b: make the clock configurations private

2018-02-17 Thread Martin Blumenstingl
Hi Jerome,

On Sat, Feb 17, 2018 at 5:41 PM, Jerome Brunet  wrote:
> On Sat, 2018-02-17 at 15:08 +0100, Martin Blumenstingl wrote:
>> The common clock framework needs access to the "clock configuration"
>> structs during runtime.
>> However, only the common clock framework should access these. Ensure
>> this by moving the configuration structs out of struct meson8b_dwmac,
>> so only meson8b_init_rgmii_tx_clk() and the common clock framework know
>> about these configurations.
>>
>> Suggested-by: Jerome Brunet 
>> Signed-off-by: Martin Blumenstingl 
>
> Acked-by: Jerome Brunet 
thank you reviewing this!

>> ---
>>  .../net/ethernet/stmicro/stmmac/dwmac-meson8b.c| 45 
>> --
>>  1 file changed, 24 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c 
>> b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
>> index 0dfce35c5583..2d5d4aea3bcb 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
>> @@ -49,19 +49,17 @@
>>
>>  struct meson8b_dwmac {
>> struct device   *dev;
>> -
>> void __iomem*regs;
>> -
>> phy_interface_t phy_mode;
>> +   struct clk  *rgmii_tx_clk;
>> +   u32 tx_delay_ns;
>> +};
>>
>> +struct meson8b_dwmac_clk_configs {
>
> Not too sure we needed a struct for this, but it does work and does not matter
> much
I tried it without this struct: this resulted in even more code
because every struct clk_* would have to be devm_kzalloc()'ed, along
with a "if (!result) return -ENOMEM" (which makes it 3 lines per
struct clk_*)
either way: it's still an improvement as per commit message

>> struct clk_mux  m250_mux;
>> struct clk_divider  m250_div;
>> struct clk_fixed_factor fixed_div2;
>> struct clk_gate rgmii_tx_en;
>> -
>> -   struct clk  *rgmii_tx_clk;
>> -
>> -   u32 tx_delay_ns;
>>  };
>

Regards
Martin


Re: [RFC PATCH v3 2/3] virtio_net: Extend virtio to use VF datapath when available

2018-02-17 Thread Alexander Duyck
On Fri, Feb 16, 2018 at 7:04 PM, Jakub Kicinski  wrote:
> On Fri, 16 Feb 2018 10:11:21 -0800, Sridhar Samudrala wrote:
>> This patch enables virtio_net to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration
>> of a VM with a direct attached VF without the need to setup a bond/team
>> between a VF and virtio net device in the guest.
>>
>> The hypervisor needs to enable only one datapath at any time so that
>> packets don't get looped back to the VM over the other datapath. When a VF
>> is plugged, the virtio datapath link state can be marked as down. The
>> hypervisor needs to unplug the VF device from the guest on the source host
>> and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed,
>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>> to the guest to switch over to VF datapath.
>>
>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> created that acts as a master device and tracks the state of the 2 lower
>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> passthru device with the same MAC is registered as 'active' netdev.
>>
>> This patch is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization=151189725224231=2
>>
>> Signed-off-by: Sridhar Samudrala 
>> Signed-off-by: Alexander Duyck 
>
>> +static void
>> +virtnet_bypass_get_stats(struct net_device *dev,
>> +  struct rtnl_link_stats64 *stats)
>> +{
>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> + const struct rtnl_link_stats64 *new;
>> + struct rtnl_link_stats64 temp;
>> + struct net_device *child_netdev;
>> +
>> + spin_lock(>stats_lock);
>> + memcpy(stats, >bypass_stats, sizeof(*stats));
>> +
>> + rcu_read_lock();
>> +
>> + child_netdev = rcu_dereference(vbi->active_netdev);
>> + if (child_netdev) {
>> + new = dev_get_stats(child_netdev, );
>> + virtnet_bypass_fold_stats(stats, new, >active_stats);
>> + memcpy(>active_stats, new, sizeof(*new));
>> + }
>> +
>> + child_netdev = rcu_dereference(vbi->backup_netdev);
>> + if (child_netdev) {
>> + new = dev_get_stats(child_netdev, );
>> + virtnet_bypass_fold_stats(stats, new, >backup_stats);
>> + memcpy(>backup_stats, new, sizeof(*new));
>> + }
>> +
>> + rcu_read_unlock();
>> +
>> + memcpy(>bypass_stats, stats, sizeof(*stats));
>> + spin_unlock(>stats_lock);
>> +}
>> +
>> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> + struct net_device *child_netdev;
>> + int ret = 0;
>> +
>> + child_netdev = rcu_dereference(vbi->active_netdev);
>> + if (child_netdev) {
>> + ret = dev_set_mtu(child_netdev, new_mtu);
>> + if (ret)
>> + return ret;
>> + }
>> +
>> + child_netdev = rcu_dereference(vbi->backup_netdev);
>> + if (child_netdev) {
>> + ret = dev_set_mtu(child_netdev, new_mtu);
>> + if (ret)
>> + netdev_err(child_netdev,
>> +"Unexpected failure to set mtu to %d\n",
>> +new_mtu);
>
> You should probably unwind if set fails on one of the legs.

Actually if we know that the backup is always going to be a virtio I
don't know if we even need to worry about the call failing. Last I
knew virtio_net doesn't implement ndo_change_mtu so I don't think it
is an issue. Unless a notifier blows up about it somewhere I don't
think there is anything that should prevent us from updating the MTU.

One interesting thing we may want to take a look at would be to tweak
the ordering of things based on if we are increasing or decreasing the
MTU. In the case of a increase we need to work from the bottom up, but
in the case of a decrease I wonder if we shouldn't be working from the
top down in order to guarantee we don't create an MTU mismatch
somewhere in the path.

>> + }
>> +
>> + dev->mtu = new_mtu;
>> + return 0;
>> +}
>
> nit: stats, mtu, all those mundane things are implemented in team
>  already.  If we had this as kernel-internal team mode we wouldn't
>  have to reimplement them...  You probably did investigate that
>  option, for my edification, would you mind saying what the
>  challenges/downsides were?

So I tried working with the bonding driver to get down to what we
needed. The issue is there are so many controls and such exposed that
trying to pull them out and generate a simple bond became very
difficult to get done. It was just much easier to start over versus
trying to take an existing interface and pare 

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-17 Thread Alexander Duyck
On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:
> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>> Ppatch 2 is in response to the community request for a 3 netdev
>> solution.  However, it creates some issues we'll get into in a moment.
>> It extends virtio_net to use alternate datapath when available and
>> registered. When BACKUP feature is enabled, virtio_net driver creates
>> an additional 'bypass' netdev that acts as a master device and controls
>> 2 slave devices.  The original virtio_net netdev is registered as
>> 'backup' netdev and a passthru/vf device with the same MAC gets
>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>> associated with the same 'pci' device.  The user accesses the network
>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>> as default for transmits when it is available with link up and running.
>
> Thank you do doing this.
>
>> We noticed a couple of issues with this approach during testing.
>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>   virtio pci device, udev tries to rename both of them with the same name
>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>   to rename the 2 netdevs is not reliable.
>
> Out of curiosity - why do you link the master netdev to the virtio
> struct device?

The basic idea of all this is that we wanted this to work with an
existing VM image that was using virtio. As such we were trying to
make it so that the bypass interface takes the place of the original
virtio and get udev to rename the bypass to what the original
virtio_net was.

> FWIW two solutions that immediately come to mind is to export "backup"
> as phys_port_name of the backup virtio link and/or assign a name to the
> master like you are doing already.  I think team uses team%d and bond
> uses bond%d, soft naming of master devices seems quite natural in this
> case.

I figured I had overlooked something like that.. Thanks for pointing
this out. Okay so I think the phys_port_name approach might resolve
the original issue. If I am reading things correctly what we end up
with is the master showing up as "ens1" for example and the backup
showing up as "ens1nbackup". Am I understanding that right?

The problem with the team/bond%d approach is that it creates a new
netdevice and so it would require guest configuration changes.

> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
> link is quite neat.

I agree. For non-"backup" virio_net devices would it be okay for us to
just return -EOPNOTSUPP? I assume it would be and that way the legacy
behavior could be maintained although the function still exists.

>> - When the 'active' netdev is unplugged OR not present on a destination
>>   system after live migration, the user will see 2 virtio_net netdevs.
>
> That's necessary and expected, all configuration applies to the master
> so master must exist.

With the naming issue resolved this is the only item left outstanding.
This becomes a matter of form vs function.

The main complaint about the "3 netdev" solution is a bit confusing to
have the 2 netdevs present if the VF isn't there. The idea is that
having the extra "master" netdev there if there isn't really a bond is
a bit ugly.

The downside of the "2 netdev" solution is that you have to deal with
an extra layer of locking/queueing to get to the VF and you lose some
functionality since things like in-driver XDP have to be disabled in
order to maintain the same functionality when the VF is present or
not. However it looks more like classic virtio_net when the VF is not
present.


Re: [net-next PATCH v1 3/3] net: stmmac: dwmac-meson8b: make the clock configurations private

2018-02-17 Thread Jerome Brunet
On Sat, 2018-02-17 at 15:08 +0100, Martin Blumenstingl wrote:
> The common clock framework needs access to the "clock configuration"
> structs during runtime.
> However, only the common clock framework should access these. Ensure
> this by moving the configuration structs out of struct meson8b_dwmac,
> so only meson8b_init_rgmii_tx_clk() and the common clock framework know
> about these configurations.
> 
> Suggested-by: Jerome Brunet 
> Signed-off-by: Martin Blumenstingl 

Acked-by: Jerome Brunet 

> ---
>  .../net/ethernet/stmicro/stmmac/dwmac-meson8b.c| 45 
> --
>  1 file changed, 24 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c 
> b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
> index 0dfce35c5583..2d5d4aea3bcb 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
> @@ -49,19 +49,17 @@
>  
>  struct meson8b_dwmac {
> struct device   *dev;
> -
> void __iomem*regs;
> -
> phy_interface_t phy_mode;
> +   struct clk  *rgmii_tx_clk;
> +   u32 tx_delay_ns;
> +};
>  
> +struct meson8b_dwmac_clk_configs {

Not too sure we needed a struct for this, but it does work and does not matter
much

> struct clk_mux  m250_mux;
> struct clk_divider  m250_div;
> struct clk_fixed_factor fixed_div2;
> struct clk_gate rgmii_tx_en;
> -
> -   struct clk  *rgmii_tx_clk;
> -
> -   u32 tx_delay_ns;
>  };



Re: [net-next PATCH v1 1/3] net: stmmac: dwmac-meson8b: simplify clock registration

2018-02-17 Thread Jerome Brunet
On Sat, 2018-02-17 at 15:08 +0100, Martin Blumenstingl wrote:
> To goal of this patch is to simplify the registration of the RGMII TX
> clock (and it's parent clocks). This is achieved by:
> - introducing the meson8b_dwmac_register_clk helper-function to remove
>   code duplication when registering a single clock (this saves a few
>   lines since we have 4 clocks internally)
> - using devm_add_action_or_reset to disable the RGMII TX clock
>   automatically when needed. This also allows us to re-use the standard
>   stmmac_pltfr_remove function.
> - devm_kasprintf() and devm_kstrdup() are not used anymore to generate
>   the clock name (these are replaced by a variable on the stack) because
>   the common clock framework already uses kstrdup() internally.
> 
> No functional changes intended.
> 
> Signed-off-by: Martin Blumenstingl 

Reviewed-by: Jerome Brunet 


[net-next PATCH v1 1/3] net: stmmac: dwmac-meson8b: simplify clock registration

2018-02-17 Thread Martin Blumenstingl
To goal of this patch is to simplify the registration of the RGMII TX
clock (and it's parent clocks). This is achieved by:
- introducing the meson8b_dwmac_register_clk helper-function to remove
  code duplication when registering a single clock (this saves a few
  lines since we have 4 clocks internally)
- using devm_add_action_or_reset to disable the RGMII TX clock
  automatically when needed. This also allows us to re-use the standard
  stmmac_pltfr_remove function.
- devm_kasprintf() and devm_kstrdup() are not used anymore to generate
  the clock name (these are replaced by a variable on the stack) because
  the common clock framework already uses kstrdup() internally.

No functional changes intended.

Signed-off-by: Martin Blumenstingl 
---
 .../net/ethernet/stmicro/stmmac/dwmac-meson8b.c| 156 +
 1 file changed, 65 insertions(+), 91 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
index 5270d26f0bc6..84a9a900e74e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
@@ -55,17 +55,11 @@ struct meson8b_dwmac {
phy_interface_t phy_mode;
 
struct clk_mux  m250_mux;
-   struct clk  *m250_mux_clk;
-   struct clk  *m250_mux_parent[MUX_CLK_NUM_PARENTS];
-
struct clk_divider  m250_div;
-   struct clk  *m250_div_clk;
-
struct clk_fixed_factor fixed_div2;
-   struct clk  *fixed_div2_clk;
-
struct clk_gate rgmii_tx_en;
-   struct clk  *rgmii_tx_en_clk;
+
+   struct clk  *rgmii_tx_clk;
 
u32 tx_delay_ns;
 };
@@ -82,106 +76,95 @@ static void meson8b_dwmac_mask_bits(struct meson8b_dwmac 
*dwmac, u32 reg,
writel(data, dwmac->regs + reg);
 }
 
-static int meson8b_init_rgmii_tx_clk(struct meson8b_dwmac *dwmac)
+static struct clk *meson8b_dwmac_register_clk(struct meson8b_dwmac *dwmac,
+ const char *name_suffix,
+ const char **parent_names,
+ int num_parents,
+ const struct clk_ops *ops,
+ struct clk_hw *hw)
 {
+   struct device *dev = >pdev->dev;
struct clk_init_data init;
+   char clk_name[32];
+
+   snprintf(clk_name, sizeof(clk_name), "%s#%s", dev_name(dev),
+name_suffix);
+
+   init.name = clk_name;
+   init.ops = ops;
+   init.flags = CLK_SET_RATE_PARENT;
+   init.parent_names = parent_names;
+   init.num_parents = num_parents;
+
+   hw->init = 
+
+   return devm_clk_register(dev, hw);
+}
+
+static int meson8b_init_rgmii_tx_clk(struct meson8b_dwmac *dwmac)
+{
int i, ret;
+   struct clk *clk;
struct device *dev = >pdev->dev;
-   char clk_name[32];
-   const char *clk_div_parents[1];
-   const char *mux_parent_names[MUX_CLK_NUM_PARENTS];
+   const char *parent_name, *mux_parent_names[MUX_CLK_NUM_PARENTS];
 
/* get the mux parents from DT */
for (i = 0; i < MUX_CLK_NUM_PARENTS; i++) {
char name[16];
 
snprintf(name, sizeof(name), "clkin%d", i);
-   dwmac->m250_mux_parent[i] = devm_clk_get(dev, name);
-   if (IS_ERR(dwmac->m250_mux_parent[i])) {
-   ret = PTR_ERR(dwmac->m250_mux_parent[i]);
+   clk = devm_clk_get(dev, name);
+   if (IS_ERR(clk)) {
+   ret = PTR_ERR(clk);
if (ret != -EPROBE_DEFER)
dev_err(dev, "Missing clock %s\n", name);
return ret;
}
 
-   mux_parent_names[i] =
-   __clk_get_name(dwmac->m250_mux_parent[i]);
+   mux_parent_names[i] = __clk_get_name(clk);
}
 
-   /* create the m250_mux */
-   snprintf(clk_name, sizeof(clk_name), "%s#m250_sel", dev_name(dev));
-   init.name = clk_name;
-   init.ops = _mux_ops;
-   init.flags = CLK_SET_RATE_PARENT;
-   init.parent_names = mux_parent_names;
-   init.num_parents = MUX_CLK_NUM_PARENTS;
-
dwmac->m250_mux.reg = dwmac->regs + PRG_ETH0;
dwmac->m250_mux.shift = PRG_ETH0_CLK_M250_SEL_SHIFT;
dwmac->m250_mux.mask = PRG_ETH0_CLK_M250_SEL_MASK;
-   dwmac->m250_mux.flags = 0;
-   dwmac->m250_mux.table = NULL;
-   dwmac->m250_mux.hw.init = 
-
-   dwmac->m250_mux_clk = devm_clk_register(dev, >m250_mux.hw);
-   if (WARN_ON(IS_ERR(dwmac->m250_mux_clk)))
-   return PTR_ERR(dwmac->m250_mux_clk);
-
-   /* create the m250_div */
-   snprintf(clk_name, sizeof(clk_name), "%s#m250_div", 

[net-next PATCH v1 3/3] net: stmmac: dwmac-meson8b: make the clock configurations private

2018-02-17 Thread Martin Blumenstingl
The common clock framework needs access to the "clock configuration"
structs during runtime.
However, only the common clock framework should access these. Ensure
this by moving the configuration structs out of struct meson8b_dwmac,
so only meson8b_init_rgmii_tx_clk() and the common clock framework know
about these configurations.

Suggested-by: Jerome Brunet 
Signed-off-by: Martin Blumenstingl 
---
 .../net/ethernet/stmicro/stmmac/dwmac-meson8b.c| 45 --
 1 file changed, 24 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
index 0dfce35c5583..2d5d4aea3bcb 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
@@ -49,19 +49,17 @@
 
 struct meson8b_dwmac {
struct device   *dev;
-
void __iomem*regs;
-
phy_interface_t phy_mode;
+   struct clk  *rgmii_tx_clk;
+   u32 tx_delay_ns;
+};
 
+struct meson8b_dwmac_clk_configs {
struct clk_mux  m250_mux;
struct clk_divider  m250_div;
struct clk_fixed_factor fixed_div2;
struct clk_gate rgmii_tx_en;
-
-   struct clk  *rgmii_tx_clk;
-
-   u32 tx_delay_ns;
 };
 
 static void meson8b_dwmac_mask_bits(struct meson8b_dwmac *dwmac, u32 reg,
@@ -106,6 +104,11 @@ static int meson8b_init_rgmii_tx_clk(struct meson8b_dwmac 
*dwmac)
struct clk *clk;
struct device *dev = dwmac->dev;
const char *parent_name, *mux_parent_names[MUX_CLK_NUM_PARENTS];
+   struct meson8b_dwmac_clk_configs *clk_configs;
+
+   clk_configs = devm_kzalloc(dev, sizeof(*clk_configs), GFP_KERNEL);
+   if (!clk_configs)
+   return -ENOMEM;
 
/* get the mux parents from DT */
for (i = 0; i < MUX_CLK_NUM_PARENTS; i++) {
@@ -123,43 +126,43 @@ static int meson8b_init_rgmii_tx_clk(struct meson8b_dwmac 
*dwmac)
mux_parent_names[i] = __clk_get_name(clk);
}
 
-   dwmac->m250_mux.reg = dwmac->regs + PRG_ETH0;
-   dwmac->m250_mux.shift = PRG_ETH0_CLK_M250_SEL_SHIFT;
-   dwmac->m250_mux.mask = PRG_ETH0_CLK_M250_SEL_MASK;
+   clk_configs->m250_mux.reg = dwmac->regs + PRG_ETH0;
+   clk_configs->m250_mux.shift = PRG_ETH0_CLK_M250_SEL_SHIFT;
+   clk_configs->m250_mux.mask = PRG_ETH0_CLK_M250_SEL_MASK;
clk = meson8b_dwmac_register_clk(dwmac, "m250_sel", mux_parent_names,
 MUX_CLK_NUM_PARENTS, _mux_ops,
->m250_mux.hw);
+_configs->m250_mux.hw);
if (WARN_ON(IS_ERR(clk)))
return PTR_ERR(clk);
 
parent_name = __clk_get_name(clk);
-   dwmac->m250_div.reg = dwmac->regs + PRG_ETH0;
-   dwmac->m250_div.shift = PRG_ETH0_CLK_M250_DIV_SHIFT;
-   dwmac->m250_div.width = PRG_ETH0_CLK_M250_DIV_WIDTH;
-   dwmac->m250_div.flags = CLK_DIVIDER_ONE_BASED |
+   clk_configs->m250_div.reg = dwmac->regs + PRG_ETH0;
+   clk_configs->m250_div.shift = PRG_ETH0_CLK_M250_DIV_SHIFT;
+   clk_configs->m250_div.width = PRG_ETH0_CLK_M250_DIV_WIDTH;
+   clk_configs->m250_div.flags = CLK_DIVIDER_ONE_BASED |
CLK_DIVIDER_ALLOW_ZERO |
CLK_DIVIDER_ROUND_CLOSEST;
clk = meson8b_dwmac_register_clk(dwmac, "m250_div", _name, 1,
 _divider_ops,
->m250_div.hw);
+_configs->m250_div.hw);
if (WARN_ON(IS_ERR(clk)))
return PTR_ERR(clk);
 
parent_name = __clk_get_name(clk);
-   dwmac->fixed_div2.mult = 1;
-   dwmac->fixed_div2.div = 2;
+   clk_configs->fixed_div2.mult = 1;
+   clk_configs->fixed_div2.div = 2;
clk = meson8b_dwmac_register_clk(dwmac, "fixed_div2", _name, 1,
 _fixed_factor_ops,
->fixed_div2.hw);
+_configs->fixed_div2.hw);
if (WARN_ON(IS_ERR(clk)))
return PTR_ERR(clk);
 
parent_name = __clk_get_name(clk);
-   dwmac->rgmii_tx_en.reg = dwmac->regs + PRG_ETH0;
-   dwmac->rgmii_tx_en.bit_idx = PRG_ETH0_RGMII_TX_CLK_EN;
+   clk_configs->rgmii_tx_en.reg = dwmac->regs + PRG_ETH0;
+   clk_configs->rgmii_tx_en.bit_idx = PRG_ETH0_RGMII_TX_CLK_EN;
clk = meson8b_dwmac_register_clk(dwmac, "rgmii_tx_en", _name, 1,
 _gate_ops,
->rgmii_tx_en.hw);
+_configs->rgmii_tx_en.hw);
if (WARN_ON(IS_ERR(clk)))

[net-next PATCH v1 2/3] net: stmmac: dwmac-meson8b: only keep struct device around

2018-02-17 Thread Martin Blumenstingl
Nothing in the dwmac-meson8b driver (except .probe itself) requires the
platform_device anymore after .probe has finished. Replace it with a
pointer to struct device since this is what the functions inside the
driver are actually accessing.
No functional changes.

Signed-off-by: Martin Blumenstingl 
---
 drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
index 84a9a900e74e..0dfce35c5583 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
@@ -48,7 +48,7 @@
 #define MUX_CLK_NUM_PARENTS2
 
 struct meson8b_dwmac {
-   struct platform_device  *pdev;
+   struct device   *dev;
 
void __iomem*regs;
 
@@ -83,11 +83,10 @@ static struct clk *meson8b_dwmac_register_clk(struct 
meson8b_dwmac *dwmac,
  const struct clk_ops *ops,
  struct clk_hw *hw)
 {
-   struct device *dev = >pdev->dev;
struct clk_init_data init;
char clk_name[32];
 
-   snprintf(clk_name, sizeof(clk_name), "%s#%s", dev_name(dev),
+   snprintf(clk_name, sizeof(clk_name), "%s#%s", dev_name(dwmac->dev),
 name_suffix);
 
init.name = clk_name;
@@ -98,14 +97,14 @@ static struct clk *meson8b_dwmac_register_clk(struct 
meson8b_dwmac *dwmac,
 
hw->init = 
 
-   return devm_clk_register(dev, hw);
+   return devm_clk_register(dwmac->dev, hw);
 }
 
 static int meson8b_init_rgmii_tx_clk(struct meson8b_dwmac *dwmac)
 {
int i, ret;
struct clk *clk;
-   struct device *dev = >pdev->dev;
+   struct device *dev = dwmac->dev;
const char *parent_name, *mux_parent_names[MUX_CLK_NUM_PARENTS];
 
/* get the mux parents from DT */
@@ -204,19 +203,19 @@ static int meson8b_init_prg_eth(struct meson8b_dwmac 
*dwmac)
 */
ret = clk_set_rate(dwmac->rgmii_tx_clk, 125 * 1000 * 1000);
if (ret) {
-   dev_err(>pdev->dev,
+   dev_err(dwmac->dev,
"failed to set RGMII TX clock\n");
return ret;
}
 
ret = clk_prepare_enable(dwmac->rgmii_tx_clk);
if (ret) {
-   dev_err(>pdev->dev,
+   dev_err(dwmac->dev,
"failed to enable the RGMII TX clock\n");
return ret;
}
 
-   devm_add_action_or_reset(>pdev->dev,
+   devm_add_action_or_reset(dwmac->dev,
(void(*)(void *))clk_disable_unprepare,
dwmac->rgmii_tx_clk);
break;
@@ -238,7 +237,7 @@ static int meson8b_init_prg_eth(struct meson8b_dwmac *dwmac)
break;
 
default:
-   dev_err(>pdev->dev, "unsupported phy-mode %s\n",
+   dev_err(dwmac->dev, "unsupported phy-mode %s\n",
phy_modes(dwmac->phy_mode));
return -EINVAL;
}
@@ -279,7 +278,7 @@ static int meson8b_dwmac_probe(struct platform_device *pdev)
goto err_remove_config_dt;
}
 
-   dwmac->pdev = pdev;
+   dwmac->dev = >dev;
dwmac->phy_mode = of_get_phy_mode(pdev->dev.of_node);
if (dwmac->phy_mode < 0) {
dev_err(>dev, "missing phy-mode property\n");
-- 
2.16.1



[net-next PATCH v1 0/3] dwmac-meson8b: small cleanup

2018-02-17 Thread Martin Blumenstingl
This is a follow-up to my previous series "dwmac-meson8b: clock fixes for
Meson8b" from [0].
during the review of that series it was found that the clock registration
could be simplified. now that the previous series has landed we can start
cleaning up the clock registration.

the goal of this series is to simplify the code in the dwmac-meson8b
driver. no functional changes are intended.

I have tested this on my Khadas VIM2 (GXM SoC, with RGMII PHY) and my
Endless Mini (EC-100, Meson8b SoC with RMII PHY, .dts support is not part
of mainline yet). no problems were found.


[0] http://lists.infradead.org/pipermail/linux-amlogic/2018-January/006143.html

Martin Blumenstingl (3):
  net: stmmac: dwmac-meson8b: simplify clock registration
  net: stmmac: dwmac-meson8b: only keep struct device around
  net: stmmac: dwmac-meson8b: make the clock configurations private

 .../net/ethernet/stmicro/stmmac/dwmac-meson8b.c| 208 +
 1 file changed, 92 insertions(+), 116 deletions(-)

-- 
2.16.1



[PATCH 3/3] ip: Allow rules to accept a specified protocol

2018-02-17 Thread Donald Sharp
Allow the specification of a protocol when the user
adds/modifies/deletes a rule.

Signed-off-by: Donald Sharp 
---
 ip/iprule.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/ip/iprule.c b/ip/iprule.c
index 5703d6e4..8fc6ac48 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -324,6 +324,12 @@ static int iprule_modify(int cmd, int argc, char **argv)
if (get_rt_realms(, *argv))
invarg("invalid realms\n", *argv);
addattr32(, sizeof(req), FRA_FLOW, realm);
+   } else if (matches(*argv, "protocol") == 0) {
+   __u32 proto;
+   NEXT_ARG();
+   if (rtnl_rtprot_a2n(, *argv))
+   invarg("\"protocol\" value is invalid\n", 
*argv);
+   req.frh.proto = proto;
} else if (matches(*argv, "table") == 0 ||
   strcmp(*argv, "lookup") == 0) {
NEXT_ARG();
-- 
2.14.3



[PATCH 1/3] ip: Use the `struct fib_rule_hdr` for rules

2018-02-17 Thread Donald Sharp
The iprule.c code was using `struct rtmsg` as the data
type to pass into the kernel for the netlink message.
While 'struct rtmsg' and `struct fib_rule_hdr` are
the same size and mostly the same, we should use
the correct data structure.  This commit translates
the data structures to have iprule.c use the correct
one.

Additionally copy over the modified fib_rules.h file

Signed-off-by: Donald Sharp 
---
 include/linux/fib_rules.h |   2 +-
 ip/iprule.c   | 105 --
 2 files changed, 56 insertions(+), 51 deletions(-)

diff --git a/include/linux/fib_rules.h b/include/linux/fib_rules.h
index bbf02a63..21f1fbf3 100644
--- a/include/linux/fib_rules.h
+++ b/include/linux/fib_rules.h
@@ -22,8 +22,8 @@ struct fib_rule_hdr {
__u8tos;
 
__u8table;
+   __u8proto;  /* reserved */
__u8res1;   /* reserved */
-   __u8res2;   /* reserved */
__u8action;
 
__u32   flags;
diff --git a/ip/iprule.c b/ip/iprule.c
index 854a3d8e..82e22fee 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -50,7 +50,7 @@ static void usage(void)
 int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 {
FILE *fp = (FILE*)arg;
-   struct rtmsg *r = NLMSG_DATA(n);
+   struct fib_rule_hdr *frh = NLMSG_DATA(n);
int len = n->nlmsg_len;
int host_len = -1;
__u32 table;
@@ -61,13 +61,13 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
if (n->nlmsg_type != RTM_NEWRULE && n->nlmsg_type != RTM_DELRULE)
return 0;
 
-   len -= NLMSG_LENGTH(sizeof(*r));
+   len -= NLMSG_LENGTH(sizeof(*frh));
if (len < 0)
return -1;
 
-   parse_rtattr(tb, FRA_MAX, RTM_RTA(r), len);
+   parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len);
 
-   host_len = af_bit_len(r->rtm_family);
+   host_len = af_bit_len(frh->family);
 
if (n->nlmsg_type == RTM_DELRULE)
fprintf(fp, "Deleted ");
@@ -77,51 +77,51 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
else
fprintf(fp, "0:\t");
 
-   if (r->rtm_flags & FIB_RULE_INVERT)
+   if (frh->flags & FIB_RULE_INVERT)
fprintf(fp, "not ");
 
if (tb[FRA_SRC]) {
-   if (r->rtm_src_len != host_len) {
-   fprintf(fp, "from %s/%u ", rt_addr_n2a(r->rtm_family,
+   if (frh->src_len != host_len) {
+   fprintf(fp, "from %s/%u ", rt_addr_n2a(frh->family,
   RTA_PAYLOAD(tb[FRA_SRC]),
   RTA_DATA(tb[FRA_SRC]),
   abuf, sizeof(abuf)),
-   r->rtm_src_len
+   frh->src_len
);
} else {
-   fprintf(fp, "from %s ", format_host(r->rtm_family,
+   fprintf(fp, "from %s ", format_host(frh->family,
   RTA_PAYLOAD(tb[FRA_SRC]),
   RTA_DATA(tb[FRA_SRC]),
   abuf, sizeof(abuf))
);
}
-   } else if (r->rtm_src_len) {
-   fprintf(fp, "from 0/%d ", r->rtm_src_len);
+   } else if (frh->src_len) {
+   fprintf(fp, "from 0/%d ", frh->src_len);
} else {
fprintf(fp, "from all ");
}
 
if (tb[FRA_DST]) {
-   if (r->rtm_dst_len != host_len) {
-   fprintf(fp, "to %s/%u ", rt_addr_n2a(r->rtm_family,
+   if (frh->dst_len != host_len) {
+   fprintf(fp, "to %s/%u ", rt_addr_n2a(frh->family,
   RTA_PAYLOAD(tb[FRA_DST]),
   RTA_DATA(tb[FRA_DST]),
   abuf, sizeof(abuf)),
-   r->rtm_dst_len
+   frh->dst_len
);
} else {
-   fprintf(fp, "to %s ", format_host(r->rtm_family,
+   fprintf(fp, "to %s ", format_host(frh->family,
   RTA_PAYLOAD(tb[FRA_DST]),
   RTA_DATA(tb[FRA_DST]),
   abuf, sizeof(abuf)));
}
-   } else if (r->rtm_dst_len) {
-   fprintf(fp, "to 0/%d ", r->rtm_dst_len);
+   } else if (frh->dst_len) {
+   fprintf(fp, "to 0/%d ", 

[PATCH 0/3] Allow 'ip rule' command to use protocol

2018-02-17 Thread Donald Sharp
Fix iprule.c to use the actual `struct fib_rule_hdr` and to
allow the end user to see and use the protocol keyword
for rule manipulations.

Donald Sharp (3):
  ip: Use the `struct fib_rule_hdr` for rules
  ip: Display ip rule protocol used
  ip: Allow rules to accept a specified protocol

 include/linux/fib_rules.h |   2 +-
 ip/iprule.c   | 114 ++
 2 files changed, 65 insertions(+), 51 deletions(-)

-- 
2.14.3



[PATCH 2/3] ip: Display ip rule protocol used

2018-02-17 Thread Donald Sharp
Newer kernels are now accepting a protocol from the installing
program for who installed the rule.  This change allows us
to see this change if it is being specified by the installing
program.

Signed-off-by: Donald Sharp 
---
 ip/iprule.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/ip/iprule.c b/ip/iprule.c
index 82e22fee..5703d6e4 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -213,6 +213,9 @@ int print_rule(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
else if (frh->action != RTN_UNICAST)
fprintf(fp, "%s", rtnl_rtntype_n2a(frh->action, b1, 
sizeof(b1)));
 
+   if (frh->proto != RTPROT_UNSPEC)
+   fprintf(fp, " proto %s ",
+   rtnl_rtprot_n2a(frh->proto, b1, sizeof(b1)));
fprintf(fp, "\n");
fflush(fp);
return 0;
-- 
2.14.3



Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Harald Welte
Hi Daniel,

On Fri, Feb 16, 2018 at 02:40:19PM +0100, Daniel Borkmann wrote:
> This is a very rough and early proof of concept that implements bpfilter.
> The basic idea of bpfilter is that it can process iptables queries and
> translate them in user space into BPF programs which can then get attached
> at various locations. 

Interesting approach.  My first question would be what the goal of all of this
is.  For sure, one can implement many different things, but what is the use
case, and why do it this way?

I see several possible areas of contention:

1) If you aim for a non-feature-complete support of iptables rules, it
   will create confusion to the users.  When users use "iptables", they have
   assumptions on what it will do and how it will behave.  One can of course
   replace / refactor the internal implementation, if the resulting behavior
   is identical.  And that means rules are executed at the same hooks in the 
stack,
   with functionally identical matches and targets, provide the same
   counter semantics, etc.  But if the behavior is different, and/or the
   provided functionality is different, then why "hide" this new
   filtering technology behind iptables, rather than its own command
   line tool?  Such an alternative tool could share the same command
   line syntax as iptables, or even provide a converter/wrapper, but
   given that it would not be called "iptables" people will implicitly
   have different assumptions about it

2) Why try to provide compatibility to iptables, when at the same time
   many people have already migrated to (or are in the process of
   migrating) to nftables?  By using iptables semantics, structures,
   architecture, you risk perpetuating the design mistakes we made in
   iptables some 18 years ago for another decade or more.  From my POV,
   if one was to do eBPF optimized rule execution, it should be based on
   nftables rather than iptables.  This way you avoid the many
   architectural problems, such as
   * no incremental rule changes but only atomic swap of an entire table
 with all its chains
   * no common/shared rulesets for IPv4 + IPv6, which is very clumsy and
 often worked around with ugly shellscript wrappers in userspace
 which then call both iptables and ip6tables to add a rule to both
 rulesets.

> The user space iptables binary issuing rule addition or dumps was
> left as-is, thus at some point any binaries against iptables uapi kernel
> interface could transparently be supported in such manner in long term.

See my comments above:  In the netfilter community, we know for at least
a decade or more about the many problems of the old iptables userspace
interface.  For many years, a much better replacement has been designed
as part of nftables.

> As rule translation can potentially become very complex, this is performed
> entirely in user space. In order to ease deployment, request_module() code
> is extended to allow user mode helpers to be invoked. Idea is that user mode
> helpers are built as part of the kernel build and installed as traditional
> kernel modules with .ko file extension into distro specified location,
> such that from a distribution point of view, they are no different than
> regular kernel modules. 

That just blew my mind, sorry :)  This goes much beyond
netfilter/iptables, and adds some quiet singificant new piece of
kernel/userspace infrastructure.  To me, my apologies, it just sounds
like a quite strange hack.  But then, I may lack the vision of how this
might be useful in other contexts.

I'm trying to understand why exactly one would
* use a 18 year old iptables userspace program with its equally old
  setsockopt based interface between kernel and userspace
* insert an entire table with many chains of rules into the kernel
* re-eject that ruleset into another userspace program which then
  compiles it into an eBPF program
* inserert that back into the kernel

To me, this looks like some kind of legacy backwards compatibility
mechanism that one would find in proprietary operating systems, but not
in Linux.  iptables, libiptc etc. are all free software.  The source
code can be edited, and you could just as well have a new version of
iptables and/or libiptc which would pass the ruleset in userspace to
your compiler, which would then insert the resulting eBPF program.

You could even have a LD_PRELOAD wrapper doing the same.  That one
would even work with direct users of the iptables setsockopt inteerface.

Why add quite comprehensive kerne infrastructure?  What's the motivation
here?

> Thus, allow request_module() logic to load such
> user mode helper (umh) binaries via:
> 
>   request_module("foo") ->
> call_umh("modprobe foo") ->
>   sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)
> 
> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack 

Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Harald Welte
Hi Daniel,

On Fri, Feb 16, 2018 at 09:44:01PM +0100, Daniel Borkmann wrote:
> We started out with the
> iptables part in the demo as the majority of bigger infrastructure projects
> all still rely heavily on it (e.g. docker, k8s to just name two big ones).

docker is exec'ing the iptables command line program.  So one could simply
offer a syntactically compatible userspace replacement that does the compilation
in userspce and avoid the iptables->libiptc->setsockopt->userspace roundtrip
and the associated changes to the kernel module loader you introduced.

kubernetes is using iptables-restore, which is part of iptables and
again has the same syntax.  However, it aovids the per-rule fork+exec
overhead, which is why the netfilter project has been recommending it to
be used in such situations.

Do you have a list of known projects that use the legacy sockopt-based
iptables uapi directly, without using code from the iptables.git
codebase (e.g. libiptc, iptables or iptables-restore)?  IMHO only
those projects would benefit from the approach you have taken vs. an
approach that simply offers a compatible commandline syntax.

> Usually they have their requests to iptables baked into their code directly
> which probably won't change any time soon, so thought was that they could
> benefit initially from it once there would be sufficient coverage.

If the binary offeers the same syntax (it could even be a fork/version
of the iptables codebase, only using the parsing without the existing
backend generating the ruleS), the same goal could be achieved.

The above of course assumes that you have a 100% functional replacement
(for 100% of the features that your use cases use) underneath the
"iptables command syntax" compatibility.  But you need that in both
cases, whether you use the existing userspace api or not.

Regards,
Harald
-- 
- Harald Welte    http://laforge.gnumonks.org/

"Privacy in residential applications is a desirable marketing option."
  (ETSI EN 300 175-7 Ch. A6)


[PATCH 1/2] net: Allow a rule to track originating protocol

2018-02-17 Thread Donald Sharp
Allow a rule that is being added/deleted/modified or
dumped to contain the originating protocol's id.

The protocol is handled just like a routes originating
protocol is.  This is especially useful because there
is starting to be a plethora of different user space
programs adding rules.

Signed-off-by: Donald Sharp 
---
 include/net/fib_rules.h| 3 ++-
 include/uapi/linux/fib_rules.h | 2 +-
 net/core/fib_rules.c   | 7 ++-
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index 648caf90ec07..b166ef07e6d4 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -26,7 +26,8 @@ struct fib_rule {
u32 table;
u8  action;
u8  l3mdev;
-   /* 2 bytes hole, try to use */
+   u8  proto;
+   /* 1 byte hole, try to use */
u32 target;
__be64  tun_id;
struct fib_rule __rcu   *ctarget;
diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 2b642bf9b5a0..925539172d5b 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -23,8 +23,8 @@ struct fib_rule_hdr {
__u8tos;
 
__u8table;
+   __u8proto;
__u8res1;   /* reserved */
-   __u8res2;   /* reserved */
__u8action;
 
__u32   flags;
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 98e1066c3d55..c1d4ab5b2d9f 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -51,6 +51,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
r->pref = pref;
r->table = table;
r->flags = flags;
+   r->proto = RTPROT_KERNEL;
r->fr_net = ops->fro_net;
r->uid_range = fib_kuid_range_unset;
 
@@ -465,6 +466,7 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr 
*nlh,
}
refcount_set(>refcnt, 1);
rule->fr_net = net;
+   rule->proto = frh->proto;
 
rule->pref = tb[FRA_PRIORITY] ? nla_get_u32(tb[FRA_PRIORITY])
  : fib_default_rule_pref(ops);
@@ -664,6 +666,9 @@ int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr 
*nlh,
}
 
list_for_each_entry(rule, >rules_list, list) {
+   if (frh->proto && (frh->proto != rule->proto))
+   continue;
+
if (frh->action && (frh->action != rule->action))
continue;
 
@@ -808,9 +813,9 @@ static int fib_nl_fill_rule(struct sk_buff *skb, struct 
fib_rule *rule,
if (nla_put_u32(skb, FRA_SUPPRESS_PREFIXLEN, rule->suppress_prefixlen))
goto nla_put_failure;
frh->res1 = 0;
-   frh->res2 = 0;
frh->action = rule->action;
frh->flags = rule->flags;
+   frh->proto = rule->proto;
 
if (rule->action == FR_ACT_GOTO &&
rcu_access_pointer(rule->ctarget) == NULL)
-- 
2.14.3



[PATCH 2/2] drivers: Modify vrf device to specify it's rule as RTPROT_KERNEL

2018-02-17 Thread Donald Sharp
Allow the vrf device to specify that the kernel is the originator
of the rule created for this device.

Signed-off-by: Donald Sharp 
---
 drivers/net/vrf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 139c61c8244a..ec6d2d623b60 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -1175,6 +1175,7 @@ static int vrf_fib_rule(const struct net_device *dev, 
__u8 family, bool add_it)
memset(frh, 0, sizeof(*frh));
frh->family = family;
frh->action = FR_ACT_TO_TBL;
+   frh->proto = RTPROT_KERNEL;
 
if (nla_put_u8(skb, FRA_L3MDEV, 1))
goto nla_put_failure;
-- 
2.14.3



[PATCH 0/2] Allow rules to track originating protocol

2018-02-17 Thread Donald Sharp
Add the ability for the kernel to track the originating protocol
for when new rules are added to the kernel.

Donald Sharp (2):
  net: Allow a rule to track originating protocol
  drivers: Modify vrf device to specify it's rule as RTPROT_KERNEL

 drivers/net/vrf.c  | 1 +
 include/net/fib_rules.h| 3 ++-
 include/uapi/linux/fib_rules.h | 2 +-
 net/core/fib_rules.c   | 7 ++-
 4 files changed, 10 insertions(+), 3 deletions(-)

-- 
2.14.3



Re: [PATCH RFC 0/4] net: add bpfilter

2018-02-17 Thread Harald Welte
Hi David,

On Fri, Feb 16, 2018 at 05:33:54PM -0500, David Miller wrote:
> From: Florian Westphal 
> 
> > Any particular reason why translating iptables rather than nftables
> > (it should be possible to monitor the nftables changes that are
> >  announced by kernel and act on those)?
> 
> As Daniel said, iptables is by far the most deployed of the two
> technologies.  Therefore it provides the largest environment for
> testing and coverage.

As I outlined earlier, this way you are perpetuating the architectural
mistakes and constraints that were created ~ 18 years ago without any
benefit from the lessons learned ever since.  In netfilter, we already
wanted to replace it as early as 2006 (AFAIR) with nfnetlink based
pkttables (which never materialized).

I would strongly suggest to focus on nftables (or even some other way of
configuration / userspace interaction) to ensure that the iptables
userspace interface can at some point be phased out eventually.  Like we
did with ipchains before, and before that with ipfwadm.

By making a new implementation dependant on the oldest interface you are
perpetuating it.  Sure, one can go that way, but I would suggest this to
be a *very* carefully weighed decision after a detailed
analysis/discusison.

-- 
- Harald Welte    http://laforge.gnumonks.org/

"Privacy in residential applications is a desirable marketing option."
  (ETSI EN 300 175-7 Ch. A6)


Re: TCP and BBR: reproducibly low cwnd and bandwidth

2018-02-17 Thread Oleksandr Natalenko
Hi.

On pátek 16. února 2018 23:59:52 CET Eric Dumazet wrote:
> Well, no effect  here on e1000e (1 Gbit) at least
> 
> # ethtool -K eth3 sg off
> Actual changes:
> scatter-gather: off
> tx-scatter-gather: off
> tcp-segmentation-offload: off
> tx-tcp-segmentation: off [requested on]
> tx-tcp6-segmentation: off [requested on]
> generic-segmentation-offload: off [requested on]
> 
> # tc qd replace dev eth3 root pfifo_fast
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> 941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> 941
> # tc qd replace dev eth3 root fq
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> 941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> 941
> # tc qd replace dev eth3 root fq_codel
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> 941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> 941
> #

That really looks strange to me. I'm able to reproduce the effect caused by 
disabling scatter-gather even on the VM (using iperf3, as usual):

BBR+fq_codel:
sg on:  4.23 Gbits/sec
sg off: 121 Mbits/sec

BBR+fq:
sg on:  6.38 Gbits/sec
sg off: 437 Mbits/sec

Reno+fq_codel:
sg on:  6.74 Gbits/sec
sg off: 1.37 Gbits/sec

Reno+fq:
sg on:  6.53 Gbits/sec
sg off: 1.19 Gbits/sec

Regardless of which congestion algorithm and qdisc is in use, the throughput 
drops, but when BBR is in use, especially with something non-fq, it drops the 
most.

Oleksandr




Re: [PATCH v2] ravb: add support for changing MTU

2018-02-17 Thread Sergei Shtylyov

Hello!

On 2/16/2018 7:10 PM, Niklas Söderlund wrote:


Allow for changing the MTU within the limit of the maximum size of a
descriptor (2048 bytes). Add the callback to change MTU from user-space
and take the configurable MTU into account when configuring the
hardware.

Signed-off-by: Niklas Söderlund 
---
  drivers/net/ethernet/renesas/ravb.h  |  1 +
  drivers/net/ethernet/renesas/ravb_main.c | 34 +---
  2 files changed, 28 insertions(+), 7 deletions(-)

* Changes since v1
- Fix spelling error.
- s/le16_to_cpu(rx_desc->ds_cc)/priv->rx_buf_sz/.
- Drop ETH_FCS_LEN + 16 from rx_buf_sz calculation as pointed out by
   Sergei.

diff --git a/drivers/net/ethernet/renesas/ravb.h 
b/drivers/net/ethernet/renesas/ravb.h
index 96a27b00c90e212a..b81f4faf7b10114d 100644
--- a/drivers/net/ethernet/renesas/ravb.h
+++ b/drivers/net/ethernet/renesas/ravb.h
@@ -1018,6 +1018,7 @@ struct ravb_private {
u32 dirty_rx[NUM_RX_QUEUE]; /* Producer ring indices */
u32 cur_tx[NUM_TX_QUEUE];
u32 dirty_tx[NUM_TX_QUEUE];
+   u32 rx_buf_sz;  /* Based on MTU+slack. */
struct napi_struct napi[NUM_RX_QUEUE];
struct work_struct work;
/* MII transceiver section. */
diff --git a/drivers/net/ethernet/renesas/ravb_main.c 
b/drivers/net/ethernet/renesas/ravb_main.c
index c87f57ca44371586..34e841306e04a3d3 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -238,7 +238,7 @@ static void ravb_ring_free(struct net_device *ndev, int q)
   le32_to_cpu(desc->dptr)))
dma_unmap_single(ndev->dev.parent,
 le32_to_cpu(desc->dptr),
-PKT_BUF_SZ,
+priv->rx_buf_sz,
 DMA_FROM_DEVICE);
}
ring_size = sizeof(struct ravb_ex_rx_desc) *
@@ -300,9 +300,9 @@ static void ravb_ring_format(struct net_device *ndev, int q)
for (i = 0; i < priv->num_rx_ring[q]; i++) {
/* RX descriptor */
rx_desc = >rx_ring[q][i];
-   rx_desc->ds_cc = cpu_to_le16(PKT_BUF_SZ);
+   rx_desc->ds_cc = cpu_to_le16(priv->rx_buf_sz);
dma_addr = dma_map_single(ndev->dev.parent, 
priv->rx_skb[q][i]->data,
- PKT_BUF_SZ,
+ priv->rx_buf_sz,
  DMA_FROM_DEVICE);
/* We just set the data size to 0 for a failed mapping which
 * should prevent DMA from happening...
@@ -346,6 +346,10 @@ static int ravb_ring_init(struct net_device *ndev, int q)
int ring_size;
int i;
  
+	/* +16 gets room from the status from the card. */


   You removed the addition of 16 but left the comment intact. :-)
Will have to fix it up in a follow-up patch now...


+   priv->rx_buf_sz = (ndev->mtu <= 1492 ? PKT_BUF_SZ : ndev->mtu) +


   Not sure why PKT_BUF_SZ is 1538, sigh...


+   ETH_HLEN + VLAN_HLEN;
+
/* Allocate RX and TX skb rings */
priv->rx_skb[q] = kcalloc(priv->num_rx_ring[q],
  sizeof(*priv->rx_skb[q]), GFP_KERNEL);

[...]

MBR, Sergei