date:20180505

[PATCHv2 net] sctp: delay the authentication for the duplicated cookie-echo chunk

2018-05-05 Thread Xin Long

Now sctp only delays the authentication for the normal cookie-echo
chunk by setting chunk->auth_chunk in sctp_endpoint_bh_rcv(). But
for the duplicated one with auth, in sctp_assoc_bh_rcv(), it does
authentication first based on the old asoc, which will definitely
fail due to the different auth info in the old asoc.

The duplicated cookie-echo chunk will create a new asoc with the
auth info from this chunk, and the authentication should also be
done with the new asoc's auth info for all of the collision 'A',
'B' and 'D'. Otherwise, the duplicated cookie-echo chunk with auth
will never pass the authentication and create the new connection.

This issue exists since very beginning, and this fix is to make
sctp_assoc_bh_rcv() follow the way sctp_endpoint_bh_rcv() does
for the normal cookie-echo chunk to delay the authentication.

While at it, remove the unused params from sctp_sf_authenticate()
and define sctp_auth_chunk_verify() used for all the places that
do the delayed authentication.

v1->v2:
  fix the typo in changelog as Marcelo noticed.

Acked-by: Marcelo Ricardo Leitner 
Signed-off-by: Xin Long 
---
 net/sctp/associola.c| 30 -
 net/sctp/sm_statefuns.c | 86 ++---
 2 files changed, 75 insertions(+), 41 deletions(-)

diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 837806d..a47179d 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1024,8 +1024,9 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
struct sctp_endpoint *ep;
struct sctp_chunk *chunk;
struct sctp_inq *inqueue;
-   int state;
+   int first_time = 1; /* is this the first time through the loop */
int error = 0;
+   int state;
 
/* The association should be held so we should be safe. */
ep = asoc->ep;
@@ -1036,6 +1037,30 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
state = asoc->state;
subtype = SCTP_ST_CHUNK(chunk->chunk_hdr->type);
 
+   /* If the first chunk in the packet is AUTH, do special
+* processing specified in Section 6.3 of SCTP-AUTH spec
+*/
+   if (first_time && subtype.chunk == SCTP_CID_AUTH) {
+   struct sctp_chunkhdr *next_hdr;
+
+   next_hdr = sctp_inq_peek(inqueue);
+   if (!next_hdr)
+   goto normal;
+
+   /* If the next chunk is COOKIE-ECHO, skip the AUTH
+* chunk while saving a pointer to it so we can do
+* Authentication later (during cookie-echo
+* processing).
+*/
+   if (next_hdr->type == SCTP_CID_COOKIE_ECHO) {
+   chunk->auth_chunk = skb_clone(chunk->skb,
+ GFP_ATOMIC);
+   chunk->auth = 1;
+   continue;
+   }
+   }
+
+normal:
/* SCTP-AUTH, Section 6.3:
 *The receiver has a list of chunk types which it expects
 *to be received only after an AUTH-chunk.  This list has
@@ -1074,6 +1099,9 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
/* If there is an error on chunk, discard this packet. */
if (error && chunk)
chunk->pdiscard = 1;
+
+   if (first_time)
+   first_time = 0;
}
sctp_association_put(asoc);
 }
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 28c070e..c9ae340 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -153,10 +153,7 @@ static enum sctp_disposition sctp_sf_violation_chunk(
struct sctp_cmd_seq *commands);
 
 static enum sctp_ierror sctp_sf_authenticate(
-   struct net *net,
-   const struct sctp_endpoint *ep,
const struct sctp_association *asoc,
-   const union sctp_subtype type,
struct sctp_chunk *chunk);
 
 static enum sctp_disposition __sctp_sf_do_9_1_abort(
@@ -626,6 +623,38 @@ enum sctp_disposition sctp_sf_do_5_1C_ack(struct net *net,
return SCTP_DISPOSITION_CONSUME;
 }
 
+static bool sctp_auth_chunk_verify(struct net *net, struct sctp_chunk *chunk,
+  const struct sctp_association *asoc)
+{
+   struct sctp_chunk auth;
+
+   if (!chunk->auth_chunk)
+   return true;
+
+   /* SCTP-AUTH:  auth_chunk pointer is only set when the cookie-echo
+* is supposed to be authenticated and we have to do delayed
+* authentication.  We've just recreated

Re: [PATCH net] sctp: delay the authentication for the duplicated cookie-echo chunk

2018-05-05 Thread Xin Long

On Sat, May 5, 2018 at 6:33 AM, Marcelo Ricardo Leitner
 wrote:
> On Fri, May 04, 2018 at 05:05:10PM +0800, Xin Long wrote:
>> Now sctp only delays the authentication for the normal cookie-echo
>> chunk by setting chunk->auth_chunk in sctp_endpoint_bh_rcv(). But
>> for the duplicated one with auth, in sctp_assoc_bh_rcv(), it does
>> authentication first based on the old asoc, which will definitely
>> fail due to the different auth info in the old asoc.
>>
>> The duplicated cookie-echo chunk will create a new asoc with the
>> auth info from this chunk, and the authentication should also be
>> done with the new asoc's auth info for all of the collision 'A',
>> 'B' and 'D'. Otherwise, the duplicated cookie-echo chunk with auth
>> will never pass the authentication and create the new connection.
>>
>> This issue exists since very beginning, and this fix is to make
>> sctp_assoc_bh_rcv() follow the way sctp_assoc_bh_rcv() does for
>I guess you meant sctp_endpoint_bh_rcv here --^ right?
have posted v2, thanks.

>
> Other than this LGTM
>
>> the normal cookie-echo chunk to delay the authentication.
>>
>> While at it, remove the unused params from sctp_sf_authenticate()
>> and define sctp_auth_chunk_verify() used for all the places that
>> do the delayed authentication.
>>
>> Signed-off-by: Xin Long 
>> ---
>>  net/sctp/associola.c| 30 -
>>  net/sctp/sm_statefuns.c | 86 
>> ++---
>>  2 files changed, 75 insertions(+), 41 deletions(-)
>>
>> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
>> index 837806d..a47179d 100644
>> --- a/net/sctp/associola.c
>> +++ b/net/sctp/associola.c
>> @@ -1024,8 +1024,9 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
>>   struct sctp_endpoint *ep;
>>   struct sctp_chunk *chunk;
>>   struct sctp_inq *inqueue;
>> - int state;
>> + int first_time = 1; /* is this the first time through the loop */
>>   int error = 0;
>> + int state;
>>
>>   /* The association should be held so we should be safe. */
>>   ep = asoc->ep;
>> @@ -1036,6 +1037,30 @@ static void sctp_assoc_bh_rcv(struct work_struct 
>> *work)
>>   state = asoc->state;
>>   subtype = SCTP_ST_CHUNK(chunk->chunk_hdr->type);
>>
>> + /* If the first chunk in the packet is AUTH, do special
>> +  * processing specified in Section 6.3 of SCTP-AUTH spec
>> +  */
>> + if (first_time && subtype.chunk == SCTP_CID_AUTH) {
>> + struct sctp_chunkhdr *next_hdr;
>> +
>> + next_hdr = sctp_inq_peek(inqueue);
>> + if (!next_hdr)
>> + goto normal;
>> +
>> + /* If the next chunk is COOKIE-ECHO, skip the AUTH
>> +  * chunk while saving a pointer to it so we can do
>> +  * Authentication later (during cookie-echo
>> +  * processing).
>> +  */
>> + if (next_hdr->type == SCTP_CID_COOKIE_ECHO) {
>> + chunk->auth_chunk = skb_clone(chunk->skb,
>> +   GFP_ATOMIC);
>> + chunk->auth = 1;
>> + continue;
>> + }
>> + }
>> +
>> +normal:
>>   /* SCTP-AUTH, Section 6.3:
>>*The receiver has a list of chunk types which it expects
>>*to be received only after an AUTH-chunk.  This list has
>> @@ -1074,6 +1099,9 @@ static void sctp_assoc_bh_rcv(struct work_struct *work)
>>   /* If there is an error on chunk, discard this packet. */
>>   if (error && chunk)
>>   chunk->pdiscard = 1;
>> +
>> + if (first_time)
>> + first_time = 0;
>>   }
>>   sctp_association_put(asoc);
>>  }
>> diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
>> index 28c070e..c9ae340 100644
>> --- a/net/sctp/sm_statefuns.c
>> +++ b/net/sctp/sm_statefuns.c
>> @@ -153,10 +153,7 @@ static enum sctp_disposition sctp_sf_violation_chunk(
>>   struct sctp_cmd_seq *commands);
>>
>>  static enum sctp_ierror sctp_sf_authenticate(
>> - struct net *net,
>> - const struct sctp_endpoint *ep,
>>   const struct sctp_association *asoc,
>> - const union sctp_subtype type,
>>   struct sctp_chunk *chunk);
>>
>>  static enum sctp_disposition __sctp_sf_do_9_1_abort(
>> @@ -626,6 +623,38 @@ enum sctp_disposition sctp_sf_do_5_1C_ack(struct net 
>> *net,
>>   return SCTP_DISPOSITION_CONSUME;
>>  }
>>
>> +static bool sctp_auth_chunk_verify(struct net *net, struct sctp_chunk 
>> *chunk,
>> +

Re: [RFC PATCH 2/2] net: mac808211: mac802154: use lockdep_assert_in_softirq() instead own warning

2018-05-05 Thread Ingo Molnar


* Sebastian Andrzej Siewior  wrote:

> On 2018-05-04 20:51:32 [+0200], Peter Zijlstra wrote:
> > On Fri, May 04, 2018 at 08:45:39PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2018-05-04 20:32:49 [+0200], Peter Zijlstra wrote:
> > > > On Fri, May 04, 2018 at 07:51:44PM +0200, Sebastian Andrzej Siewior 
> > > > wrote:
> > > > > From: Anna-Maria Gleixner 
> > > > > 
> > > > > The warning in ieee802154_rx() and ieee80211_rx_napi() is there to 
> > > > > ensure
> > > > > the softirq context for the subsequent netif_receive_skb() call. 
> > > > 
> > > > That's not in fact what it does though; so while that might indeed be
> > > > the intent that's not what it does.
> > > 
> > > It was introduced in commit d20ef63d3246 ("mac80211: document
> > > ieee80211_rx() context requirement"):
> > > 
> > > mac80211: document ieee80211_rx() context requirement
> > > 
> > > ieee80211_rx() must be called with softirqs disabled
> > 
> > softirqs disabled, ack that is exactly what it checks.
> > 
> > But afaict the assertion you introduced tests that we are _in_ softirq
> > context, which is not the same.
> 
> indeed, now it clicked. Given what I wrote in the cover letter would you
> be in favour of (a proper) lockdep_assert_BH_disabled() or the cheaper
> local_bh_enable() (assuming the network folks don't mind the cheaper
> version)?

BTW., going by the hardirq variant nomenclature:

lockdep_assert_irqs_enabled();

... the proper name would not be lockdep_assert_BH_disabled(), but:

lockdep_assert_softirqs_disabled();

Thanks,

Ingo

Re: [net-next PATCH v2 2/8] udp: Verify that pulling UDP header in GSO segmentation doesn't fail

2018-05-05 Thread Willem de Bruijn

On Fri, May 4, 2018 at 8:29 PM, Alexander Duyck
 wrote:
> From: Alexander Duyck 
>
> We should verify that we can pull the UDP header before we attempt to do
> so. Otherwise if this fails we have no way of knowing and GSO will not work
> correctly.

This already happened in the callers udp[46]_ufo_fragment

Re: [net-next PATCH v2 3/8] udp: Do not pass MSS as parameter to GSO segmentation

2018-05-05 Thread Willem de Bruijn

On Fri, May 4, 2018 at 10:08 PM, Eric Dumazet  wrote:
>
>
> On 05/04/2018 11:29 AM, Alexander Duyck wrote:
>> From: Alexander Duyck 
>>
>> There is no point in passing MSS as a parameter for for the GSO
>> segmentation call as it is already available via the shared info for the
>> skb itself.
>>
>> Signed-off-by: Alexander Duyck 
>> ---
>
> Reviewed-by: Eric Dumazet 

Acked-by: Willem de Bruijn

Re: [net-next PATCH v2 1/8] udp: Record gso_segs when supporting UDP segmentation offload

2018-05-05 Thread Willem de Bruijn

On Fri, May 4, 2018 at 8:28 PM, Alexander Duyck
 wrote:
> From: Alexander Duyck 
>
> We need to record the number of segments that will be generated when this
> frame is segmented. The expectation is that if gso_size is set then
> gso_segs is set as well. Without this some drivers such as ixgbe get
> confused if they attempt to offload this as they record 0 segments for the
> entire packet instead of the correct value.
>
> Reviewed-by: Eric Dumazet 
> Signed-off-by: Alexander Duyck 
> ---
>  net/ipv4/udp.c |2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index dd3102a37ef9..e07db83b311e 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -793,6 +793,8 @@ static int udp_send_skb(struct sk_buff *skb, struct 
> flowi4 *fl4,
>
> skb_shinfo(skb)->gso_size = cork->gso_size;
> skb_shinfo(skb)->gso_type = SKB_GSO_UDP_L4;
> +   skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(len - sizeof(uh),
> +cork->gso_size);
> goto csum_partial;
> }

Acked-by: Willem de Bruijn 

But we have to be careful that both UDP GSO and UFO packets can traverse
qdisc_pkt_len_init. This does seem to fix the UDP GSO case, as it adds the
header size to each segment. But that is odd, as the code precedes UDP
GSO, so must have been intended to estimate UFO size on the wire. Only
external sources can generate UFO. It seems like we need to not add
sizeof(struct udphdr) to hdrlen in the SKB_GSO_DODGY branch.

Re: [net-next PATCH v2 5/8] udp: Partially unroll handling of first segment and last segment

2018-05-05 Thread Willem de Bruijn

On Fri, May 4, 2018 at 8:30 PM, Alexander Duyck
 wrote:
> From: Alexander Duyck 
>
> This patch allows us to take care of unrolling the first segment and the
> last segment

Only the last segment needs to be unrolled, right?

> of the loop for processing the segmented skb. Part of the
> motivation for this is that it makes it easier to process the fact that the
> first fame and all of the frames in between should be mostly identical
> in terms of header data, and the last frame has differences in the length
> and partial checksum.
>
> In addition I am dropping the header length calculation since we don't
> really need it for anything but the last frame and it can be easily
> obtained by just pulling the data_len and offset of tail from the transport
> header.
>
> Signed-off-by: Alexander Duyck 
> ---
>
> v2: New break-out patch based on one patch from earlier series
>
>  net/ipv4/udp_offload.c |   35 ---
>  1 file changed, 20 insertions(+), 15 deletions(-)
>
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 5c4bb8b9e83e..946d06d2aa0c 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -193,7 +193,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
> struct sk_buff *seg, *segs = ERR_PTR(-EINVAL);
> struct sock *sk = gso_skb->sk;
> unsigned int sum_truesize = 0;
> -   unsigned int hdrlen;
> struct udphdr *uh;
> unsigned int mss;
> __sum16 check;
> @@ -206,7 +205,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
> if (!pskb_may_pull(gso_skb, sizeof(*uh)))
> goto out;
>
> -   hdrlen = gso_skb->data - skb_mac_header(gso_skb);
> skb_pull(gso_skb, sizeof(*uh));
>
> /* clear destructor to avoid skb_segment assigning it to tail */
> @@ -219,7 +217,8 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
> return segs;
> }
>
> -   uh = udp_hdr(segs);
> +   seg = segs;
> +   uh = udp_hdr(seg);
>
> /* compute checksum adjustment based on old length versus new */
> newlen = htons(sizeof(*uh) + mss);
> @@ -227,25 +226,31 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
> *gso_skb,
> ((__force u32)uh->len ^ 0x) +
> (__force u32)newlen));
>
> -   for (seg = segs; seg; seg = seg->next) {
> -   uh = udp_hdr(seg);
> +   for (;;) {
> +   seg->destructor = sock_wfree;
> +   seg->sk = sk;
> +   sum_truesize += seg->truesize;
>
> -   /* last packet can be partial gso_size */
> -   if (!seg->next) {
> -   newlen = htons(seg->len - hdrlen);
> -   check = ~csum_fold((__force __wsum)((__force 
> u32)uh->check +
> -   ((__force 
> u32)uh->len ^ 0x) +
> -   (__force 
> u32)newlen));
> -   }
> +   if (!seg->next)
> +   break;

Not critical, but I find replacing

  for (seg = segs; seg; seg = seg->next) {
uh = udp_hdr(seg);
...
  }

with

  uh = udp_hdr(seg);
  for (;;) {
...
if (!seg->next)
  break;

uh = udp_hdr(seg);
  }

much less easy to parse and it inflates patch size. How about just

  - for (seg = segs; seg; seg = seg->next)
  + for( seg = segs; seg->next; seg = seg->next)


>
> uh->len = newlen;
> uh->check = check;
>
> -   seg->destructor = sock_wfree;
> -   seg->sk = sk;
> -   sum_truesize += seg->truesize;
> +   seg = seg->next;
> +   uh = udp_hdr(seg);
> }
>
> +   /* last packet can be partial gso_size, account for that in checksum 
> */
> +   newlen = htons(skb_tail_pointer(seg) - skb_transport_header(seg) +
> +  seg->data_len);
> +   check = ~csum_fold((__force __wsum)((__force u32)uh->check +
> +   ((__force u32)uh->len ^ 0x)  +
> +   (__force u32)newlen));
> +   uh->len = newlen;
> +   uh->check = check;
> +
> +   /* update refcount for the packet */
> refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_alloc);
>  out:
> return segs;
>

Re: [net-next PATCH v2 7/8] udp: Do not copy destructor if one is not present

2018-05-05 Thread Willem de Bruijn

On Fri, May 4, 2018 at 8:31 PM, Alexander Duyck
 wrote:
> From: Alexander Duyck 
>
> This patch makes it so that if a destructor is not present we avoid trying
> to update the skb socket or any reference counting that would be associated
> with the NULL socket and/or descriptor. By doing this we can support
> traffic coming from another namespace without any issues.
>
> Signed-off-by: Alexander Duyck 

Acked-by: Willem de Bruijn

BUG?: receiving on a packet socket with .sll_protocoll and bridging

2018-05-05 Thread Uwe Kleine-König

Hello,

my eventual goal is to implement MRP and for that I started to program a
bit and stumbled over a problem I don't understand.

For testing purposes I created a veth device pair (veth0 veth1), open a
socket for each of the devices and send packets around between them. In
tcpdump a typical package looks as follows:

10:36:34.755208 ae:a9:da:50:48:db (oui Unknown) > 01:15:e4:00:00:01 (oui 
Unknown), ethertype Unknown (0x88e3), length 58: 
0x:  0001 0212 8000 aea9 da50 48db    .PH.
0x0010:   0589 40f2 6574 6800     @.eth...
0x0020:   0100 0a80 3d38 4c5e ..=8L^..

The socket to receive these packages is opened using:

#define ETH_P_MRP 0x88e3

struct sockaddr_ll sa_ll = {
.sll_family = AF_PACKET,
.sll_protocol = htons(ETH_P_MRP),
.sll_ifindex = if_nametoindex("veth0")
};

fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_MRP));
bind(fd, (struct sockaddr *)&sa_ll, sizeof(sa_ll));

So far everything works fine and I can receive the packets I send.

If now I add veth0 to a bridge (e.g.

ip link add br0 type bridge
ip link set dev veth0 master br0

) and continue to send on veth1 and receive on veth0 I don't receive
the packets any more. The other direction (veth0 sending, veth1
receiving) still works fine. Each of the following changes allow to
receive again:

 a) take veth0 out of the bridge
 b) bind(2) the receiving socket to br0 instead of veth0
 c) use .sll_protocol = htons(ETH_P_ALL) for bind(2)

In the end only c) could be sensible (because I need to know the port
the packet entered the stack and that might well be bridged), but I
wonder why .sll_protocol = htons(ETH_P_MRP) seems to do the right thing
for an unbridged device but not for a bridged one.

Is this a bug or a feature I don't understand?

If someone wants to reproduce this locally, I can simplify my program
and provide it here. I tested this on a Debian 4.15 kernel (x86), but
also with the same symptoms on an arm64 with 4.16 and a dsa switch.

Best regards
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | http://www.pengutronix.de/  |

Re: [PATCH 1/8] rhashtable: silence RCU warning in rhashtable_test.

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> print_ht in rhashtable_test calls rht_dereference() with neither
> RCU protection or the mutex.  This triggers an RCU warning.
> So take the mutex to silence the warning.
> 
> Signed-off-by: NeilBrown 

I don't think the mutex is actually needed but since we don't
have rht_dereference_raw and this is just test code:

Acked-by: Herbert Xu 
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 2/8] rhashtable: remove nulls_base and related code.

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> This "feature" is unused, undocumented, and untested and so
> doesn't really belong.  If a use for the nulls marker
> is found, all this code would need to be reviewed to
> ensure it works as required.  It would be just as easy to
> just add the code if/when it is needed instead.
> 
> This patch actually fixes a bug too.  The table resizing allows a
> table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
> any growth beyond 2^27 is wasteful an ineffective.
> 
> This patch result in NULLS_MARKER(0) being used for all chains,
> and leave the use of rht_is_a_null() to test for it.
> 
> Signed-off-by: NeilBrown 

I disagree.  This is a fundamental requirement for the use of
rhashtable in certain networking systems such as TCP/UDP.  So
we know that there will be a use for this.

As to the bug fix, please separate it out of the patch and resubmit.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 3/8] rhashtable: use cmpxchg() to protect ->future_tbl.

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> Rather than borrowing one of the bucket locks to
> protect ->future_tbl updates, use cmpxchg().
> This gives more freedom to change how bucket locking
> is implemented.
> 
> Signed-off-by: NeilBrown 

This looks nice.

> - spin_unlock_bh(old_tbl->locks);
> + rcu_assign_pointer(tmp, new_tbl);

Do we need this barrier since cmpxchg is supposed to provide memory
barrier semantics?

> + if (cmpxchg(&old_tbl->future_tbl, NULL, tmp) != NULL)
> + return -EEXIST;

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 4/8] rhashtable: fix race in nested_table_alloc()

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> If two threads run nested_table_alloc() at the same time
> they could both allocate a new table.
> Best case is that one of them will never be freed, leaking memory.
> Worst case is hat entry get stored there before it leaks,
> and the are lost from the table.
> 
> So use cmpxchg to detect the race and free the unused table.
> 
> Fixes: da20420f83ea ("rhashtable: Add nested tables")
> Cc: sta...@vger.kernel.org # 4.11+
> Signed-off-by: NeilBrown 

What about the spinlock that's meant to be held around this
operation?

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 5/8] rhashtable: remove rhashtable_walk_peek()

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> This function has a somewhat confused behavior that is not properly
> described by the documentation.
> Sometimes is returns the previous object, sometimes it returns the
> next one.
> Sometimes it changes the iterator, sometimes it doesn't.
> 
> This function is not currently used and is not worth keeping, so
> remove it.
> 
> Signed-off-by: NeilBrown 

I don't have a problem with this.  But it would be good to hear
from Tom Herbert who added this.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> rhashtable_try_insert() currently hold a lock on the bucket in
> the first table, while also locking buckets in subsequent tables.
> This is unnecessary and looks like a hold-over from some earlier
> version of the implementation.
> 
> As insert and remove always lock a bucket in each table in turn, and
> as insert only inserts in the final table, there cannot be any races
> that are not covered by simply locking a bucket in each table in turn.
> 
> When an insert call reaches that last table it can be sure that there
> is no match entry in any other table as it has searched them all, and
> insertion never happens anywhere but in the last table.  The fact that
> code tests for the existence of future_tbl while holding a lock on
> the relevant bucket ensures that two threads inserting the same key
> will make compatible decisions about which is the "last" table.
> 
> This simplifies the code and allows the ->rehash field to be
> discarded.
> We still need a way to ensure that a dead bucket_table is never
> re-linked by rhashtable_walk_stop().  This can be achieved by
> setting the ->size to 1.  This still allows lookup code to work (and
> correctly not find anything) but can never happen on an active bucket
> table (as the minimum size is 4).
> 
> Signed-off-by: NeilBrown 

I'm not convinced this is safe.  There can be multiple tables
in existence.  Unless you hold the lock on the first table, what
is to stop two parallel inserters each from inserting into their
"last" tables which may not be the same table?

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 6/8] rhashtable: further improve stability of rhashtable_walk

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> If the sequence:
>obj = rhashtable_walk_next(iter);
>rhashtable_walk_stop(iter);
>rhashtable_remove_fast(ht, &obj->head, params);
>rhashtable_walk_start(iter);
> 
>  races with another thread inserting or removing
>  an object on the same hash chain, a subsequent
>  rhashtable_walk_next() is not guaranteed to get the "next"
>  object. It is possible that an object could be
>  repeated, or missed.
> 
>  This can be made more reliable by keeping the objects in a hash chain
>  sorted by memory address.  A subsequent rhashtable_walk_next()
>  call can reliably find the correct position in the list, and thus
>  find the 'next' object.
> 
>  It is not possible (certainly not so easy) to achieve this with an
>  rhltable as keeping the hash chain in order is not so easy.  When the
>  first object with a given key is removed, it is replaced in the chain
>  with the next object with the same key, and the address of that
>  object may not be correctly ordered.
>  No current user of rhltable_walk_enter() calls
>  rhashtable_walk_start() more than once, so no current code
>  could benefit from a more reliable walk of rhltables.
> 
>  This patch only attempts to improve walks for rhashtables.
>  - a new object is always inserted after the last object with a
>smaller address, or at the start
>  - when rhashtable_walk_start() is called, it records that 'p' is not
>'safe', meaning that it cannot be dereferenced.  The revalidation
>that was previously done here is moved to rhashtable_walk_next()
>  - when rhashtable_walk_next() is called while p is not NULL and not
>safe, it walks the chain looking for the first object with an
>address greater than p and returns that.  If there is none, it moves
>to the next hash chain.
> 
> Signed-off-by: NeilBrown 

I'm a bit torn on this.  On the hand this is definitely an improvement
over the status quo.  On the other this does not work on rhltable and
we do have a way of fixing it for both rhashtable and rhltable.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 7/8] rhashtable: add rhashtable_walk_prev()

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> rhashtable_walk_prev() returns the object returned by
> the previous rhashtable_walk_next(), providing it is still in the
> table (or was during this grace period).
> This works even if rhashtable_walk_stop() and rhashtable_talk_start()
> have been called since the last rhashtable_walk_next().
> 
> If there have been no calls to rhashtable_walk_next(), or if the
> object is gone from the table, then NULL is returned.
> 
> This can usefully be used in a seq_file ->start() function.
> If the pos is the same as was returned by the last ->next() call,
> then rhashtable_walk_prev() can be used to re-establish the
> current location in the table.  If it returns NULL, then
> rhashtable_walk_next() should be used.
> 
> Signed-off-by: NeilBrown 

I will ack this if Tom is OK with replacing peek with it.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [net-next PATCH v2 4/8] udp: Do not pass checksum as a parameter to GSO segmentation

2018-05-05 Thread Willem de Bruijn

On Fri, May 4, 2018 at 8:30 PM, Alexander Duyck
 wrote:
> From: Alexander Duyck 
>
> This patch is meant to allow us to avoid having to recompute the checksum
> from scratch and have it passed as a parameter.
>
> Instead of taking that approach we can take advantage of the fact that the
> length that was used to compute the existing checksum is included in the
> UDP header. If we cancel that out by adding the value XOR with 0x we
> can then just add the new length in and fold that into the new result.
>
> I think this may be fixing a checksum bug in the original code as well
> since the checksum that was passed included the UDP header in the checksum
> computation, but then excluded it for the adjustment on the last frame. I
> believe this may have an effect on things in the cases where the two differ
> by bits that would result in things crossing the byte boundaries.

The replacement code, below, subtracts original payload size then adds
the new payload size. mss here excludes the udp header size.

> /* last packet can be partial gso_size */
> -   if (!seg->next)
> -   csum_replace2(&uh->check, htons(mss),
> - htons(seg->len - hdrlen - sizeof(*uh)));

Re: [net-next PATCH v2 0/8] UDP GSO Segmentation clean-ups

2018-05-05 Thread Willem de Bruijn

On Fri, May 4, 2018 at 8:28 PM, Alexander Duyck
 wrote:
> This patch set addresses a number of issues I found while sorting out
> enabling UDP GSO Segmentation support for ixgbe/ixgbevf. Specifically there
> were a number of issues related to the checksum and such that seemed to
> cause either minor irregularities or kernel panics in the case of the
> offload request being allowed to traverse between name spaces.

Were you able to traverse GSO packets between network namespace before
adding to NETIF_F_GSO_SOFTWARE? It does appear that veth includes
NETIF_F_GSO_ENCAP_ALL, which also allows GSO.

In either case, it should not be possible for GSO packets to arrive on a veth
device, as that can result in queuing the GSO packet to a recipient socket.
In this regard veth is like loopback and must exclude GSO support.

I'll take a look.

Re: [PATCH] net: dsa: drop some VLAs in switch.c

2018-05-05 Thread Salvatore Mesoraca

2018-03-13 21:06 GMT+01:00 Florian Fainelli :
> On 03/13/2018 12:58 PM, Vivien Didelot wrote:
>> Hi Salvatore,
>>
>> Salvatore Mesoraca  writes:
>>
>>> dsa_switch's num_ports is currently fixed to DSA_MAX_PORTS. So we avoid
>>> 2 VLAs[1] by using DSA_MAX_PORTS instead of ds->num_ports.
>>>
>>> [1] https://lkml.org/lkml/2018/3/7/621
>>>
>>> Signed-off-by: Salvatore Mesoraca 
>>
>> NAK.
>>
>> We are in the process to remove hardcoded limits such as DSA_MAX_PORTS
>> and DSA_MAX_SWITCHES, so we have to stick with ds->num_ports.
>
> Then this means that we need to allocate a bitmap from the heap, which
> sounds a bit superfluous and could theoretically fail... not sure which
> way is better, but bumping the size to DSA_MAX_PORTS definitively does
> help people working on enabling -Wvla.

Hi Florian,

Should I consider this patch still NAKed or not?
Should I resend the patch with some modifications?

Thank you,

Salvatore

Re: [PATCH 11/40] ipv6/flowlabel: simplify pid namespace lookup

2018-05-05 Thread Eric W. Biederman

Christoph Hellwig  writes:

> The shole seq_file sequence already operates under a single RCU lock pair,
> so move the pid namespace lookup into it, and stop grabbing a reference
> and remove all kinds of boilerplate code.

This is wrong.

Move task_active_pid_ns(current) from open to seq_start actually means
that the results if you pass this proc file between callers the results
will change.  So this breaks file descriptor passing.

Open is a bad place to access current.  In the middle of read/write is
broken.


In this particular instance looking up the pid namespace with
task_active_pid_ns was a personal brain fart.  What the code should be
doing (with an appropriate helper) is:

struct pid_namespace *pid_ns = inode->i_sb->s_fs_info;

Because each mount of proc is bound to a pid namespace.  Looking up the
pid namespace from the super_block is a much better way to go.

Eric



> Signed-off-by: Christoph Hellwig 
> ---
>  net/ipv6/ip6_flowlabel.c | 28 +---
>  1 file changed, 5 insertions(+), 23 deletions(-)
>
> diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
> index c05c4e82a7ca..a9f221d45ef9 100644
> --- a/net/ipv6/ip6_flowlabel.c
> +++ b/net/ipv6/ip6_flowlabel.c
> @@ -754,7 +754,10 @@ static struct ip6_flowlabel *ip6fl_get_idx(struct 
> seq_file *seq, loff_t pos)
>  static void *ip6fl_seq_start(struct seq_file *seq, loff_t *pos)
>   __acquires(RCU)
>  {
> + struct ip6fl_iter_state *state = ip6fl_seq_private(seq);
> +
>   rcu_read_lock_bh();
> + state->pid_ns = task_active_pid_ns(current);
>   return *pos ? ip6fl_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
>  }
>  
> @@ -810,36 +813,15 @@ static const struct seq_operations ip6fl_seq_ops = {
>  
>  static int ip6fl_seq_open(struct inode *inode, struct file *file)
>  {
> - struct seq_file *seq;
> - struct ip6fl_iter_state *state;
> - int err;
> -
> - err = seq_open_net(inode, file, &ip6fl_seq_ops,
> + return seq_open_net(inode, file, &ip6fl_seq_ops,
>  sizeof(struct ip6fl_iter_state));
> -
> - if (!err) {
> - seq = file->private_data;
> - state = ip6fl_seq_private(seq);
> - rcu_read_lock();
> - state->pid_ns = get_pid_ns(task_active_pid_ns(current));
> - rcu_read_unlock();
> - }
> - return err;
> -}
> -
> -static int ip6fl_seq_release(struct inode *inode, struct file *file)
> -{
> - struct seq_file *seq = file->private_data;
> - struct ip6fl_iter_state *state = ip6fl_seq_private(seq);
> - put_pid_ns(state->pid_ns);
> - return seq_release_net(inode, file);
>  }
>  
>  static const struct file_operations ip6fl_seq_fops = {
>   .open   =   ip6fl_seq_open,
>   .read   =   seq_read,
>   .llseek =   seq_lseek,
> - .release=   ip6fl_seq_release,
> + .release=   seq_release_net,
>  };
>  
>  static int __net_init ip6_flowlabel_proc_init(struct net *net)

Re: [PATCH 34/40] atm: simplify procfs code

2018-05-05 Thread Eric W. Biederman

Christoph Hellwig  writes:

> Use remove_proc_subtree to remove the whole subtree on cleanup, and
> unwind the registration loop into individual calls.  Switch to use
> proc_create_seq where applicable.

Can you please explain why you are removing the error handling when
you are unwinding the registration loop?

>  int __init atm_proc_init(void)
>  {
> - static struct atm_proc_entry *e;
> - int ret;
> -
>   atm_proc_root = proc_net_mkdir(&init_net, "atm", init_net.proc_net);
>   if (!atm_proc_root)
> - goto err_out;
> - for (e = atm_proc_ents; e->name; e++) {
> - struct proc_dir_entry *dirent;
> -
> - dirent = proc_create(e->name, 0444,
> -  atm_proc_root, e->proc_fops);
> - if (!dirent)
> - goto err_out_remove;
> - e->dirent = dirent;
> - }
> - ret = 0;
> -out:
> - return ret;
> -
> -err_out_remove:
> - atm_proc_dirs_remove();
> -err_out:
> - ret = -ENOMEM;
> - goto out;
> + return -ENOMEM;
> + proc_create_seq("devices", 0444, atm_proc_root, &atm_dev_seq_ops);
> + proc_create("pvc", 0444, atm_proc_root, &pvc_seq_fops);
> + proc_create("svc", 0444, atm_proc_root, &svc_seq_fops);
> + proc_create("vc", 0444, atm_proc_root, &vcc_seq_fops);
> + return 0;
>  }

These proc entries could fail to register with -ENOMEM if for no other
reason.  Can you justify the removal of the error handling?

Can you please at least mention that removal in your change description
and explain why it is reasonable.

As it sits this is not a semantics preserving transformation, and the
difference is not documented.  Which raises red flags for me.

Eric

Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP

2018-05-05 Thread Julian Anastasov

Hello,

On Thu, 3 May 2018, Martin KaFai Lau wrote:

> > - when exactly we start to use the new PMTU, eg. what happens
> > in case socket caches the route, whether route is killed via
> > dst->obsolete. Or may be while the PMTU expiration is handled
> > per-packet, the PMTU change is noticed only on ICMP...
> Before sk can reuse its dst cache, the sk will notice
> its dst cache is no longer valid by calling dst_check().
> dst_check() should return NULL which is one of the side
> effect of the earlier update_pmtu().  This dst_check()
> is usually only called when the sk needs to do output,
> so the new PMTU route (i.e. the RTF_CACHE IPv6 route)
> only have effect to the later packets.

I checked again the code and it looks like sockets
are forced to use new exceptional route (RTF_CACHE/fnhe) via
dst_check only when the PMTU update should move them away
from old non-exceptional routes. Later, if PMTU is
reduced/updated this is noticed for every packet via dst_mtu,
as in the case with TCP.

So, except the RTF_LOCAL check in __ip6_rt_update_pmtu
we should have no other issues. Only one minor bit is strange to me,
why rt6_insert_exception warns for RTF_PCPU if rt6_cache_allowed_for_pmtu
allows it when returning true... 

Also, commit 0d3f6d297bfb allows rt6_do_update_pmtu() for
routes without RTF_CACHE, RTF_PCPU and rt6i_node. Should we
restrict rt6_do_update_pmtu only to RTF_CACHE routes?

if (!rt6_cache_allowed_for_pmtu(rt6)) {
-   rt6_do_update_pmtu(rt6, mtu);
-   /* update rt6_ex->stamp for cache */
-   if (rt6->rt6i_flags & RTF_CACHE)
+   if (rt6->rt6i_flags & RTF_CACHE) {
+   rt6_do_update_pmtu(rt6, mtu);
+   /* update rt6_ex->stamp for cache */
rt6_update_exception_stamp_rt(rt6);
+   }
} else if (daddr) {

Regards

--
Julian Anastasov

Re: [PATCH 38/40] ide: remove ide_driver_proc_write

2018-05-05 Thread Eric W. Biederman

Christoph Hellwig  writes:

> The driver proc file hasn't been writeable for a long time, so this is
> just dead code.

It is possible to chmod this file to get at the write method.  Not that
I think anyone does.

It looks like this code was merged in 2.3.99-pre1 with permissions
S_IFREG|S_IRUGO so I don't think the write support was ever finished.

That cap_capable in the write method looks down right scary/buggy.

Acked-by: "Eric W. Biederman" 

Eric



>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/ide/ide-proc.c | 46 --
>  1 file changed, 46 deletions(-)
>
> diff --git a/drivers/ide/ide-proc.c b/drivers/ide/ide-proc.c
> index 863db44c7916..b3b8b8822d6a 100644
> --- a/drivers/ide/ide-proc.c
> +++ b/drivers/ide/ide-proc.c
> @@ -528,58 +528,12 @@ static int ide_driver_proc_open(struct inode *inode, 
> struct file *file)
>   return single_open(file, ide_driver_proc_show, PDE_DATA(inode));
>  }
>  
> -static int ide_replace_subdriver(ide_drive_t *drive, const char *driver)
> -{
> - struct device *dev = &drive->gendev;
> - int ret = 1;
> - int err;
> -
> - device_release_driver(dev);
> - /* FIXME: device can still be in use by previous driver */
> - strlcpy(drive->driver_req, driver, sizeof(drive->driver_req));
> - err = device_attach(dev);
> - if (err < 0)
> - printk(KERN_WARNING "IDE: %s: device_attach error: %d\n",
> - __func__, err);
> - drive->driver_req[0] = 0;
> - if (dev->driver == NULL) {
> - err = device_attach(dev);
> - if (err < 0)
> - printk(KERN_WARNING
> - "IDE: %s: device_attach(2) error: %d\n",
> - __func__, err);
> - }
> - if (dev->driver && !strcmp(dev->driver->name, driver))
> - ret = 0;
> -
> - return ret;
> -}
> -
> -static ssize_t ide_driver_proc_write(struct file *file, const char __user 
> *buffer,
> -  size_t count, loff_t *pos)
> -{
> - ide_drive_t *drive = PDE_DATA(file_inode(file));
> - char name[32];
> -
> - if (!capable(CAP_SYS_ADMIN))
> - return -EACCES;
> - if (count > 31)
> - count = 31;
> - if (copy_from_user(name, buffer, count))
> - return -EFAULT;
> - name[count] = '\0';
> - if (ide_replace_subdriver(drive, name))
> - return -EINVAL;
> - return count;
> -}
> -
>  static const struct file_operations ide_driver_proc_fops = {
>   .owner  = THIS_MODULE,
>   .open   = ide_driver_proc_open,
>   .read   = seq_read,
>   .llseek = seq_lseek,
>   .release= single_release,
> - .write  = ide_driver_proc_write,
>  };
>  
>  static int ide_media_proc_show(struct seq_file *m, void *v)

Re: [PATCH] dt-bindings: can: rcar_can: Fix R8A7796 SoC name

2018-05-05 Thread Marc Kleine-Budde

On 05/03/2018 03:02 PM, Geert Uytterhoeven wrote:
> R8A7796 is R-Car M3-W.
> 
> Signed-off-by: Geert Uytterhoeven 

Applied to linux-can.

thnx,
Marc

-- 
Pengutronix e.K.  | Marc Kleine-Budde   |
Industrial Linux Solutions| Phone: +49-231-2826-924 |
Vertretung West/Dortmund  | Fax:   +49-5121-206917- |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |



signature.asc
Description: OpenPGP digital signature

Re: [PATCH] net/mlx5: Fix mlx5_get_vector_affinity function

2018-05-05 Thread Guenter Roeck

On Thu, Apr 12, 2018 at 09:49:11AM +, Israel Rukshin wrote:
> Adding the vector offset when calling to mlx5_vector2eqn() is wrong.
> This is because mlx5_vector2eqn() checks if EQ index is equal to vector number
> and the fact that the internal completion vectors that mlx5 allocates
> don't get an EQ index.
> 
> The second problem here is that using effective_affinity_mask gives the same
> CPU for different vectors.
> This leads to unmapped queues when calling it from blk_mq_rdma_map_queues().
> This doesn't happen when using affinity_hint mask.
> 
Except that affinity_hint is only defined if SMP is enabled. Without:

include/linux/mlx5/driver.h: In function ‘mlx5_get_vector_affinity_hint’:
include/linux/mlx5/driver.h:1299:13: error:
‘struct irq_desc’ has no member named ‘affinity_hint’

Note that this is the only use of affinity_hint outside kernel/irq.
Don't other drivers have similar problems ?

Guenter

> Fixes: 2572cf57d75a ("mlx5: fix mlx5_get_vector_affinity to start from 
> completion vector 0")
> Fixes: 05e0cc84e00c ("net/mlx5: Fix get vector affinity helper function")
> Signed-off-by: Israel Rukshin 
> Reviewed-by: Max Gurtovoy 
> Reviewed-by: Sagi Grimberg 
> ---
>  drivers/infiniband/hw/mlx5/main.c |  2 +-
>  include/linux/mlx5/driver.h   | 12 +++-
>  2 files changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mlx5/main.c 
> b/drivers/infiniband/hw/mlx5/main.c
> index daa919e5a442..241cf4ff9901 100644
> --- a/drivers/infiniband/hw/mlx5/main.c
> +++ b/drivers/infiniband/hw/mlx5/main.c
> @@ -4757,7 +4757,7 @@ mlx5_ib_get_vector_affinity(struct ib_device *ibdev, 
> int comp_vector)
>  {
>   struct mlx5_ib_dev *dev = to_mdev(ibdev);
>  
> - return mlx5_get_vector_affinity(dev->mdev, comp_vector);
> + return mlx5_get_vector_affinity_hint(dev->mdev, comp_vector);
>  }
>  
>  /* The mlx5_ib_multiport_mutex should be held when calling this function */
> diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
> index 767d193c269a..2a156c5dfadd 100644
> --- a/include/linux/mlx5/driver.h
> +++ b/include/linux/mlx5/driver.h
> @@ -1284,25 +1284,19 @@ enum {
>  };
>  
>  static inline const struct cpumask *
> -mlx5_get_vector_affinity(struct mlx5_core_dev *dev, int vector)
> +mlx5_get_vector_affinity_hint(struct mlx5_core_dev *dev, int vector)
>  {
> - const struct cpumask *mask;
>   struct irq_desc *desc;
>   unsigned int irq;
>   int eqn;
>   int err;
>  
> - err = mlx5_vector2eqn(dev, MLX5_EQ_VEC_COMP_BASE + vector, &eqn, &irq);
> + err = mlx5_vector2eqn(dev, vector, &eqn, &irq);
>   if (err)
>   return NULL;
>  
>   desc = irq_to_desc(irq);
> -#ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK
> - mask = irq_data_get_effective_affinity_mask(&desc->irq_data);
> -#else
> - mask = desc->irq_common_data.affinity;
> -#endif
> - return mask;
> + return desc->affinity_hint;
>  }
>  
>  #endif /* MLX5_DRIVER_H */
> -- 
> 2.7.4

Re: [PATCH iproute2] rdma: fix header files

2018-05-05 Thread David Ahern

On 5/4/18 10:58 PM, Stephen Hemminger wrote:
> On Fri, 4 May 2018 16:13:07 -0600
> David Ahern  wrote:
> 
>> On 5/4/18 3:56 PM, Stephen Hemminger wrote:
>>> All user api headers in iproute2 should be in include/uapi
>>> so that script can be used to put correct sanitized kernel headers
>>> there. And the header files for rdma must be a complete set; if one
>>> header file includes another, all must be present.
>>>
>>> This fixes build on older distributions, and Windows Services
>>> for Linux.
>>>
>>> Signed-off-by: Stephen Hemminger 
>>> ---
>>>  include/uapi/rdma/ib_user_sa.h|   77 ++
>>>  include/uapi/rdma/ib_user_verbs.h | 1210 +
>>>  .../uapi/rdma/rdma_netlink.h  |   13 +
>>>  .../uapi/rdma/rdma_user_cm.h  |6 +-
>>>  4 files changed, 1303 insertions(+), 3 deletions(-)
>>>  create mode 100644 include/uapi/rdma/ib_user_sa.h
>>>  create mode 100644 include/uapi/rdma/ib_user_verbs.h
>>>  rename {rdma/include => include}/uapi/rdma/rdma_netlink.h (95%)
>>>  rename {rdma/include => include}/uapi/rdma/rdma_user_cm.h (98%)
>>>   
>>
>> Stephen:
>>
>> Per a recent discussion the RDMA folks need to take ownership of the
>> uapi files. RDMA features do not hit Dave's net-next tree so the rdma
>> code can never hit iproute2-next during a dev cycle.
> 
> I want all uapi headers in include/uapi because it avoids possible overlap 
> problems,
> During the linux-net/linus release cycle they should match what is Linus's 
> tree.
> 
> During the net-next they can come from two sources.
> 

That creates extra work for me for no reason.

You state above "user api headers in iproute2 should be in include/uapi
so that script can be used to put correct sanitized kernel headers there."

With RDMA's development cycle that will *never* happen. With the
exception of RDMA, all iproute2 features go through net-next and it is
the right tree to pull updates from. Every time I sync from net-next the
header files for rdma will have to be ignored, so they will never be
updated through this mechanism which means the stated goal is not
achievable.

As for linux-next, I will not sync header files to a tree that
disappears; it breaks all traceability. Further, it seems to me that it
does not really solve the problem. I forget the steps now but RDMA
features have to hit some development tree before going to Linus, so
there will be a delay with the headers.

Back in March we discussed options. iproute2 is nothing more than a
delivery vehicle for rdmatool. Since it breaks everything else about the
iproute2 and net-next association, the simplest option for everyone is
for the rdma group to control syncing their own headers and putting them
under rdma directory.

[PATCH net] tls: fix use after free in tls_sk_proto_close

2018-05-05 Thread Eric Dumazet

syzbot reported a use-after-free in tls_sk_proto_close

Add a boolean value to cleanup a bit this function.

BUG: KASAN: use-after-free in tls_sk_proto_close+0x8ab/0x9c0 
net/tls/tls_main.c:297
Read of size 1 at addr 8801ae40a858 by task syz-executor363/4503

CPU: 0 PID: 4503 Comm: syz-executor363 Not tainted 4.17.0-rc3+ #34
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
 tls_sk_proto_close+0x8ab/0x9c0 net/tls/tls_main.c:297
 inet_release+0x104/0x1f0 net/ipv4/af_inet.c:427
 inet6_release+0x50/0x70 net/ipv6/af_inet6.c:460
 sock_release+0x96/0x1b0 net/socket.c:594
 sock_close+0x16/0x20 net/socket.c:1149
 __fput+0x34d/0x890 fs/file_table.c:209
 fput+0x15/0x20 fs/file_table.c:243
 task_work_run+0x1e4/0x290 kernel/task_work.c:113
 exit_task_work include/linux/task_work.h:22 [inline]
 do_exit+0x1aee/0x2730 kernel/exit.c:865
 do_group_exit+0x16f/0x430 kernel/exit.c:968
 get_signal+0x886/0x1960 kernel/signal.c:2469
 do_signal+0x98/0x2040 arch/x86/kernel/signal.c:810
 exit_to_usermode_loop+0x28a/0x310 arch/x86/entry/common.c:162
 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
 do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4457b9
RSP: 002b:7fdf4d766da8 EFLAGS: 0246 ORIG_RAX: 00ca
RAX: fe00 RBX: 006dac3c RCX: 004457b9
RDX:  RSI:  RDI: 006dac3c
RBP:  R08:  R09: 
R10:  R11: 0246 R12: 006dac38
R13: 3692738801137283 R14: 6bf92c39443c4c1d R15: 0006

Allocated by task 4498:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
 kmem_cache_alloc_trace+0x152/0x780 mm/slab.c:3620
 kmalloc include/linux/slab.h:512 [inline]
 kzalloc include/linux/slab.h:701 [inline]
 create_ctx net/tls/tls_main.c:521 [inline]
 tls_init+0x1f9/0xb00 net/tls/tls_main.c:633
 tcp_set_ulp+0x1bc/0x520 net/ipv4/tcp_ulp.c:153
 do_tcp_setsockopt.isra.39+0x44a/0x2600 net/ipv4/tcp.c:2588
 tcp_setsockopt+0xc1/0xe0 net/ipv4/tcp.c:2893
 sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3039
 __sys_setsockopt+0x1bd/0x390 net/socket.c:1903
 __do_sys_setsockopt net/socket.c:1914 [inline]
 __se_sys_setsockopt net/socket.c:1911 [inline]
 __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 4503:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kfree+0xd9/0x260 mm/slab.c:3813
 tls_sw_free_resources+0x2a3/0x360 net/tls/tls_sw.c:1037
 tls_sk_proto_close+0x67c/0x9c0 net/tls/tls_main.c:288
 inet_release+0x104/0x1f0 net/ipv4/af_inet.c:427
 inet6_release+0x50/0x70 net/ipv6/af_inet6.c:460
 sock_release+0x96/0x1b0 net/socket.c:594
 sock_close+0x16/0x20 net/socket.c:1149
 __fput+0x34d/0x890 fs/file_table.c:209
 fput+0x15/0x20 fs/file_table.c:243
 task_work_run+0x1e4/0x290 kernel/task_work.c:113
 exit_task_work include/linux/task_work.h:22 [inline]
 do_exit+0x1aee/0x2730 kernel/exit.c:865
 do_group_exit+0x16f/0x430 kernel/exit.c:968
 get_signal+0x886/0x1960 kernel/signal.c:2469
 do_signal+0x98/0x2040 arch/x86/kernel/signal.c:810
 exit_to_usermode_loop+0x28a/0x310 arch/x86/entry/common.c:162
 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
 do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at 8801ae40a800
 which belongs to the cache kmalloc-256 of size 256
The buggy address is located 88 bytes inside of
 256-byte region [8801ae40a800, 8801ae40a900)
The buggy address belongs to the page:
page:ea0006b90280 count:1 mapcount:0 mapping:8801ae40a080 index:0x0
flags: 0x2fffc000100(slab)
raw: 02fffc000100 8801ae40a080  0001000c
raw: ea0006bea9e0 ea0006bc94a0 8801da8007c0 
page dumped because: kasan: bad access detected

Fixes: dd0bed1665d6 ("tls: support for Inline tls record")
Signed-off-by: Eric Dumazet 
Cc: Atul Gupta 
Cc: Steve Wise 
Cc: Ilya Lesokhin 
Cc: Aviad Yehezkel 
Cc: Dave Watson 
Reported-by: syzbot 
---
 net/tls/tls_main.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(

Re: [PATCH] net: dsa: drop some VLAs in switch.c

2018-05-05 Thread Andrew Lunn

On Sat, May 05, 2018 at 12:36:36PM +0200, Salvatore Mesoraca wrote:
> 2018-03-13 21:06 GMT+01:00 Florian Fainelli :
> > On 03/13/2018 12:58 PM, Vivien Didelot wrote:
> >> Hi Salvatore,
> >>
> >> Salvatore Mesoraca  writes:
> >>
> >>> dsa_switch's num_ports is currently fixed to DSA_MAX_PORTS. So we avoid
> >>> 2 VLAs[1] by using DSA_MAX_PORTS instead of ds->num_ports.
> >>>
> >>> [1] https://lkml.org/lkml/2018/3/7/621
> >>>
> >>> Signed-off-by: Salvatore Mesoraca 
> >>
> >> NAK.
> >>
> >> We are in the process to remove hardcoded limits such as DSA_MAX_PORTS
> >> and DSA_MAX_SWITCHES, so we have to stick with ds->num_ports.
> >
> > Then this means that we need to allocate a bitmap from the heap, which
> > sounds a bit superfluous and could theoretically fail... not sure which
> > way is better, but bumping the size to DSA_MAX_PORTS definitively does
> > help people working on enabling -Wvla.
> 
> Hi Florian,
> 
> Should I consider this patch still NAKed or not?
> Should I resend the patch with some modifications?

Hi Salvatore

We have been removing all uses of DSA_MAX_PORTS. I don't particularly
like arbitrary limits on how many ports a switch can have, or how many
switches a board can have.

So i would prefer to not use DSA_MAX_PORTS here.

You could make the bitmap part of the dsa_switch structure. This is
allocated by dsa_switch_alloc() and is passed the number of ports.
Doing the allocation there means you don't need to worry about it
failing in dsa_switch_mdb_add() or dsa_switch_vlan_add().

Andrew

Re: [PATCH net-next v2 02/13] net: phy: sfp: handle non-wired SFP connectors

2018-05-05 Thread Thomas Petazzoni

Hello,

On Fri, 4 May 2018 19:23:37 +0200, Antoine Tenart wrote:

> On Fri, May 04, 2018 at 10:04:48AM -0700, Florian Fainelli wrote:
> > On 05/04/2018 06:56 AM, Antoine Tenart wrote:  
> > > SFP connectors can be solder on a board without having any of their pins
> > > (LOS, i2c...) wired. In such cases the SFP link state cannot be guessed,
> > > and the overall link status reporting is left to other layers.
> > > 
> > > In order to achieve this, a new SFP_DEV status is added, named UNKNOWN.
> > > This mode is set when it is not possible for the SFP code to get the
> > > link status and as a result the link status is reported to be always UP
> > > from the SFP point of view.  
> > 
> > Why represent the SFP in Device Tree then? Why not just declare this is
> > a fixed link which would avoid having to introduce this "unknown" state.  
> 
> The other solution would have been to represent this as a fixed-link.
> But such a representation would report the link as being up all the
> time, which is something we wanted to avoid as the GoP in PPv2 can
> report some link status. This is achieved using SFP+phylink+PPv2.
> 
> And representing the SFP cage in the device tree, although it's a
> "dummy" one, helps describing the hardware.

Just to add to this: the board physically has a SFP cage, and a cable
can be connected to it, or not. So it is absolutely not a fixed link
(cable can be connected or not) and it really is a SFP cage.

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com

[PATCH v2 1/1] drivers core: multi-threading device shutdown

2018-05-05 Thread Pavel Tatashin

When system is rebooted, halted or kexeced device_shutdown() is
called.

This function shuts down every single device by calling either:

dev->bus->shutdown(dev)
dev->driver->shutdown(dev)

Even on a machine with just a moderate amount of devices, device_shutdown()
may take multiple seconds to complete. This is because many devices require
a specific delays to perform this operation.

Here is a sample analysis of time it takes to call device_shutdown() on a
two socket Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz machine.

device_shutdown 2.95s
-
 mlx4_shutdown  1.14s
 megasas_shutdown   0.24s
 ixgbe_shutdown 0.37s x 4 (four ixgbe devices on this machine).
 the rest   0.09s

In mlx4 we spent the most time, but that is because there is a 1 second
sleep, which is defined by hardware specifications:
mlx4_shutdown
 mlx4_unload_one
  mlx4_free_ownership
   msleep(1000)

With megasas we spend a quarter of a second, but sometimes longer (up-to
0.5s) in this path:

megasas_shutdown
  megasas_flush_cache
megasas_issue_blocked_cmd
  wait_event_timeout

Finally, with ixgbe_shutdown() it takes 0.37 for each device, but that time
is spread all over the place, with bigger offenders:

ixgbe_shutdown
  __ixgbe_shutdown
ixgbe_close_suspend
  ixgbe_down
ixgbe_init_hw_generic
  ixgbe_reset_hw_X540
msleep(100);0.104483472
ixgbe_get_san_mac_addr_generic  0.048414851
ixgbe_get_wwn_prefix_generic0.048409893
  ixgbe_start_hw_X540
ixgbe_start_hw_generic
  ixgbe_clear_hw_cntrs_generic  0.048581502
  ixgbe_setup_fc_generic0.024225800

All the ixgbe_*generic functions end-up calling:
ixgbe_read_eerd_X540()
  ixgbe_acquire_swfw_sync_X540
usleep_range(5000, 6000);
  ixgbe_release_swfw_sync_X540
usleep_range(5000, 6000);

While these are short sleeps, they end-up calling them over 24 times!
24 * 0.0055s = 0.132s. Adding-up to 0.528s for four devices. Also we have
four msleep(100). Totaling to:  0.928s

While we should keep optimizing the individual device drivers, in some
cases this is simply a hardware property that forces a specific delay, and
we must wait.

So, the solution for this problem is to shutdown devices in parallel.
However, we must shutdown children before shutting down parents, so parent
device must wait for its children to finish.

With this patch, on the same machine devices_shutdown() takes 1.142s, and
without mlx4 one second delay only 0.38s

Signed-off-by: Pavel Tatashin 
---
 drivers/base/core.c | 275 
 1 file changed, 225 insertions(+), 50 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index b610816eb887..110d7970deef 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "base.h"
 #include "power/power.h"
@@ -2102,6 +2103,59 @@ const char *device_get_devnode(struct device *dev,
return *tmp = s;
 }
 
+/**
+ * device_children_count - device children count
+ * @parent: parent struct device.
+ *
+ * Returns number of children for this device or 0 if none.
+ */
+static int device_children_count(struct device *parent)
+{
+   struct klist_iter i;
+   int children = 0;
+
+   if (!parent->p)
+   return 0;
+
+   klist_iter_init(&parent->p->klist_children, &i);
+   while (next_device(&i))
+   children++;
+   klist_iter_exit(&i);
+
+   return children;
+}
+
+/**
+ * device_get_child_by_index - Return child using the provided index.
+ * @parent: parent struct device.
+ * @index:  Index of the child, where 0 is the first child in the children 
list,
+ * and so on.
+ *
+ * Returns child or NULL if child with this index is not present.
+ */
+static struct device *
+device_get_child_by_index(struct device *parent, int index)
+{
+   struct klist_iter i;
+   struct device *dev = NULL, *d;
+   int child_index = 0;
+
+   if (!parent->p || index < 0)
+   return NULL;
+
+   klist_iter_init(&parent->p->klist_children, &i);
+   while ((d = next_device(&i))) {
+   if (child_index == index) {
+   dev = d;
+   break;
+   }
+   child_index++;
+   }
+   klist_iter_exit(&i);
+
+   return dev;
+}
+
 /**
  * device_for_each_child - device child iterator.
  * @parent: parent struct device.
@@ -2765,71 +2819,192 @@ int device_move(struct device *dev, struct device 
*new_parent,
 }
 EXPORT_SYMBOL_GPL(device_move);
 
+/*
+ * device_shutdown_one - call ->shutdown() for the device passed as
+ * argument.
+ */
+static void device_shutdown_one(struct device *dev)
+{
+   /* Don't allow any mo

Re: [PATCH 7/8] rhashtable: add rhashtable_walk_prev()

2018-05-05 Thread Tom Herbert

On Sat, May 5, 2018 at 2:43 AM, Herbert Xu  wrote:
> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> rhashtable_walk_prev() returns the object returned by
>> the previous rhashtable_walk_next(), providing it is still in the
>> table (or was during this grace period).
>> This works even if rhashtable_walk_stop() and rhashtable_talk_start()
>> have been called since the last rhashtable_walk_next().
>>
>> If there have been no calls to rhashtable_walk_next(), or if the
>> object is gone from the table, then NULL is returned.
>>
>> This can usefully be used in a seq_file ->start() function.
>> If the pos is the same as was returned by the last ->next() call,
>> then rhashtable_walk_prev() can be used to re-establish the
>> current location in the table.  If it returns NULL, then
>> rhashtable_walk_next() should be used.
>>
>> Signed-off-by: NeilBrown 
>
> I will ack this if Tom is OK with replacing peek with it.
>
I'm not following why this is any better than peek. The point of
having rhashtable_walk_peek is to to allow the caller to get then
current element not the next one. This is needed when table is read in
multiple parts and we need to pick up with what was returned from the
last call to rhashtable_walk_next (which apparently is what this patch
is also trying to do).

There is one significant difference in that peek will return the
element in the table regardless of where the iterator is at (this is
why peek can move the iterator) and only returns NULL at end of table.
As mentioned above rhashtable_walk_prev can return NULL so then caller
needs and so rhashtable_walk_next "should be used" to get the previous
element. Doing a peek is a lot cleaner and more straightforward API in
this regard.

Tom

> Cheers,
> --
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[PATCH v2 0/1] multi-threading device shutdown

2018-05-05 Thread Pavel Tatashin

Changelog

v1 - v2
- It turns out we cannot lock more than MAX_LOCK_DEPTH by a single
  thread. (By default this value is 48), and is used to detect
  deadlocks. So, I re-wrote the code to only lock one devices per
  thread instead of pre-locking all devices by the main thread.
- Addressed comments from Tobin C. Harding.
- As suggested by Alexander Duyck removed ixgbe changes. It can be
  done as a separate work scaling RTNL mutex.

Do a faster shutdown by calling dev->*->shutdown(dev) in parallel.
device_shutdown() calls these functions for every single device but
only using one thread.

Since, nothing else is running on the machine by the device_shutdown()
s called, there is no reason not to utilize all the available CPU
resources.

Pavel Tatashin (1):
  drivers core: multi-threading device shutdown

 drivers/base/core.c | 275 
 1 file changed, 225 insertions(+), 50 deletions(-)

-- 
2.17.0

Re: [PATCH v2 net-next 1/4] umh: introduce fork_usermode_blob() helper

2018-05-05 Thread Alexei Starovoitov

On Sat, May 05, 2018 at 12:48:24AM -0400, Jann Horn wrote:
> On Thu, May 3, 2018 at 12:36 AM, Alexei Starovoitov  wrote:
> > Introduce helper:
> > int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
> > struct umh_info {
> >struct file *pipe_to_umh;
> >struct file *pipe_from_umh;
> >pid_t pid;
> > };
> >
> > that GPLed kernel modules (signed or unsigned) can use it to execute part
> > of its own data as swappable user mode process.
> >
> > The kernel will do:
> > - mount "tmpfs"
> > - allocate a unique file in tmpfs
> > - populate that file with [data, data + len] bytes
> > - user-mode-helper code will do_execve that file and, before the process
> >   starts, the kernel will create two unix pipes for bidirectional
> >   communication between kernel module and umh
> > - close tmpfs file, effectively deleting it
> > - the fork_usermode_blob will return zero on success and populate
> >   'struct umh_info' with two unix pipes and the pid of the user process
> >
> > As the first step in the development of the bpfilter project
> > the fork_usermode_blob() helper is introduced to allow user mode code
> > to be invoked from a kernel module. The idea is that user mode code plus
> > normal kernel module code are built as part of the kernel build
> > and installed as traditional kernel module into distro specified location,
> > such that from a distribution point of view, there is
> > no difference between regular kernel modules and kernel modules + umh code.
> > Such modules can be signed, modprobed, rmmod, etc. The use of this new 
> > helper
> > by a kernel module doesn't make it any special from kernel and user space
> > tooling point of view.
> [...]
> > +static struct vfsmount *umh_fs;
> > +
> > +static int init_tmpfs(void)
> > +{
> > +   struct file_system_type *type;
> > +
> > +   if (umh_fs)
> > +   return 0;
> > +   type = get_fs_type("tmpfs");
> > +   if (!type)
> > +   return -ENODEV;
> > +   umh_fs = kern_mount(type);
> > +   if (IS_ERR(umh_fs)) {
> > +   int err = PTR_ERR(umh_fs);
> > +
> > +   put_filesystem(type);
> > +   umh_fs = NULL;
> > +   return err;
> > +   }
> > +   return 0;
> > +}
> 
> Should init_tmpfs() be holding some sort of mutex if it's fiddling
> with `umh_fs`? The current code only calls it in initcall context, but
> if that ever changes and two processes try to initialize the tmpfs at
> the same time, a few things could go wrong.

I thought that module loading is serialized, so calls to
fork_usermode_blob() will be serialized as well, but looking at the code
again that doesn't seem to be the case, so need to revisit not only
this function, but the rest of it too.

> I guess Luis' suggestion (putting a call to init_tmpfs() in
> do_basic_setup()) might be the easiest way to get rid of that problem.

I still think that two mounts where umh mount is dynamic is cleaner.
Why waste the mount if no module uses this helper?
I'm thinking to wrap init_tmpfs into DO_ONCE instead or use a mutex.
Looks like shmem_file_setup_with_mnt() can be called in parallel
on the same mount, so that should be fine.

Re: WARNING in kernfs_add_one

2018-05-05 Thread Greg KH

On Sat, May 05, 2018 at 08:47:02AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:8fb11a9a8d51 net/ipv6: rename rt6_next to fib6_next
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=14b2723780
> kernel config:  https://syzkaller.appspot.com/x/.config?x=c416c61f3cd96be
> dashboard link: https://syzkaller.appspot.com/bug?extid=df47f81c226b31d89fb1
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=172fb3e780
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=16552e5780
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+df47f81c226b31d89...@syzkaller.appspotmail.com
> 
> RBP: 7fff808f3e10 R08: 0002 R09: 7fff80003534
> R10:  R11: 0246 R12: 
> R13: 0006 R14:  R15: 
> [ cut here ]
> kernfs: ns required in 'ieee80211' for 'phy3'

That's interesting, this looks like a netfilter bug (adding netdev to
the report here.)

Yes, we can "tone down" the kernfs warning to just be an error message
in the log, but there might be something worse going on here.

Network developers, any idea?  Rest of the callback chain is here:


> WARNING: CPU: 0 PID: 4538 at fs/kernfs/dir.c:759 kernfs_add_one+0x406/0x4d0
> fs/kernfs/dir.c:758
> Kernel panic - not syncing: panic_on_warn set ...
> 
> CPU: 0 PID: 4538 Comm: syz-executor486 Not tainted 4.17.0-rc3+ #33
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>  panic+0x22f/0x4de kernel/panic.c:184
>  __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
>  report_bug+0x252/0x2d0 lib/bug.c:186
>  fixup_bug arch/x86/kernel/traps.c:178 [inline]
>  do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
> RIP: 0010:kernfs_add_one+0x406/0x4d0 fs/kernfs/dir.c:758
> RSP: 0018:8801ca9eece0 EFLAGS: 00010286
> RAX: 002d RBX: 87d5cee0 RCX: 8160ba7d
> RDX:  RSI: 81610731 RDI: 8801ca9ee840
> RBP: 8801ca9eed20 R08: 8801d9538500 R09: 0006
> R10: 8801d9538500 R11:  R12: 8801ad1cb6c0
> R13: 885da640 R14: 0020 R15: 
>  kernfs_create_link+0x112/0x180 fs/kernfs/symlink.c:41
>  sysfs_do_create_link_sd.isra.2+0x90/0x130 fs/sysfs/symlink.c:43
>  sysfs_do_create_link fs/sysfs/symlink.c:79 [inline]
>  sysfs_create_link+0x65/0xc0 fs/sysfs/symlink.c:91
>  device_add_class_symlinks drivers/base/core.c:1612 [inline]
>  device_add+0x7a0/0x16d0 drivers/base/core.c:1810
>  wiphy_register+0x178a/0x2430 net/wireless/core.c:806
>  ieee80211_register_hw+0x13cd/0x35d0 net/mac80211/main.c:1047
>  mac80211_hwsim_new_radio+0x1d9b/0x3410
> drivers/net/wireless/mac80211_hwsim.c:2772
>  hwsim_new_radio_nl+0x7a7/0xa60 drivers/net/wireless/mac80211_hwsim.c:3246
>  genl_family_rcv_msg+0x889/0x1120 net/netlink/genetlink.c:599
>  genl_rcv_msg+0xc6/0x170 net/netlink/genetlink.c:624
>  netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
>  genl_rcv+0x28/0x40 net/netlink/genetlink.c:635
>  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
>  netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
>  netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
>  sock_sendmsg_nosec net/socket.c:629 [inline]
>  sock_sendmsg+0xd5/0x120 net/socket.c:639
>  ___sys_sendmsg+0x805/0x940 net/socket.c:2117
>  __sys_sendmsg+0x115/0x270 net/socket.c:2155
>  __do_sys_sendmsg net/socket.c:2164 [inline]
>  __se_sys_sendmsg net/socket.c:2162 [inline]
>  __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2162
>  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x4404c9
> RSP: 002b:7fff808f3e08 EFLAGS: 0246 ORIG_RAX: 002e
> RAX: ffda RBX:  RCX: 004404c9
> RDX:  RSI: 20b3dfc8 RDI: 0005
> RBP: 7fff808f3e10 R08: 0002 R09: 7fff80003534
> R10:  R11: 0246 R12: 
> R13: 0006 R14:  R15: 
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..


Any ideas?

thanks,

greg k-h

Re: [PATCH rdma-next] MAINTAINERS: Remove bouncing @mellanox.com addresses

2018-05-05 Thread Matan Barak

On Fri, May 4, 2018 at 12:37 AM, Doug Ledford  wrote:
> On 5/3/2018 5:11 PM, Or Gerlitz wrote:
>> On Thu, May 3, 2018 at 9:37 PM, LR wrote:
>>
>>>  MELLANOX MLX5 core VPI driver
>>>  M: Saeed Mahameed 
>>> -M: Matan Barak 
>>
>> Goodbye Matan!
>>
>> You were a long time developer, maintainer, hacker and a very deeply 
>> thinking,
>> pleasant, nice and open person in our team, enjoy your new adventures and 
>> thanks
>> a lot for your long time contributions to the upstream kernel
>
> Indeed, Matan was always a pleasure to work with.  Best of luck on
> whatever you are doing next!
>
>

Thanks Or and Doug!
It was a real pleasure being part of the RDMA and netdev communities,
while seeing the growth of the RDMA community in the last few years
and the importance it gained.
I've enjoyed collaborating with you guys, adding features and
capabilites to our subsystem and drivers.
Hope our paths will cross again sometime in the future:)

Re: [net-next PATCH v2 2/8] udp: Verify that pulling UDP header in GSO segmentation doesn't fail

2018-05-05 Thread Alexander Duyck

On Sat, May 5, 2018 at 1:12 AM, Willem de Bruijn
 wrote:
> On Fri, May 4, 2018 at 8:29 PM, Alexander Duyck
>  wrote:
>> From: Alexander Duyck 
>>
>> We should verify that we can pull the UDP header before we attempt to do
>> so. Otherwise if this fails we have no way of knowing and GSO will not work
>> correctly.
>
> This already happened in the callers udp[46]_ufo_fragment

Ah. Okay I didn't see that. I may add a comment somewhere indicating
that is the case as it is a bit buried with all the calls.

Thanks.

- Alex

Re: [net-next PATCH v2 5/8] udp: Partially unroll handling of first segment and last segment

2018-05-05 Thread Alexander Duyck

On Sat, May 5, 2018 at 1:37 AM, Willem de Bruijn
 wrote:
> On Fri, May 4, 2018 at 8:30 PM, Alexander Duyck
>  wrote:
>> From: Alexander Duyck 
>>
>> This patch allows us to take care of unrolling the first segment and the
>> last segment
>
> Only the last segment needs to be unrolled, right?

I need the first segment as I have to get access to the UDP header to
recompute what the checksum should actually be for the first frame and
all the frames in between.

>> of the loop for processing the segmented skb. Part of the
>> motivation for this is that it makes it easier to process the fact that the
>> first fame and all of the frames in between should be mostly identical
>> in terms of header data, and the last frame has differences in the length
>> and partial checksum.
>>
>> In addition I am dropping the header length calculation since we don't
>> really need it for anything but the last frame and it can be easily
>> obtained by just pulling the data_len and offset of tail from the transport
>> header.
>>
>> Signed-off-by: Alexander Duyck 
>> ---
>>
>> v2: New break-out patch based on one patch from earlier series
>>
>>  net/ipv4/udp_offload.c |   35 ---
>>  1 file changed, 20 insertions(+), 15 deletions(-)
>>
>> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
>> index 5c4bb8b9e83e..946d06d2aa0c 100644
>> --- a/net/ipv4/udp_offload.c
>> +++ b/net/ipv4/udp_offload.c
>> @@ -193,7 +193,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
>> *gso_skb,
>> struct sk_buff *seg, *segs = ERR_PTR(-EINVAL);
>> struct sock *sk = gso_skb->sk;
>> unsigned int sum_truesize = 0;
>> -   unsigned int hdrlen;
>> struct udphdr *uh;
>> unsigned int mss;
>> __sum16 check;
>> @@ -206,7 +205,6 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
>> *gso_skb,
>> if (!pskb_may_pull(gso_skb, sizeof(*uh)))
>> goto out;
>>
>> -   hdrlen = gso_skb->data - skb_mac_header(gso_skb);
>> skb_pull(gso_skb, sizeof(*uh));
>>
>> /* clear destructor to avoid skb_segment assigning it to tail */
>> @@ -219,7 +217,8 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
>> *gso_skb,
>> return segs;
>> }
>>
>> -   uh = udp_hdr(segs);
>> +   seg = segs;
>> +   uh = udp_hdr(seg);
>>
>> /* compute checksum adjustment based on old length versus new */
>> newlen = htons(sizeof(*uh) + mss);
>> @@ -227,25 +226,31 @@ struct sk_buff *__udp_gso_segment(struct sk_buff 
>> *gso_skb,
>> ((__force u32)uh->len ^ 0x) +
>> (__force u32)newlen));
>>
>> -   for (seg = segs; seg; seg = seg->next) {
>> -   uh = udp_hdr(seg);
>> +   for (;;) {
>> +   seg->destructor = sock_wfree;
>> +   seg->sk = sk;
>> +   sum_truesize += seg->truesize;
>>
>> -   /* last packet can be partial gso_size */
>> -   if (!seg->next) {
>> -   newlen = htons(seg->len - hdrlen);
>> -   check = ~csum_fold((__force __wsum)((__force 
>> u32)uh->check +
>> -   ((__force 
>> u32)uh->len ^ 0x) +
>> -   (__force 
>> u32)newlen));
>> -   }
>> +   if (!seg->next)
>> +   break;
>
> Not critical, but I find replacing
>
>   for (seg = segs; seg; seg = seg->next) {
> uh = udp_hdr(seg);
> ...
>   }
>
> with
>
>   uh = udp_hdr(seg);
>   for (;;) {
> ...
> if (!seg->next)
>   break;
>
> uh = udp_hdr(seg);
>   }
>
> much less easy to parse and it inflates patch size. How about just
>
>   - for (seg = segs; seg; seg = seg->next)
>   + for( seg = segs; seg->next; seg = seg->next)
>
>

The problem is I need access to the UDP header before I start the
loop. That was why I pulled seg = segs and the "uh = udp_hdr(seg)" out
seperately

>>
>> uh->len = newlen;
>> uh->check = check;
>>
>> -   seg->destructor = sock_wfree;
>> -   seg->sk = sk;
>> -   sum_truesize += seg->truesize;
>> +   seg = seg->next;
>> +   uh = udp_hdr(seg);
>> }
>>
>> +   /* last packet can be partial gso_size, account for that in checksum 
>> */
>> +   newlen = htons(skb_tail_pointer(seg) - skb_transport_header(seg) +
>> +  seg->data_len);
>> +   check = ~csum_fold((__force __wsum)((__force u32)uh->check +
>> +   ((__force u32)uh->len ^ 0x)  
>> +
>> +   (__force u32)newlen));
>> +   uh->len = newlen;
>> +   uh->check = check;
>> +
>> +   /* update refcount for the packet */
>> refcount_add(sum_truesize - gso_skb->truesize, &sk->sk_wmem_all

Re: [net-next PATCH v2 4/8] udp: Do not pass checksum as a parameter to GSO segmentation

2018-05-05 Thread Alexander Duyck

On Sat, May 5, 2018 at 3:01 AM, Willem de Bruijn
 wrote:
> On Fri, May 4, 2018 at 8:30 PM, Alexander Duyck
>  wrote:
>> From: Alexander Duyck 
>>
>> This patch is meant to allow us to avoid having to recompute the checksum
>> from scratch and have it passed as a parameter.
>>
>> Instead of taking that approach we can take advantage of the fact that the
>> length that was used to compute the existing checksum is included in the
>> UDP header. If we cancel that out by adding the value XOR with 0x we
>> can then just add the new length in and fold that into the new result.
>>
>> I think this may be fixing a checksum bug in the original code as well
>> since the checksum that was passed included the UDP header in the checksum
>> computation, but then excluded it for the adjustment on the last frame. I
>> believe this may have an effect on things in the cases where the two differ
>> by bits that would result in things crossing the byte boundaries.
>
> The replacement code, below, subtracts original payload size then adds
> the new payload size. mss here excludes the udp header size.
>
>> /* last packet can be partial gso_size */
>> -   if (!seg->next)
>> -   csum_replace2(&uh->check, htons(mss),
>> - htons(seg->len - hdrlen - 
>> sizeof(*uh)));

That is my point. When you calculated your checksum you included the
UDP header in the calculation.

-   return __udp_gso_segment(gso_skb, features,
-udp_v4_check(sizeof(struct udphdr) + mss,
- iph->saddr, iph->daddr, 0));

Basically the problem is in one spot you are adding the sizeof(struct
udphdr) + mss and then in another you are cancelling it out as mss and
trying to account for it by also dropping the UDP header from the
payload length of the value you are adding. That works in the cases
where the effect doesn't cause any issues with the byte ordering,
however I think when mss + 8 crosses a byte boundary it can lead to
issues since the calculation is done on a byte swapped value.

Re: WARNING in kernfs_add_one

2018-05-05 Thread Eric Dumazet



On 05/05/2018 09:40 AM, Greg KH wrote:
> On Sat, May 05, 2018 at 08:47:02AM -0700, syzbot wrote:
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit:8fb11a9a8d51 net/ipv6: rename rt6_next to fib6_next
>> git tree:   net-next
>> console output: https://syzkaller.appspot.com/x/log.txt?x=14b2723780
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=c416c61f3cd96be
>> dashboard link: https://syzkaller.appspot.com/bug?extid=df47f81c226b31d89fb1
>> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
>> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=172fb3e780
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=16552e5780
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+df47f81c226b31d89...@syzkaller.appspotmail.com
>>
>> RBP: 7fff808f3e10 R08: 0002 R09: 7fff80003534
>> R10:  R11: 0246 R12: 
>> R13: 0006 R14:  R15: 
>> [ cut here ]
>> kernfs: ns required in 'ieee80211' for 'phy3'
> 
> That's interesting, this looks like a netfilter bug (adding netdev to
> the report here.)


I do not see anything netfilter related here.

More likely wireless territory

> 
> Yes, we can "tone down" the kernfs warning to just be an error message
> in the log, but there might be something worse going on here.
> 
> Network developers, any idea?  Rest of the callback chain is here:
> 
> 
>> WARNING: CPU: 0 PID: 4538 at fs/kernfs/dir.c:759 kernfs_add_one+0x406/0x4d0
>> fs/kernfs/dir.c:758
>> Kernel panic - not syncing: panic_on_warn set ...
>>
>> CPU: 0 PID: 4538 Comm: syz-executor486 Not tainted 4.17.0-rc3+ #33
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:77 [inline]
>>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>>  panic+0x22f/0x4de kernel/panic.c:184
>>  __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
>>  report_bug+0x252/0x2d0 lib/bug.c:186
>>  fixup_bug arch/x86/kernel/traps.c:178 [inline]
>>  do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
>>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>>  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
>> RIP: 0010:kernfs_add_one+0x406/0x4d0 fs/kernfs/dir.c:758
>> RSP: 0018:8801ca9eece0 EFLAGS: 00010286
>> RAX: 002d RBX: 87d5cee0 RCX: 8160ba7d
>> RDX:  RSI: 81610731 RDI: 8801ca9ee840
>> RBP: 8801ca9eed20 R08: 8801d9538500 R09: 0006
>> R10: 8801d9538500 R11:  R12: 8801ad1cb6c0
>> R13: 885da640 R14: 0020 R15: 
>>  kernfs_create_link+0x112/0x180 fs/kernfs/symlink.c:41
>>  sysfs_do_create_link_sd.isra.2+0x90/0x130 fs/sysfs/symlink.c:43
>>  sysfs_do_create_link fs/sysfs/symlink.c:79 [inline]
>>  sysfs_create_link+0x65/0xc0 fs/sysfs/symlink.c:91
>>  device_add_class_symlinks drivers/base/core.c:1612 [inline]
>>  device_add+0x7a0/0x16d0 drivers/base/core.c:1810
>>  wiphy_register+0x178a/0x2430 net/wireless/core.c:806
>>  ieee80211_register_hw+0x13cd/0x35d0 net/mac80211/main.c:1047
>>  mac80211_hwsim_new_radio+0x1d9b/0x3410
>> drivers/net/wireless/mac80211_hwsim.c:2772
>>  hwsim_new_radio_nl+0x7a7/0xa60 drivers/net/wireless/mac80211_hwsim.c:3246
>>  genl_family_rcv_msg+0x889/0x1120 net/netlink/genetlink.c:599
>>  genl_rcv_msg+0xc6/0x170 net/netlink/genetlink.c:624
>>  netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
>>  genl_rcv+0x28/0x40 net/netlink/genetlink.c:635
>>  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
>>  netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
>>  netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
>>  sock_sendmsg_nosec net/socket.c:629 [inline]
>>  sock_sendmsg+0xd5/0x120 net/socket.c:639
>>  ___sys_sendmsg+0x805/0x940 net/socket.c:2117
>>  __sys_sendmsg+0x115/0x270 net/socket.c:2155
>>  __do_sys_sendmsg net/socket.c:2164 [inline]
>>  __se_sys_sendmsg net/socket.c:2162 [inline]
>>  __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2162
>>  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> RIP: 0033:0x4404c9
>> RSP: 002b:7fff808f3e08 EFLAGS: 0246 ORIG_RAX: 002e
>> RAX: ffda RBX:  RCX: 004404c9
>> RDX:  RSI: 20b3dfc8 RDI: 0005
>> RBP: 7fff808f3e10 R08: 0002 R09: 7fff80003534
>> R10:  R11: 0246 R12: 
>> R13: 0006 R14:  R15: 
>> Dumping ftrace buffer:
>>(ftrace buffer empty)
>> Kernel Offset: disabled
>> Rebooting in 86400 seconds..
> 
> 
> Any ideas?
> 
> thanks,
> 
> greg k-h
>

Re: [PATCH net-next v2 02/13] net: phy: sfp: handle non-wired SFP connectors

2018-05-05 Thread Andrew Lunn

On Sat, May 05, 2018 at 05:39:45PM +0200, Thomas Petazzoni wrote:
> Hello,
> 
> On Fri, 4 May 2018 19:23:37 +0200, Antoine Tenart wrote:
> 
> > On Fri, May 04, 2018 at 10:04:48AM -0700, Florian Fainelli wrote:
> > > On 05/04/2018 06:56 AM, Antoine Tenart wrote:  
> > > > SFP connectors can be solder on a board without having any of their pins
> > > > (LOS, i2c...) wired. In such cases the SFP link state cannot be guessed,
> > > > and the overall link status reporting is left to other layers.
> > > > 
> > > > In order to achieve this, a new SFP_DEV status is added, named UNKNOWN.
> > > > This mode is set when it is not possible for the SFP code to get the
> > > > link status and as a result the link status is reported to be always UP
> > > > from the SFP point of view.  
> > > 
> > > Why represent the SFP in Device Tree then? Why not just declare this is
> > > a fixed link which would avoid having to introduce this "unknown" state.  
> > 
> > The other solution would have been to represent this as a fixed-link.
> > But such a representation would report the link as being up all the
> > time, which is something we wanted to avoid as the GoP in PPv2 can
> > report some link status. This is achieved using SFP+phylink+PPv2.
> > 
> > And representing the SFP cage in the device tree, although it's a
> > "dummy" one, helps describing the hardware.
> 
> Just to add to this: the board physically has a SFP cage, and a cable
> can be connected to it, or not. So it is absolutely not a fixed link
> (cable can be connected or not) and it really is a SFP cage.

Hi Thomas

What i have heard on the rumour mill is that the hardware design is
FUBAR. The i2c-mux and i2c-gpio expanders are using the same
addresses. Or something like that. So the data plane from the MAC to
the SFP works. But the control plane is broken.

I don't really have a problem listing an SFP in device tree. As you
say, it physically exists on the boards. But because of the FUBAR
hardware, it cannot be controlled. phylink is all about the control
plane, and there is no control plane for this hardware. So connecting
this sfp to phylink seems very wrong. When we have no control plain,
we use a fixed-link.

Isn't this hardware a reference design? It is not a real product.  If
it is an RDK, i'm sure Marvell are telling people it is FUBAR, don't
copy it. There will be a new RDK sometime which has the problems
fixed. So how much effort should we put into supporting a broken RDK,
which nobody will copy into a real product? To me, KISS is the right
approach and document it in the device tree what the issue is.

If a real product comes to market which is equally FUBAR, we can then
consider how to get the best out of the hardware. We can extend
phylink to support a fixed link PHY, but still ask the MAC about its
link status.

 Andrew

possible deadlock in sk_diag_fill

2018-05-05 Thread syzbot


Hello,

syzbot found the following crash on:

HEAD commit:c1c07416cdd4 Merge tag 'kbuild-fixes-v4.17' of git://git.k..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=12164c9780
kernel config:  https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27
dashboard link: https://syzkaller.appspot.com/bug?extid=c1872be62e587eae9669
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
userspace arch: i386

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+c1872be62e587eae9...@syzkaller.appspotmail.com


==
WARNING: possible circular locking dependency detected
4.17.0-rc3+ #59 Not tainted
--
syz-executor1/25282 is trying to acquire lock:
4fddf743 (&(&u->lock)->rlock/1){+.+.}, at: sk_diag_dump_icons  
net/unix/diag.c:82 [inline]
4fddf743 (&(&u->lock)->rlock/1){+.+.}, at:  
sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144


but task is already holding lock:
b6895645 (rlock-AF_UNIX){+.+.}, at: spin_lock  
include/linux/spinlock.h:310 [inline]
b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_dump_icons  
net/unix/diag.c:64 [inline]
b6895645 (rlock-AF_UNIX){+.+.}, at:  
sk_diag_fill.isra.5+0x94e/0x10d0 net/unix/diag.c:144


which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (rlock-AF_UNIX){+.+.}:
   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
   _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
   skb_queue_tail+0x26/0x150 net/core/skbuff.c:2900
   unix_dgram_sendmsg+0xf77/0x1730 net/unix/af_unix.c:1797
   sock_sendmsg_nosec net/socket.c:629 [inline]
   sock_sendmsg+0xd5/0x120 net/socket.c:639
   ___sys_sendmsg+0x525/0x940 net/socket.c:2117
   __sys_sendmmsg+0x3bb/0x6f0 net/socket.c:2205
   __compat_sys_sendmmsg net/compat.c:770 [inline]
   __do_compat_sys_sendmmsg net/compat.c:777 [inline]
   __se_compat_sys_sendmmsg net/compat.c:774 [inline]
   __ia32_compat_sys_sendmmsg+0x9f/0x100 net/compat.c:774
   do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
   do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
   entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139

-> #0 (&(&u->lock)->rlock/1){+.+.}:
   lock_acquire+0x1dc/0x520 kernel/locking/lockdep.c:3920
   _raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
   sk_diag_dump_icons net/unix/diag.c:82 [inline]
   sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
   sk_diag_dump net/unix/diag.c:178 [inline]
   unix_diag_dump+0x35f/0x550 net/unix/diag.c:206
   netlink_dump+0x507/0xd20 net/netlink/af_netlink.c:2226
   __netlink_dump_start+0x51a/0x780 net/netlink/af_netlink.c:2323
   netlink_dump_start include/linux/netlink.h:214 [inline]
   unix_diag_handler_dump+0x3f4/0x7b0 net/unix/diag.c:307
   __sock_diag_cmd net/core/sock_diag.c:230 [inline]
   sock_diag_rcv_msg+0x2e0/0x3d0 net/core/sock_diag.c:261
   netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
   sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
   netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
   netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
   netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
   sock_sendmsg_nosec net/socket.c:629 [inline]
   sock_sendmsg+0xd5/0x120 net/socket.c:639
   sock_write_iter+0x35a/0x5a0 net/socket.c:908
   call_write_iter include/linux/fs.h:1784 [inline]
   new_sync_write fs/read_write.c:474 [inline]
   __vfs_write+0x64d/0x960 fs/read_write.c:487
   vfs_write+0x1f8/0x560 fs/read_write.c:549
   ksys_write+0xf9/0x250 fs/read_write.c:598
   __do_sys_write fs/read_write.c:610 [inline]
   __se_sys_write fs/read_write.c:607 [inline]
   __ia32_sys_write+0x71/0xb0 fs/read_write.c:607
   do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
   do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
   entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139

other info that might help us debug this:

 Possible unsafe locking scenario:

   CPU0CPU1
   
  lock(rlock-AF_UNIX);
   lock(&(&u->lock)->rlock/1);
   lock(rlock-AF_UNIX);
  lock(&(&u->lock)->rlock/1);

 *** DEADLOCK ***

5 locks held by syz-executor1/25282:
 #0: 3919e1bd (sock_diag_mutex){+.+.}, at: sock_diag_rcv+0x1b/0x40  
net/core/sock_diag.c:271
 #1: 4f328d3e (sock_diag_table_mutex){+.+.}, at: __sock_diag_cmd  
net/core/sock_diag.c:225 [inline]
 #1: 4f328d3e (sock_diag_table_mutex){+.+.}, at:  
sock_diag_rcv_msg+0x169/0x3d0 net

Re: [PATCH] net: dsa: drop some VLAs in switch.c

2018-05-05 Thread Kees Cook

On Sat, May 5, 2018 at 8:39 AM, Andrew Lunn  wrote:
> On Sat, May 05, 2018 at 12:36:36PM +0200, Salvatore Mesoraca wrote:
>> 2018-03-13 21:06 GMT+01:00 Florian Fainelli :
>> > On 03/13/2018 12:58 PM, Vivien Didelot wrote:
>> >> Hi Salvatore,
>> >>
>> >> Salvatore Mesoraca  writes:
>> >>
>> >>> dsa_switch's num_ports is currently fixed to DSA_MAX_PORTS. So we avoid
>> >>> 2 VLAs[1] by using DSA_MAX_PORTS instead of ds->num_ports.
>> >>>
>> >>> [1] https://lkml.org/lkml/2018/3/7/621
>> >>>
>> >>> Signed-off-by: Salvatore Mesoraca 
>> >>
>> >> NAK.
>> >>
>> >> We are in the process to remove hardcoded limits such as DSA_MAX_PORTS
>> >> and DSA_MAX_SWITCHES, so we have to stick with ds->num_ports.
>> >
>> > Then this means that we need to allocate a bitmap from the heap, which
>> > sounds a bit superfluous and could theoretically fail... not sure which
>> > way is better, but bumping the size to DSA_MAX_PORTS definitively does
>> > help people working on enabling -Wvla.
>>
>> Hi Florian,
>>
>> Should I consider this patch still NAKed or not?
>> Should I resend the patch with some modifications?
>
> Hi Salvatore
>
> We have been removing all uses of DSA_MAX_PORTS. I don't particularly
> like arbitrary limits on how many ports a switch can have, or how many
> switches a board can have.
>
> So i would prefer to not use DSA_MAX_PORTS here.
>
> You could make the bitmap part of the dsa_switch structure. This is
> allocated by dsa_switch_alloc() and is passed the number of ports.
> Doing the allocation there means you don't need to worry about it
> failing in dsa_switch_mdb_add() or dsa_switch_vlan_add().

Are dsa_switch_mdb_add() and dsa_switch_vlan_add() guaranteed to be
single-threaded?

-Kees

-- 
Kees Cook
Pixel Security

Re: [PATCH iproute2-next] bpf: don't offload perf array maps

2018-05-05 Thread David Ahern

On 5/4/18 6:37 PM, Jakub Kicinski wrote:
> Perf arrays are handled specially by the kernel, don't request
> offload even when used by an offloaded program.
> 
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Quentin Monnet 
> ---
>  lib/bpf.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)

applied to iproute2-next.

Re: [PATCH] net: dsa: drop some VLAs in switch.c

2018-05-05 Thread Andrew Lunn

> > You could make the bitmap part of the dsa_switch structure. This is
> > allocated by dsa_switch_alloc() and is passed the number of ports.
> > Doing the allocation there means you don't need to worry about it
> > failing in dsa_switch_mdb_add() or dsa_switch_vlan_add().
> 
> Are dsa_switch_mdb_add() and dsa_switch_vlan_add() guaranteed to be
> single-threaded?

Yes, that is the interesting question here against each other, or
themselves?

They are called from a notifier chain. It is the same notifier chain
for both dsa_switch_mdb_add() and dsa_switch_vlan_add().

notifier_call_chain() itself appears to not provide any guarantees
about the same handler being called in parallel.

It is dsa_port_notify() which is calling the notifier_call_chain().
This is being called by both dsa_port_vlan_add() and
dsa_port_mdb_add() in dsa_slave_port_obj_add(). This is a switchdev
op. switchdev_port_obj_add_now() does have ASSERT_RTNL(); So that
should serialize everything.

 Andrew

[PATCH net] net: dsa: mv88e6xxx: Fix PHY interrupts by parameterising PHY base address

2018-05-05 Thread Andrew Lunn

Most of the mv88e6xxx switches have the PHYs at address 0, 1, 2, ...
The 6341 however has the PHYs at 0x10, 0x11, 0x12. Add a parameter to
the info structure for this base address.

Testing of 6f88284f3bd7 ("net: dsa: mv88e6xxx: Add MDIO interrupts for
internal PHYs") was performed on the 6341. So it works only on the
6341. Use this base information to correctly set the interrupt.

Fixes: 6f88284f3bd7 ("net: dsa: mv88e6xxx: Add MDIO interrupts for internal 
PHYs")
Signed-off-by: Andrew Lunn 
---

This is a rather big patch for net, but it is simple. What it is
fixing is also not too big an issue. Since the interrupt number is in
the wrong place, phylib does not find it, and so does polling. So i
would understand if this goes to net-next, not net.

drivers/net/dsa/mv88e6xxx/chip.c| 26 ++
 drivers/net/dsa/mv88e6xxx/chip.h|  1 +
 drivers/net/dsa/mv88e6xxx/global2.c |  2 +-
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 3d2091099f7f..5b4374f21d76 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -3370,6 +3370,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.num_internal_phys = 5,
.max_vid = 4095,
.port_base_addr = 0x10,
+   .phy_base_addr = 0x0,
.global1_addr = 0x1b,
.global2_addr = 0x1c,
.age_time_coeff = 15000,
@@ -3391,6 +3392,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.num_internal_phys = 0,
.max_vid = 4095,
.port_base_addr = 0x10,
+   .phy_base_addr = 0x0,
.global1_addr = 0x1b,
.global2_addr = 0x1c,
.age_time_coeff = 15000,
@@ -3410,6 +3412,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.num_internal_phys = 8,
.max_vid = 4095,
.port_base_addr = 0x10,
+   .phy_base_addr = 0x0,
.global1_addr = 0x1b,
.global2_addr = 0x1c,
.age_time_coeff = 15000,
@@ -3431,6 +3434,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.num_internal_phys = 5,
.max_vid = 4095,
.port_base_addr = 0x10,
+   .phy_base_addr = 0x0,
.global1_addr = 0x1b,
.global2_addr = 0x1c,
.age_time_coeff = 15000,
@@ -3452,6 +3456,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.num_internal_phys = 0,
.max_vid = 4095,
.port_base_addr = 0x10,
+   .phy_base_addr = 0x0,
.global1_addr = 0x1b,
.global2_addr = 0x1c,
.age_time_coeff = 15000,
@@ -3472,6 +3477,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.num_gpio = 11,
.max_vid = 4095,
.port_base_addr = 0x10,
+   .phy_base_addr = 0x10,
.global1_addr = 0x1b,
.global2_addr = 0x1c,
.age_time_coeff = 3750,
@@ -3493,6 +3499,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.num_internal_phys = 5,
.max_vid = 4095,
.port_base_addr = 0x10,
+   .phy_base_addr = 0x0,
.global1_addr = 0x1b,
.global2_addr = 0x1c,
.age_time_coeff = 15000,
@@ -3514,6 +3521,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.num_internal_phys = 0,
.max_vid = 4095,
.port_base_addr = 0x10,
+   .phy_base_addr = 0x0,
.global1_addr = 0x1b,
.global2_addr = 0x1c,
.age_time_coeff = 15000,
@@ -3535,6 +3543,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.num_internal_phys = 5,
.max_vid = 4095,
.port_base_addr = 0x10,
+   .phy_base_addr = 0x0,
.global1_addr = 0x1b,
.global2_addr = 0x1c,
.age_time_coeff = 15000,
@@ -3557,6 +3566,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.num_gpio = 15,
.max_vid = 4095,
.port_base_addr = 0x10,
+   .phy_base_addr = 0x0,
.global1_addr = 0x1b,
.global2_addr = 0x1c,
.age_time_coeff = 15000,
@@ -3578,6 +3588,7 @@ static const struct mv88e6xxx_info mv88e6xxx_table[] = {
.num_internal_phys = 5,
.max_vid = 4095,
.port_base_addr = 0x10,
+   .phy_base_addr = 0x0,
.global1_addr = 0x1b,
.global2_addr = 0x1c,
.age_time_coeff = 15000,
@@ -3600,6 +3611,7 @@ static cons

[PATCH net-next 1/9] net: phy: phylink: Use gpiod_get_value_cansleep()

2018-05-05 Thread Florian Fainelli

The GPIO provider for the link GPIO line might require the use of the
_cansleep() API, utilize that. This is safe to do since we run in workqueue
context.

Fixes: 9525ae83959b ("phylink: add phylink infrastructure")
Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/phylink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index c582b2d7546c..412d1cf4fa66 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -360,7 +360,7 @@ static void phylink_get_fixed_state(struct phylink *pl, 
struct phylink_link_stat
if (pl->get_fixed_state)
pl->get_fixed_state(pl->netdev, state);
else if (pl->link_gpio)
-   state->link = !!gpiod_get_value(pl->link_gpio);
+   state->link = !!gpiod_get_value_cansleep(pl->link_gpio);
 }
 
 /* Flow control is resolved according to our and the link partners
-- 
2.17.0

[PATCH net-next 3/9] net: phy: phylink: Poll link GPIOs

2018-05-05 Thread Florian Fainelli

From: Russell King 

When using a fixed link with a link GPIO, we need to poll that GPIO to
determine link state changes. This is consistent with what fixed_phy.c does.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/phylink.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 6392b5248cf5..581ce93ecaf9 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "sfp.h"
@@ -54,6 +55,7 @@ struct phylink {
/* The link configuration settings */
struct phylink_link_state link_config;
struct gpio_desc *link_gpio;
+   struct timer_list link_poll;
void (*get_fixed_state)(struct net_device *dev,
struct phylink_link_state *s);
 
@@ -500,6 +502,15 @@ static void phylink_run_resolve(struct phylink *pl)
queue_work(system_power_efficient_wq, &pl->resolve);
 }
 
+static void phylink_fixed_poll(struct timer_list *t)
+{
+   struct phylink *pl = container_of(t, struct phylink, link_poll);
+
+   mod_timer(t, jiffies + HZ);
+
+   phylink_run_resolve(pl);
+}
+
 static const struct sfp_upstream_ops sfp_phylink_ops;
 
 static int phylink_register_sfp(struct phylink *pl,
@@ -572,6 +583,7 @@ struct phylink *phylink_create(struct net_device *ndev,
pl->link_config.an_enabled = true;
pl->ops = ops;
__set_bit(PHYLINK_DISABLE_STOPPED, &pl->phylink_disable_state);
+   timer_setup(&pl->link_poll, phylink_fixed_poll, 0);
 
bitmap_fill(pl->supported, __ETHTOOL_LINK_MODE_MASK_NBITS);
linkmode_copy(pl->link_config.advertising, pl->supported);
@@ -905,6 +917,8 @@ void phylink_start(struct phylink *pl)
clear_bit(PHYLINK_DISABLE_STOPPED, &pl->phylink_disable_state);
phylink_run_resolve(pl);
 
+   if (pl->link_an_mode == MLO_AN_FIXED && !IS_ERR(pl->link_gpio))
+   mod_timer(&pl->link_poll, jiffies + HZ);
if (pl->sfp_bus)
sfp_upstream_start(pl->sfp_bus);
if (pl->phydev)
@@ -929,6 +943,8 @@ void phylink_stop(struct phylink *pl)
phy_stop(pl->phydev);
if (pl->sfp_bus)
sfp_upstream_stop(pl->sfp_bus);
+   if (pl->link_an_mode == MLO_AN_FIXED && !IS_ERR(pl->link_gpio))
+   del_timer_sync(&pl->link_poll);
 
set_bit(PHYLINK_DISABLE_STOPPED, &pl->phylink_disable_state);
queue_work(system_power_efficient_wq, &pl->resolve);
-- 
2.17.0

[PATCH net-next 4/9] net: dsa: Add PHYLINK switch operations

2018-05-05 Thread Florian Fainelli

In preparation for adding support for PHYLINK within DSA, define a number of
operations that we will need and that switch drivers can start implementing.
Proper integration with PHYLINK will follow in subsequent patches.

We start selecting PHYLINK (which implies PHYLIB) in net/dsa/Kconfig
such that drivers can be guaranteed that this dependency is properly
taken care of and can start referencing PHYLINK helper functions without
requiring stubs or anything.

Signed-off-by: Florian Fainelli 
---
 include/net/dsa.h | 24 
 net/dsa/Kconfig   |  2 +-
 net/dsa/slave.c   |  5 +
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 462e9741b210..ed64c1f3f117 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -20,12 +20,14 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
 struct tc_action;
 struct phy_device;
 struct fixed_phy_status;
+struct phylink_link_state;
 
 enum dsa_tag_protocol {
DSA_TAG_PROTO_NONE = 0,
@@ -353,6 +355,27 @@ struct dsa_switch_ops {
void(*fixed_link_update)(struct dsa_switch *ds, int port,
struct fixed_phy_status *st);
 
+   /*
+* PHYLINK integration
+*/
+   void(*phylink_validate)(struct dsa_switch *ds, int port,
+   unsigned long *supported,
+   struct phylink_link_state *state);
+   int (*phylink_mac_link_state)(struct dsa_switch *ds, int port,
+ struct phylink_link_state *state);
+   void(*phylink_mac_config)(struct dsa_switch *ds, int port,
+ unsigned int mode,
+ const struct phylink_link_state *state);
+   void(*phylink_mac_an_restart)(struct dsa_switch *ds, int port);
+   void(*phylink_mac_link_down)(struct dsa_switch *ds, int port,
+unsigned int mode,
+phy_interface_t interface);
+   void(*phylink_mac_link_up)(struct dsa_switch *ds, int port,
+  unsigned int mode,
+  phy_interface_t interface,
+  struct phy_device *phydev);
+   void(*phylink_fixed_state)(struct dsa_switch *ds, int port,
+  struct phylink_link_state *state);
/*
 * ethtool hardware statistics.
 */
@@ -595,5 +618,6 @@ static inline int call_dsa_notifiers(unsigned long val, 
struct net_device *dev,
 int dsa_port_get_phy_strings(struct dsa_port *dp, uint8_t *data);
 int dsa_port_get_ethtool_phy_stats(struct dsa_port *dp, uint64_t *data);
 int dsa_port_get_phy_sset_count(struct dsa_port *dp);
+void dsa_port_phylink_mac_change(struct dsa_switch *ds, int port, bool up);
 
 #endif
diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
index bbf2c82cf7b2..4183e4ba27a5 100644
--- a/net/dsa/Kconfig
+++ b/net/dsa/Kconfig
@@ -9,7 +9,7 @@ config NET_DSA
depends on HAVE_NET_DSA && MAY_USE_DEVLINK
depends on BRIDGE || BRIDGE=n
select NET_SWITCHDEV
-   select PHYLIB
+   select PHYLINK
---help---
  Say Y if you want to enable support for the hardware switches 
supported
  by the Distributed Switch Architecture.
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index f3fb3a0880b1..e9be9458aa94 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1119,6 +1119,11 @@ static int dsa_slave_phy_connect(struct net_device 
*slave_dev, int addr)
  dsa_slave_adjust_link, p->phy_interface);
 }
 
+void dsa_port_phylink_mac_change(struct dsa_switch *ds, int port, bool up)
+{
+}
+EXPORT_SYMBOL_GPL(dsa_port_phylink_mac_change);
+
 static int dsa_slave_phy_setup(struct net_device *slave_dev)
 {
struct dsa_port *dp = dsa_slave_to_port(slave_dev);
-- 
2.17.0

[PATCH net-next 0/9] net: dsa: Plug in PHYLINK support

2018-05-05 Thread Florian Fainelli

Hi all,

This patch series adds PHYLINK support to DSA which is necessary to support more
complex PHY and pluggable modules setups.

Patch series can be found here:

https://github.com/ffainelli/linux/commits/dsa-phylink

This was tested on:

- dsa-loop
- bcm_sf2
- mv88e6xxx
- b53

With a variety of test cases:
- internal & external MDIO PHYs
- MoCA with link notification through interrupt/MMIO register
- built-in PHYs
- ifconfig up/down for several cycles works
- bind/unbind of the drivers

This is technically v2 of what was posted back in March 2018, changes from last
time:

- fixed probe/remove of drivers
- fixed missing gpiod_put() for link GPIOs
- fixed polling of link GPIOs (Russell I would need your SoB on the patch you
  provided offline initially, added some modifications to it)
- tested across a wider set of platforms

And everything should still work as expected. Please be aware of the following:

- switch drivers (like bcm_sf2) which may have user-facing network ports using
  fixed links would need to implement phylink_mac_ops to remain functional.
  PHYLINK does not create a phy_device for fixed links, therefore our
  call to adjust_link() from phylink_mac_link_{up,down} would not be calling
  into the driver. This *should not* affect CPU/DSA ports which are configured
  through adjust_link() but have no network devices

- support for SFP/SFF is now possible, but switch drivers will still need some
  modifications to properly support those, including, but not limited to using
  the correct binding information. This will be submitted on top of this series

Please do test on your respective platforms/switches and let me know if you
find any issues, hopefully everything still works like before.

Thank you!

Florian Fainelli (7):
  net: phy: phylink: Use gpiod_get_value_cansleep()
  net: phy: phylink: Release link GPIO
  net: dsa: Add PHYLINK switch operations
  net: dsa: bcm_sf2: Implement phylink_mac_ops
  net: dsa: Eliminate dsa_slave_get_link()
  net: dsa: Plug in PHYLINK support
  net: dsa: bcm_sf2: Get rid of PHYLIB functions

Russell King (2):
  net: phy: phylink: Poll link GPIOs
  net: dsa: mv88e6xxx: add PHYLINK support

 drivers/net/dsa/bcm_sf2.c| 191 
 drivers/net/dsa/mv88e6xxx/chip.c |  81 +
 drivers/net/dsa/mv88e6xxx/port.c |  39 
 drivers/net/dsa/mv88e6xxx/port.h |   3 +
 drivers/net/phy/phylink.c|  20 ++-
 include/net/dsa.h|  25 +++
 net/dsa/Kconfig  |   2 +-
 net/dsa/dsa_priv.h   |   9 -
 net/dsa/slave.c  | 293 ++-
 9 files changed, 456 insertions(+), 207 deletions(-)

-- 
2.17.0

[PATCH net-next 9/9] net: dsa: bcm_sf2: Get rid of PHYLIB functions

2018-05-05 Thread Florian Fainelli

Now that we have converted the bcm_sf2 driver to implement PHYLINK MAC
operations, we can remove the PHYLIB callbacks: adjust_link() and
fixed_link_update() which are no longer called by DSA.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2.c | 149 --
 1 file changed, 149 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index a20608b0329e..ac621f44237a 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -480,153 +480,6 @@ static u32 bcm_sf2_sw_get_phy_flags(struct dsa_switch 
*ds, int port)
return priv->hw_params.gphy_rev;
 }
 
-static void bcm_sf2_sw_adjust_link(struct dsa_switch *ds, int port,
-  struct phy_device *phydev)
-{
-   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
-   struct ethtool_eee *p = &priv->dev->ports[port].eee;
-   u32 id_mode_dis = 0, port_mode;
-   const char *str = NULL;
-   u32 reg, offset;
-
-   if (priv->type == BCM7445_DEVICE_ID)
-   offset = CORE_STS_OVERRIDE_GMIIP_PORT(port);
-   else
-   offset = CORE_STS_OVERRIDE_GMIIP2_PORT(port);
-
-   switch (phydev->interface) {
-   case PHY_INTERFACE_MODE_RGMII:
-   str = "RGMII (no delay)";
-   id_mode_dis = 1;
-   case PHY_INTERFACE_MODE_RGMII_TXID:
-   if (!str)
-   str = "RGMII (TX delay)";
-   port_mode = EXT_GPHY;
-   break;
-   case PHY_INTERFACE_MODE_MII:
-   str = "MII";
-   port_mode = EXT_EPHY;
-   break;
-   case PHY_INTERFACE_MODE_REVMII:
-   str = "Reverse MII";
-   port_mode = EXT_REVMII;
-   break;
-   default:
-   /* All other PHYs: internal and MoCA */
-   goto force_link;
-   }
-
-   /* If the link is down, just disable the interface to conserve power */
-   if (!phydev->link) {
-   reg = reg_readl(priv, REG_RGMII_CNTRL_P(port));
-   reg &= ~RGMII_MODE_EN;
-   reg_writel(priv, reg, REG_RGMII_CNTRL_P(port));
-   goto force_link;
-   }
-
-   /* Clear id_mode_dis bit, and the existing port mode, but
-* make sure we enable the RGMII block for data to pass
-*/
-   reg = reg_readl(priv, REG_RGMII_CNTRL_P(port));
-   reg &= ~ID_MODE_DIS;
-   reg &= ~(PORT_MODE_MASK << PORT_MODE_SHIFT);
-   reg &= ~(RX_PAUSE_EN | TX_PAUSE_EN);
-
-   reg |= port_mode | RGMII_MODE_EN;
-   if (id_mode_dis)
-   reg |= ID_MODE_DIS;
-
-   if (phydev->pause) {
-   if (phydev->asym_pause)
-   reg |= TX_PAUSE_EN;
-   reg |= RX_PAUSE_EN;
-   }
-
-   reg_writel(priv, reg, REG_RGMII_CNTRL_P(port));
-
-   pr_info("Port %d configured for %s\n", port, str);
-
-force_link:
-   /* Force link settings detected from the PHY */
-   reg = SW_OVERRIDE;
-   switch (phydev->speed) {
-   case SPEED_1000:
-   reg |= SPDSTS_1000 << SPEED_SHIFT;
-   break;
-   case SPEED_100:
-   reg |= SPDSTS_100 << SPEED_SHIFT;
-   break;
-   }
-
-   if (phydev->link)
-   reg |= LINK_STS;
-   if (phydev->duplex == DUPLEX_FULL)
-   reg |= DUPLX_MODE;
-
-   core_writel(priv, reg, offset);
-
-   if (!phydev->is_pseudo_fixed_link)
-   p->eee_enabled = b53_eee_init(ds, port, phydev);
-}
-
-static void bcm_sf2_sw_fixed_link_update(struct dsa_switch *ds, int port,
-struct fixed_phy_status *status)
-{
-   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
-   u32 duplex, pause, offset;
-   u32 reg;
-
-   if (priv->type == BCM7445_DEVICE_ID)
-   offset = CORE_STS_OVERRIDE_GMIIP_PORT(port);
-   else
-   offset = CORE_STS_OVERRIDE_GMIIP2_PORT(port);
-
-   duplex = core_readl(priv, CORE_DUPSTS);
-   pause = core_readl(priv, CORE_PAUSESTS);
-
-   status->link = 0;
-
-   /* MoCA port is special as we do not get link status from CORE_LNKSTS,
-* which means that we need to force the link at the port override
-* level to get the data to flow. We do use what the interrupt handler
-* did determine before.
-*
-* For the other ports, we just force the link status, since this is
-* a fixed PHY device.
-*/
-   if (port == priv->moca_port) {
-   status->link = priv->port_sts[port].link;
-   /* For MoCA interfaces, also force a link down notification
-* since some version of the user-space daemon (mocad) use
-* cmd->autoneg to force the link, which messes up the PHY
-* state machine and make it go in PHY_FORCING state instead.
-*/
-   if (!status->link)
-

[PATCH net-next 7/9] net: dsa: mv88e6xxx: add PHYLINK support

2018-05-05 Thread Florian Fainelli

From: Russell King 

Add rudimentary phylink support to mv88e6xxx. This allows the driver
using user ports with fixed links to keep operating normally. User ports
with normal PHYs are not affected since the switch automatically manages
their link parameters. User facing ports which use a SFP/SFF with a
non-fixed link mode might require a call to phylink_mac_change() to
operate properly.

Signed-off-by: Russell King 
[florian: expand commit message]
Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 81 
 drivers/net/dsa/mv88e6xxx/port.c | 39 +++
 drivers/net/dsa/mv88e6xxx/port.h |  3 ++
 3 files changed, 123 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 9d62e4acc01b..f4e8a3469386 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "chip.h"
@@ -580,6 +581,81 @@ static void mv88e6xxx_adjust_link(struct dsa_switch *ds, 
int port,
dev_err(ds->dev, "p%d: failed to configure MAC\n", port);
 }
 
+static void mv88e6xxx_validate(struct dsa_switch *ds, int port,
+  unsigned long *supported,
+  struct phylink_link_state *state)
+{
+}
+
+static int mv88e6xxx_link_state(struct dsa_switch *ds, int port,
+   struct phylink_link_state *state)
+{
+   struct mv88e6xxx_chip *chip = ds->priv;
+   int err;
+
+   mutex_lock(&chip->reg_lock);
+   err = mv88e6xxx_port_link_state(chip, port, state);
+   mutex_unlock(&chip->reg_lock);
+
+   return err;
+}
+
+static void mv88e6xxx_mac_config(struct dsa_switch *ds, int port,
+unsigned int mode,
+const struct phylink_link_state *state)
+{
+   struct mv88e6xxx_chip *chip = ds->priv;
+   int speed, duplex, err;
+
+   if (mode == MLO_AN_PHY)
+   return;
+
+   if (mode != MLO_AN_INBAND) {
+   speed = state->speed;
+   duplex = state->duplex;
+   } else {
+   speed = SPEED_UNFORCED;
+   duplex = DUPLEX_UNFORCED;
+   }
+
+   mutex_lock(&chip->reg_lock);
+   err = mv88e6xxx_port_setup_mac(chip, port, LINK_UNFORCED, speed, duplex,
+  state->interface);
+   mutex_unlock(&chip->reg_lock);
+
+   if (err && err != -EOPNOTSUPP)
+   dev_err(ds->dev, "p%d: failed to configure MAC\n", port);
+}
+
+static void mv88e6xxx_mac_link_force(struct dsa_switch *ds, int port, int link)
+{
+   struct mv88e6xxx_chip *chip = ds->priv;
+   int err;
+
+   mutex_lock(&chip->reg_lock);
+   err = chip->info->ops->port_set_link(chip, port, link);
+   mutex_unlock(&chip->reg_lock);
+
+   if (err)
+   dev_err(chip->dev, "p%d: failed to force MAC link\n", port);
+}
+
+static void mv88e6xxx_mac_link_down(struct dsa_switch *ds, int port,
+   unsigned int mode,
+   phy_interface_t interface)
+{
+   if (mode == MLO_AN_FIXED)
+   mv88e6xxx_mac_link_force(ds, port, LINK_FORCED_DOWN);
+}
+
+static void mv88e6xxx_mac_link_up(struct dsa_switch *ds, int port,
+ unsigned int mode, phy_interface_t interface,
+ struct phy_device *phydev)
+{
+   if (mode == MLO_AN_FIXED)
+   mv88e6xxx_mac_link_force(ds, port, LINK_FORCED_UP);
+}
+
 static int mv88e6xxx_stats_snapshot(struct mv88e6xxx_chip *chip, int port)
 {
if (!chip->info->ops->stats_snapshot)
@@ -4138,6 +4214,11 @@ static const struct dsa_switch_ops mv88e6xxx_switch_ops 
= {
.get_tag_protocol   = mv88e6xxx_get_tag_protocol,
.setup  = mv88e6xxx_setup,
.adjust_link= mv88e6xxx_adjust_link,
+   .phylink_validate   = mv88e6xxx_validate,
+   .phylink_mac_link_state = mv88e6xxx_link_state,
+   .phylink_mac_config = mv88e6xxx_mac_config,
+   .phylink_mac_link_down  = mv88e6xxx_mac_link_down,
+   .phylink_mac_link_up= mv88e6xxx_mac_link_up,
.get_strings= mv88e6xxx_get_strings,
.get_ethtool_stats  = mv88e6xxx_get_ethtool_stats,
.get_sset_count = mv88e6xxx_get_sset_count,
diff --git a/drivers/net/dsa/mv88e6xxx/port.c b/drivers/net/dsa/mv88e6xxx/port.c
index 6315774d72b3..429d0ebcd5b1 100644
--- a/drivers/net/dsa/mv88e6xxx/port.c
+++ b/drivers/net/dsa/mv88e6xxx/port.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "chip.h"
 #include "port.h"
@@ -378,6 +379,44 @@ int mv88e6xxx_port_get_cmode(struct mv88e6xxx_chip *chip, 
int port, u8 *cmode)
return 0;
 }
 
+int mv88e6xxx_port_link_state(struct mv88e6xxx_chip *chip, int port,
+ struct ph

[PATCH net-next 6/9] net: dsa: Eliminate dsa_slave_get_link()

2018-05-05 Thread Florian Fainelli

Since we use PHYLIB to manage the per-port link indication, this will
also be reflected correctly in the network device's carrier state, so we
can use ethtool_op_get_link() instead.

Signed-off-by: Florian Fainelli 
---
 net/dsa/slave.c | 12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index e9be9458aa94..e78894477598 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -498,16 +498,6 @@ dsa_slave_get_regs(struct net_device *dev, struct 
ethtool_regs *regs, void *_p)
ds->ops->get_regs(ds, dp->index, regs, _p);
 }
 
-static u32 dsa_slave_get_link(struct net_device *dev)
-{
-   if (!dev->phydev)
-   return -ENODEV;
-
-   genphy_update_link(dev->phydev);
-
-   return dev->phydev->link;
-}
-
 static int dsa_slave_get_eeprom_len(struct net_device *dev)
 {
struct dsa_port *dp = dsa_slave_to_port(dev);
@@ -982,7 +972,7 @@ static const struct ethtool_ops dsa_slave_ethtool_ops = {
.get_regs_len   = dsa_slave_get_regs_len,
.get_regs   = dsa_slave_get_regs,
.nway_reset = phy_ethtool_nway_reset,
-   .get_link   = dsa_slave_get_link,
+   .get_link   = ethtool_op_get_link,
.get_eeprom_len = dsa_slave_get_eeprom_len,
.get_eeprom = dsa_slave_get_eeprom,
.set_eeprom = dsa_slave_set_eeprom,
-- 
2.17.0

[PATCH net-next 2/9] net: phy: phylink: Release link GPIO

2018-05-05 Thread Florian Fainelli

We are not releasing the link GPIO descriptor with gpiod_put() which results in
subsequent probing to get -EBUSY when calling fwnode_get_named_gpiod(). Fix this
by doing the release in phylink_destroy().

Fixes: 9525ae83959b ("phylink: add phylink infrastructure")
Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/phylink.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 412d1cf4fa66..6392b5248cf5 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -612,6 +612,8 @@ void phylink_destroy(struct phylink *pl)
 {
if (pl->sfp_bus)
sfp_unregister_upstream(pl->sfp_bus);
+   if (!IS_ERR(pl->link_gpio))
+   gpiod_put(pl->link_gpio);
 
cancel_work_sync(&pl->resolve);
kfree(pl);
-- 
2.17.0

[PATCH net-next 5/9] net: dsa: bcm_sf2: Implement phylink_mac_ops

2018-05-05 Thread Florian Fainelli

Make the bcm_sf2 driver implement phylink_mac_ops since it needs to
support a wide variety of network interfaces: internal & external MDIO
PHYs, fixed PHYs, MoCA with MMIO link status.

A large amount of what needs to be done already exists under
bcm_sf2_sw_adjust_link() so we are essentially breaking this down into
the necessary operation for PHYLINK to work: mac_config, mac_link_up,
mac_link_down and validate. We can now entirely get rid of most of what
fixed_link_update() provided because only the link information is actually
necessary. We still have to force DUPLEX_FULL for legacy Device Tree bindings
that did not specify that before.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2.c | 214 --
 1 file changed, 206 insertions(+), 8 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 97236cfcbae4..a20608b0329e 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -306,7 +307,8 @@ static int bcm_sf2_sw_mdio_write(struct mii_bus *bus, int 
addr, int regnum,
 
 static irqreturn_t bcm_sf2_switch_0_isr(int irq, void *dev_id)
 {
-   struct bcm_sf2_priv *priv = dev_id;
+   struct dsa_switch *ds = dev_id;
+   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
 
priv->irq0_stat = intrl2_0_readl(priv, INTRL2_CPU_STATUS) &
~priv->irq0_mask;
@@ -317,16 +319,21 @@ static irqreturn_t bcm_sf2_switch_0_isr(int irq, void 
*dev_id)
 
 static irqreturn_t bcm_sf2_switch_1_isr(int irq, void *dev_id)
 {
-   struct bcm_sf2_priv *priv = dev_id;
+   struct dsa_switch *ds = dev_id;
+   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
 
priv->irq1_stat = intrl2_1_readl(priv, INTRL2_CPU_STATUS) &
~priv->irq1_mask;
intrl2_1_writel(priv, priv->irq1_stat, INTRL2_CPU_CLEAR);
 
-   if (priv->irq1_stat & P_LINK_UP_IRQ(P7_IRQ_OFF))
-   priv->port_sts[7].link = 1;
-   if (priv->irq1_stat & P_LINK_DOWN_IRQ(P7_IRQ_OFF))
-   priv->port_sts[7].link = 0;
+   if (priv->irq1_stat & P_LINK_UP_IRQ(P7_IRQ_OFF)) {
+   priv->port_sts[7].link = true;
+   dsa_port_phylink_mac_change(ds, 7, true);
+   }
+   if (priv->irq1_stat & P_LINK_DOWN_IRQ(P7_IRQ_OFF)) {
+   priv->port_sts[7].link = false;
+   dsa_port_phylink_mac_change(ds, 7, false);
+   }
 
return IRQ_HANDLED;
 }
@@ -620,6 +627,192 @@ static void bcm_sf2_sw_fixed_link_update(struct 
dsa_switch *ds, int port,
status->pause = 1;
 }
 
+static void bcm_sf2_sw_validate(struct dsa_switch *ds, int port,
+   unsigned long *supported,
+   struct phylink_link_state *state)
+{
+   __ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
+
+   if (!phy_interface_mode_is_rgmii(state->interface) &&
+   state->interface != PHY_INTERFACE_MODE_MII &&
+   state->interface != PHY_INTERFACE_MODE_REVMII &&
+   state->interface != PHY_INTERFACE_MODE_GMII &&
+   state->interface != PHY_INTERFACE_MODE_INTERNAL &&
+   state->interface != PHY_INTERFACE_MODE_MOCA) {
+   bitmap_zero(supported, __ETHTOOL_LINK_MODE_MASK_NBITS);
+   dev_err(ds->dev,
+   "Unsupported interface: %d\n", state->interface);
+   return;
+   }
+
+   /* Allow all the expected bits */
+   phylink_set(mask, Autoneg);
+   phylink_set_port_modes(mask);
+   phylink_set(mask, Pause);
+   phylink_set(mask, Asym_Pause);
+
+   /* With the exclusion of MII and Reverse MII, we support Gigabit,
+* including Half duplex
+*/
+   if (state->interface != PHY_INTERFACE_MODE_MII &&
+   state->interface != PHY_INTERFACE_MODE_REVMII) {
+   phylink_set(mask, 1000baseT_Full);
+   phylink_set(mask, 1000baseT_Half);
+   }
+
+   phylink_set(mask, 10baseT_Half);
+   phylink_set(mask, 10baseT_Full);
+   phylink_set(mask, 100baseT_Half);
+   phylink_set(mask, 100baseT_Full);
+
+   bitmap_and(supported, supported, mask,
+  __ETHTOOL_LINK_MODE_MASK_NBITS);
+   bitmap_and(state->advertising, state->advertising, mask,
+  __ETHTOOL_LINK_MODE_MASK_NBITS);
+}
+
+static void bcm_sf2_sw_mac_config(struct dsa_switch *ds, int port,
+ unsigned int mode,
+ const struct phylink_link_state *state)
+{
+   struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
+   u32 id_mode_dis = 0, port_mode;
+   u32 reg, offset;
+
+   if (priv->type == BCM7445_DEVICE_ID)
+   offset = CORE_STS_OVERRIDE_GMIIP_PORT(port);
+   else
+   offset = CORE_STS_OVERRIDE_GMIIP2_PORT(port);
+
+   switch (state

[PATCH net-next 8/9] net: dsa: Plug in PHYLINK support

2018-05-05 Thread Florian Fainelli

Add support for PHYLINK within the DSA subsystem in order to support more
complex devices such as pluggable (SFP) and non-pluggable (SFF) modules, 10G
PHYs, and traditional PHYs. Using PHYLINK allows us to drop some amount of
complexity we had while probing fixed and non-fixed PHYs using Device Tree.

Because PHYLINK separates the Ethernet MAC/port configuration into different
stages, we let switch drivers implement those, and for now, we maintain
functionality by calling dsa_slave_adjust_link() during
phylink_mac_link_{up,down} which provides semantically equivalent steps.

Drivers willing to take advantage of PHYLINK should implement the phylink_mac_*
operations that DSA wraps.

We cannot quite remove the adjust_link() callback just yet, because a number of
drivers rely on that for configuring their "CPU" and "DSA" ports, this is done
dsa_port_setup_phy_of() and dsa_port_fixed_link_register_of() still.

Drivers that utilize fixed links for user-facing ports (e.g: bcm_sf2) will need
to implement phylink_mac_ops from now on to preserve functionality, since 
PHYLINK
*does not* create a phy_device instance for fixed links.

Signed-off-by: Florian Fainelli 
---
 include/net/dsa.h  |   1 +
 net/dsa/dsa_priv.h |   9 --
 net/dsa/slave.c| 294 ++---
 3 files changed, 172 insertions(+), 132 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index ed64c1f3f117..fdbd6082945d 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -201,6 +201,7 @@ struct dsa_port {
u8  stp_state;
struct net_device   *bridge_dev;
struct devlink_port devlink_port;
+   struct phylink  *pl;
/*
 * Original copy of the master netdev ethtool_ops
 */
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 053731473c99..3964c6f7a7c0 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -75,15 +75,6 @@ struct dsa_slave_priv {
/* DSA port data, such as switch, port index, etc. */
struct dsa_port *dp;
 
-   /*
-* The phylib phy_device pointer for the PHY connected
-* to this port.
-*/
-   phy_interface_t phy_interface;
-   int old_link;
-   int old_pause;
-   int old_duplex;
-
 #ifdef CONFIG_NET_POLL_CONTROLLER
struct netpoll  *netpoll;
 #endif
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index e78894477598..4a9abc66132f 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -97,8 +98,7 @@ static int dsa_slave_open(struct net_device *dev)
if (err)
goto clear_promisc;
 
-   if (dev->phydev)
-   phy_start(dev->phydev);
+   phylink_start(dp->pl);
 
return 0;
 
@@ -120,8 +120,7 @@ static int dsa_slave_close(struct net_device *dev)
struct net_device *master = dsa_slave_to_master(dev);
struct dsa_port *dp = dsa_slave_to_port(dev);
 
-   if (dev->phydev)
-   phy_stop(dev->phydev);
+   phylink_stop(dp->pl);
 
dsa_port_disable(dp, dev->phydev);
 
@@ -272,10 +271,7 @@ static int dsa_slave_ioctl(struct net_device *dev, struct 
ifreq *ifr, int cmd)
break;
}
 
-   if (!dev->phydev)
-   return -ENODEV;
-
-   return phy_mii_ioctl(dev->phydev, ifr, cmd);
+   return phylink_mii_ioctl(p->dp->pl, ifr, cmd);
 }
 
 static int dsa_slave_port_attr_set(struct net_device *dev,
@@ -498,6 +494,13 @@ dsa_slave_get_regs(struct net_device *dev, struct 
ethtool_regs *regs, void *_p)
ds->ops->get_regs(ds, dp->index, regs, _p);
 }
 
+static int dsa_slave_nway_reset(struct net_device *dev)
+{
+   struct dsa_port *dp = dsa_slave_to_port(dev);
+
+   return phylink_ethtool_nway_reset(dp->pl);
+}
+
 static int dsa_slave_get_eeprom_len(struct net_device *dev)
 {
struct dsa_port *dp = dsa_slave_to_port(dev);
@@ -609,6 +612,8 @@ static void dsa_slave_get_wol(struct net_device *dev, 
struct ethtool_wolinfo *w)
struct dsa_port *dp = dsa_slave_to_port(dev);
struct dsa_switch *ds = dp->ds;
 
+   phylink_ethtool_get_wol(dp->pl, w);
+
if (ds->ops->get_wol)
ds->ops->get_wol(ds, dp->index, w);
 }
@@ -619,6 +624,8 @@ static int dsa_slave_set_wol(struct net_device *dev, struct 
ethtool_wolinfo *w)
struct dsa_switch *ds = dp->ds;
int ret = -EOPNOTSUPP;
 
+   phylink_ethtool_set_wol(dp->pl, w);
+
if (ds->ops->set_wol)
ret = ds->ops->set_wol(ds, dp->index, w);
 
@@ -642,13 +649,7 @@ static int dsa_slave_set_eee(struct net_device *dev, 
struct ethtool_eee *e)
if (ret)
return ret;
 
-   if (e->eee_enabled) {
-   ret = phy_init_eee(dev->phydev, 0);
-   if (ret)
-   return ret

[PATCH] isdn: eicon: fix a missing-check bug

2018-05-05 Thread Wenwen Wang

In divasmain.c, the function divas_write() firstly invokes the function
diva_xdi_open_adapter() to open the adapter that matches with the adapter
number provided by the user, and then invokes the function diva_xdi_write()
to perform the write operation using the matched adapter. The two functions
diva_xdi_open_adapter() and diva_xdi_write() are located in diva.c.

In diva_xdi_open_adapter(), the user command is copied to the object 'msg'
from the userspace pointer 'src' through the function pointer 'cp_fn',
which eventually calls copy_from_user() to do the copy. Then, the adapter
number 'msg.adapter' is used to find out a matched adapter from the
'adapter_queue'. A matched adapter will be returned if it is found.
Otherwise, NULL is returned to indicate the failure of the verification on
the adapter number.

As mentioned above, if a matched adapter is returned, the function
diva_xdi_write() is invoked to perform the write operation. In this
function, the user command is copied once again from the userspace pointer
'src', which is the same as the 'src' pointer in diva_xdi_open_adapter() as
both of them are from the 'buf' pointer in divas_write(). Similarly, the
copy is achieved through the function pointer 'cp_fn', which finally calls
copy_from_user(). After the successful copy, the corresponding command
processing handler of the matched adapter is invoked to perform the write
operation.

It is obvious that there are two copies here from userspace, one is in
diva_xdi_open_adapter(), and one is in diva_xdi_write(). Plus, both of
these two copies share the same source userspace pointer, i.e., the 'buf'
pointer in divas_write(). Given that a malicious userspace process can race
to change the content pointed by the 'buf' pointer, this can pose potential
security issues. For example, in the first copy, the user provides a valid
adapter number to pass the verification process and a valid adapter can be
found. Then the user can modify the adapter number to an invalid number.
This way, the user can bypass the verification process of the adapter
number and inject inconsistent data.

To avoid such issues, this patch adds a check after the second copy in the
function diva_xdi_write(). If the adapter number is not equal to the one
obtained in the first copy, (-4) will be returned to divas_write(), which
will then return an error code -EINVAL.

Signed-off-by: Wenwen Wang 
---
 drivers/isdn/hardware/eicon/diva.c  | 6 +-
 drivers/isdn/hardware/eicon/divasmain.c | 3 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/isdn/hardware/eicon/diva.c 
b/drivers/isdn/hardware/eicon/diva.c
index 944a7f3..46cbf76 100644
--- a/drivers/isdn/hardware/eicon/diva.c
+++ b/drivers/isdn/hardware/eicon/diva.c
@@ -440,6 +440,7 @@ diva_xdi_write(void *adapter, void *os_handle, const void 
__user *src,
   int length, divas_xdi_copy_from_user_fn_t cp_fn)
 {
diva_os_xdi_adapter_t *a = (diva_os_xdi_adapter_t *) adapter;
+   diva_xdi_um_cfg_cmd_t *p;
void *data;
 
if (a->xdi_mbox.status & DIVA_XDI_MBOX_BUSY) {
@@ -461,7 +462,10 @@ diva_xdi_write(void *adapter, void *os_handle, const void 
__user *src,
 
length = (*cp_fn) (os_handle, data, src, length);
if (length > 0) {
-   if ((*(a->interface.cmd_proc))
+   p = (diva_xdi_um_cfg_cmd_t *) data;
+   if (a->controller != (int)p->adapter) {
+   length = -4;
+   } else if ((*(a->interface.cmd_proc))
(a, (diva_xdi_um_cfg_cmd_t *) data, length)) {
length = -3;
}
diff --git a/drivers/isdn/hardware/eicon/divasmain.c 
b/drivers/isdn/hardware/eicon/divasmain.c
index b9980e8..a03c658 100644
--- a/drivers/isdn/hardware/eicon/divasmain.c
+++ b/drivers/isdn/hardware/eicon/divasmain.c
@@ -614,6 +614,9 @@ static ssize_t divas_write(struct file *file, const char 
__user *buf,
case -3:
ret = -ENXIO;
break;
+   case -4:
+   ret = -EINVAL;
+   break;
}
DBG_TRC(("write: ret %d", ret));
return (ret);
-- 
2.7.4

Re: [PATCH v5 0/6] firmware_loader: cleanups for v4.18

2018-05-05 Thread Krzysztof Halasa

"Luis R. Rodriguez"  writes:

> So we can nuke CONFIG_WANXL_BUILD_FIRMWARE now?

I'm uncertain I understand why do you want it, or maybe what are you
trying to do at all.

And what use would wanxlfw.S (the assembly source) have if the option is
removed?

>> It's more about delivering the .S source for the firmware, I guess.
>> Nobody is expected to build it. The fw is about 2.5 KB and is directly
>> linked with the driver.
>
> :P Future work I guess would be to just use the firmware API and stuff
> it into linux-firmware?

Who's going to make it happen?
The last time I checked (several years ago), wanXL worked. Who's going
to test it after the change?

I assume linux-firmware could include fw source and there would be means
to build the binary.

Just to be sure: the wanXL firmware has exactly nothing to do with FW
loader, nothing depends on it (nor the other way around), it's just
(with the rest of the wanXL code) an old piece of a driver for an old
card.

The question is, what do we gain by messing with it?
-- 
Krzysztof Halasa

Re: [PATCH net-next] net: phy: sfp: handle cases where neither BR,min nor BR,max is given

2018-05-05 Thread Florian Fainelli

On May 4, 2018 8:21:03 AM PDT, Antoine Tenart  
wrote:
>When computing the bitrate using values read from an SFP module EEPROM,
>we use the nominal BR plus BR,min and BR,max to determine the
>boundaries. But in some cases BR,min and BR,max aren't provided, which
>led the SFP code to end up having the nominal value for both the
>minimum
>and maximum bitrate values. When using a passive cable, the nominal
>value should be used as the maximum one, and there is no minimum one
>so we should use 0.
>
>Signed-off-by: Antoine Tenart 
>---
>
>Hi Russell,
>
>I'm not completely sure about this patch as this case is not really
>specified in the specification. But the issue is there, and I've
>discuss
>this with others. It seemed logical (at least to us :)) to use the
>BR,nominal values as br_max and 0 as br_min when using a passive cable
>which only provides BR,nominal as this would be the highest rate at
>which the cable could work. And because it's passive, there should be
>no
>issues using it at a lower rate.
>
>I've tested this with one passive cable which only reports its
>BR,nominal (which was 10300) while it could be used when using
>1000baseX
>or 2500baseX modes.

Which SFP modules (vendor and model) exposed this out of curiosity? Russell and 
I already saw the Cotsworks modules having so e issues with checksums, so 
building a table of quirks would help. Thanks!

-- 
Florian

Re: [PATCH net-next v2 03/13] net: phy: sfp: warn the user when no tx_disable pin is available

2018-05-05 Thread Florian Fainelli

On May 4, 2018 10:14:25 AM PDT, Andrew Lunn  wrote:
>On Fri, May 04, 2018 at 10:07:53AM -0700, Florian Fainelli wrote:
>> On 05/04/2018 06:56 AM, Antoine Tenart wrote:
>> > In case no Tx disable pin is available the SFP modules will always
>be
>> > emitting. This could be an issue when using modules using laser as
>their
>> > light source as we would have no way to disable it when the fiber
>is
>> > removed. This patch adds a warning when registering an SFP cage
>which do
>> > not have its tx_disable pin wired or available.
>> 
>> Is this something that was done in a possibly earlier revision of a
>> given board design and which was finally fixed? Nothing wrong with
>the
>> patch, but this seems like a pretty serious board design mistake,
>that
>> needs to be addressed.
>
>Hi Florian
>
>Zii Devel B is like this. Only the "Signal Detect" pin is wired to a
>GPIO.

Good point, indeed. BTW what do you think about exposing the SFF's EEPROM and 
diagnostics through the standard ethtool operations even if we have to keep the 
description of the SFF as a fixed link in Device Tree because of the 
unfortunate wiring?

-- 
Florian

Re: [PATCH net-next v2 03/13] net: phy: sfp: warn the user when no tx_disable pin is available

2018-05-05 Thread Andrew Lunn

On Sat, May 05, 2018 at 01:38:31PM -0700, Florian Fainelli wrote:
> On May 4, 2018 10:14:25 AM PDT, Andrew Lunn  wrote:
> >On Fri, May 04, 2018 at 10:07:53AM -0700, Florian Fainelli wrote:
> >> On 05/04/2018 06:56 AM, Antoine Tenart wrote:
> >> > In case no Tx disable pin is available the SFP modules will always
> >be
> >> > emitting. This could be an issue when using modules using laser as
> >their
> >> > light source as we would have no way to disable it when the fiber
> >is
> >> > removed. This patch adds a warning when registering an SFP cage
> >which do
> >> > not have its tx_disable pin wired or available.
> >> 
> >> Is this something that was done in a possibly earlier revision of a
> >> given board design and which was finally fixed? Nothing wrong with
> >the
> >> patch, but this seems like a pretty serious board design mistake,
> >that
> >> needs to be addressed.
> >
> >Hi Florian
> >
> >Zii Devel B is like this. Only the "Signal Detect" pin is wired to a
> >GPIO.
> 

> Good point, indeed. BTW what do you think about exposing the SFF's
> EEPROM and diagnostics through the standard ethtool operations even
> if we have to keep the description of the SFF as a fixed link in
> Device Tree because of the unfortunate wiring?

I believe in Antoine case, all the control plane is broken. He cannot
read the EEPROM, nor any of the modules pins via GPIOs.

For Zii Devel B, the EEPROM is accessible, and so is the SD pin. What
is missing is transmit disable. So i would expose it as an SFF module.

   Andrew

[RFC PATCH 3/3] arcnet: com20020: Add ethtool support

2018-05-05 Thread Andrea Greco

From: Andrea Greco 

Setup ethtols for export com20020 diag register

Signed-off-by: Andrea Greco 
---
 drivers/net/arcnet/com20020-isa.c|  1 +
 drivers/net/arcnet/com20020-membus.c |  1 +
 drivers/net/arcnet/com20020.c| 29 +
 drivers/net/arcnet/com20020.h|  1 +
 drivers/net/arcnet/com20020_cs.c |  1 +
 include/uapi/linux/if_arcnet.h   |  6 ++
 6 files changed, 39 insertions(+)

diff --git a/drivers/net/arcnet/com20020-isa.c 
b/drivers/net/arcnet/com20020-isa.c
index 38fa60ddaf2e..44ab6dcccb58 100644
--- a/drivers/net/arcnet/com20020-isa.c
+++ b/drivers/net/arcnet/com20020-isa.c
@@ -154,6 +154,7 @@ static int __init com20020_init(void)
dev->dev_addr[0] = node;
 
dev->netdev_ops = &com20020_netdev_ops;
+   dev->ethtool_ops = &com20020_ethtool_ops;
 
lp = netdev_priv(dev);
lp->backplane = backplane;
diff --git a/drivers/net/arcnet/com20020-membus.c 
b/drivers/net/arcnet/com20020-membus.c
index 6e4a2f3a84f7..9eead734a3cf 100644
--- a/drivers/net/arcnet/com20020-membus.c
+++ b/drivers/net/arcnet/com20020-membus.c
@@ -91,6 +91,7 @@ static int com20020_probe(struct platform_device *pdev)
 
dev = alloc_arcdev(NULL);// Let autoassign name arc%d
dev->netdev_ops = &com20020_netdev_ops;
+   dev->ethtool_ops = &com20020_ethtool_ops;
lp = netdev_priv(dev);
 
lp->card_flags = ARC_CAN_10MBIT;/* pretend all of them can 10Mbit */
diff --git a/drivers/net/arcnet/com20020.c b/drivers/net/arcnet/com20020.c
index abd32ed8ec9b..2089b45e81c8 100644
--- a/drivers/net/arcnet/com20020.c
+++ b/drivers/net/arcnet/com20020.c
@@ -201,6 +201,34 @@ const struct net_device_ops com20020_netdev_ops = {
.ndo_set_rx_mode = com20020_set_mc_list,
 };
 
+static int com20020_ethtool_regs_len(struct net_device *netdev)
+{
+   return sizeof(struct com20020_ethtool_regs);
+}
+
+static void com20020_ethtool_regs_read(struct net_device *dev,
+  struct ethtool_regs *regs, void *p)
+{
+   struct arcnet_local *lp;
+   struct com20020_ethtool_regs *com_reg;
+
+   lp = netdev_priv(dev);
+   memset(p, 0, sizeof(struct com20020_ethtool_regs));
+
+   regs->version = 1;
+
+   com_reg = p;
+
+   com_reg->status = lp->hw.status(dev) & 0xFF;
+   com_reg->diag_register = (lp->hw.status(dev) >> 8) & 0xFF;
+   com_reg->reconf_count = lp->num_recons;
+}
+
+const struct ethtool_ops com20020_ethtool_ops = {
+   .get_regs = com20020_ethtool_regs_read,
+   .get_regs_len  = com20020_ethtool_regs_len,
+};
+
 /* Set up the struct net_device associated with this card.  Called after
  * probing succeeds.
  */
@@ -402,6 +430,7 @@ static void com20020_set_mc_list(struct net_device *dev)
 EXPORT_SYMBOL(com20020_check);
 EXPORT_SYMBOL(com20020_found);
 EXPORT_SYMBOL(com20020_netdev_ops);
+EXPORT_SYMBOL(com20020_ethtool_ops);
 #endif
 
 MODULE_LICENSE("GPL");
diff --git a/drivers/net/arcnet/com20020.h b/drivers/net/arcnet/com20020.h
index 0bcc5d0a6903..a1024c8f8a1f 100644
--- a/drivers/net/arcnet/com20020.h
+++ b/drivers/net/arcnet/com20020.h
@@ -31,6 +31,7 @@
 int com20020_check(struct net_device *dev);
 int com20020_found(struct net_device *dev, int shared);
 extern const struct net_device_ops com20020_netdev_ops;
+extern const struct ethtool_ops com20020_ethtool_ops;
 
 /* The number of low I/O ports used by the card. */
 #define ARCNET_TOTAL_SIZE 8
diff --git a/drivers/net/arcnet/com20020_cs.c b/drivers/net/arcnet/com20020_cs.c
index cf607ffcf358..ae64f436fd54 100644
--- a/drivers/net/arcnet/com20020_cs.c
+++ b/drivers/net/arcnet/com20020_cs.c
@@ -233,6 +233,7 @@ static int com20020_config(struct pcmcia_device *link)
}
 
dev->irq = link->irq;
+   dev->ethtool_ops = &com20020_ethtool_ops;
 
ret = pcmcia_enable_device(link);
if (ret)
diff --git a/include/uapi/linux/if_arcnet.h b/include/uapi/linux/if_arcnet.h
index 683878036d76..790c0fa7386d 100644
--- a/include/uapi/linux/if_arcnet.h
+++ b/include/uapi/linux/if_arcnet.h
@@ -127,4 +127,10 @@ struct archdr {
} soft;
 };
 
+struct com20020_ethtool_regs {
+   __u8 status;
+   __u8 diag_register;
+   __u32 reconf_count;
+};
+
 #endif /* _LINUX_IF_ARCNET_H */
-- 
2.14.3

[RFC PATCH 1/3] arcnet: com20020: Add memory map of com20020

2018-05-05 Thread Andrea Greco

From: Andrea Greco 

Add support for com20022I/com20020, memory mapped chip version.
Support bus: Intel 80xx and Motorola 68xx.
Bus size: Only 8 bit bus size is supported.
Added related device tree bindings

Signed-off-by: Andrea Greco 
---
 .../devicetree/bindings/net/smsc-com20020.txt  |  23 +++
 drivers/net/arcnet/Kconfig |  12 +-
 drivers/net/arcnet/Makefile|   1 +
 drivers/net/arcnet/arcdevice.h |  27 ++-
 drivers/net/arcnet/com20020-membus.c   | 191 +
 drivers/net/arcnet/com20020.c  |   9 +-
 6 files changed, 253 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/smsc-com20020.txt
 create mode 100644 drivers/net/arcnet/com20020-membus.c

diff --git a/Documentation/devicetree/bindings/net/smsc-com20020.txt 
b/Documentation/devicetree/bindings/net/smsc-com20020.txt
new file mode 100644
index ..39c5b19c55af
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/smsc-com20020.txt
@@ -0,0 +1,23 @@
+SMSC com20020, com20022I
+
+timeout: Arcnet timeout, checkout datashet
+clockp: Clock Prescaler, checkout datashet
+clockm: Clock multiplier, checkout datasheet
+
+phy-reset-gpios: Chip reset ppin
+phy-irq-gpios: Chip irq pin
+
+com20020_A@0 {
+compatible = "smsc,com20020";
+
+   timeout = <0x3>;
+   backplane = <0x0>;
+
+   clockp = <0x0>;
+   clockm = <0x3>;
+
+   phy-reset-gpios = <&gpio3 21 GPIO_ACTIVE_LOW>;
+   phy-irq-gpios = <&gpio2 10 GPIO_ACTIVE_LOW>;
+
+   status = "okay";
+};
diff --git a/drivers/net/arcnet/Kconfig b/drivers/net/arcnet/Kconfig
index 39bd16f3f86d..d39faf45be1e 100644
--- a/drivers/net/arcnet/Kconfig
+++ b/drivers/net/arcnet/Kconfig
@@ -3,7 +3,7 @@
 #
 
 menuconfig ARCNET
-   depends on NETDEVICES && (ISA || PCI || PCMCIA)
+   depends on NETDEVICES
tristate "ARCnet support"
---help---
  If you have a network card of this type, say Y and check out the
@@ -129,5 +129,15 @@ config ARCNET_COM20020_CS
 
  To compile this driver as a module, choose M here: the module will be
  called com20020_cs.  If unsure, say N.
+config ARCNET_COM20020_MEMORY_BUS
+   bool "Support for COM20020 on external memory"
+   depends on ARCNET_COM20020 && !(ARCNET_COM20020_PCI || 
ARCNET_COM20020_ISA || ARCNET_COM20020_CS)
+   help
+ Say Y here if on your custom board mount com20020 or friends.
+
+ Com20022I support arcnet bus 10Mbitps.
+ This driver support only 8bit, and DMA is not supported is attached 
on your board at external interface bus.
+ Supported bus Intel80xx / Motorola 68xx.
+ This driver not work with other com20020 module: PCI or PCMCIA 
compiled as [M].
 
 endif # ARCNET
diff --git a/drivers/net/arcnet/Makefile b/drivers/net/arcnet/Makefile
index 53525e8ea130..19425c1e06f4 100644
--- a/drivers/net/arcnet/Makefile
+++ b/drivers/net/arcnet/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_ARCNET_COM20020) += com20020.o
 obj-$(CONFIG_ARCNET_COM20020_ISA) += com20020-isa.o
 obj-$(CONFIG_ARCNET_COM20020_PCI) += com20020-pci.o
 obj-$(CONFIG_ARCNET_COM20020_CS) += com20020_cs.o
+obj-$(CONFIG_ARCNET_COM20020_MEMORY_BUS) += com20020-membus.o
diff --git a/drivers/net/arcnet/arcdevice.h b/drivers/net/arcnet/arcdevice.h
index d09b2b46ab63..16c608269cca 100644
--- a/drivers/net/arcnet/arcdevice.h
+++ b/drivers/net/arcnet/arcdevice.h
@@ -201,7 +201,7 @@ struct ArcProto {
void (*rx)(struct net_device *dev, int bufnum,
   struct archdr *pkthdr, int length);
int (*build_header)(struct sk_buff *skb, struct net_device *dev,
-   unsigned short ethproto, uint8_t daddr);
+   unsigned short ethproto, uint8_t daddr);
 
/* these functions return '1' if the skb can now be freed */
int (*prepare_tx)(struct net_device *dev, struct archdr *pkt,
@@ -326,9 +326,9 @@ struct arcnet_local {
void (*recontrigger) (struct net_device * dev, int enable);
 
void (*copy_to_card)(struct net_device *dev, int bufnum,
-int offset, void *buf, int count);
+int offset, void *buf, int count);
void (*copy_from_card)(struct net_device *dev, int bufnum,
-  int offset, void *buf, int count);
+  int offset, void *buf, int count);
} hw;
 
void __iomem *mem_start;/* pointer to ioremap'ed MMIO */
@@ -360,7 +360,7 @@ struct net_device *alloc_arcdev(const char *name);
 int arcnet_open(struct net_device *dev);
 int arcnet_close(struct net_device *dev);
 netdev_tx_t arcnet_send_packet(struct sk_buff *skb,
-  struct net_device *dev);
+  struct net_device *dev);
 void arcnet_timeout(struct

[RFC PATCH 2/3] arcnet: com20020: Fixup missing SLOWARB bit

2018-05-05 Thread Andrea Greco

From: Andrea Greco 

If com20020 clock is major of 40Mhz SLOWARB bit is requested.

Signed-off-by: Andrea Greco 
---
 drivers/net/arcnet/com20020.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/arcnet/com20020.c b/drivers/net/arcnet/com20020.c
index f09ea77dd6a8..abd32ed8ec9b 100644
--- a/drivers/net/arcnet/com20020.c
+++ b/drivers/net/arcnet/com20020.c
@@ -102,6 +102,10 @@ int com20020_check(struct net_device *dev)
lp->setup = lp->clockm ? 0 : (lp->clockp << 1);
lp->setup2 = (lp->clockm << 4) | 8;
 
+   // If clock is major of 40Mhz, SLOWARB bit must be set
+   if (lp->clockm > 1)
+   lp->setup2 |= SLOWARB;
+
/* CHECK: should we do this for SOHARD cards ? */
/* Enable P1Mode for backplane mode */
lp->setup = lp->setup | P1MODE;
-- 
2.14.3

Re: [PATCH 2/8] rhashtable: remove nulls_base and related code.

2018-05-05 Thread NeilBrown

On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> This "feature" is unused, undocumented, and untested and so
>> doesn't really belong.  If a use for the nulls marker
>> is found, all this code would need to be reviewed to
>> ensure it works as required.  It would be just as easy to
>> just add the code if/when it is needed instead.
>> 
>> This patch actually fixes a bug too.  The table resizing allows a
>> table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
>> any growth beyond 2^27 is wasteful an ineffective.
>> 
>> This patch result in NULLS_MARKER(0) being used for all chains,
>> and leave the use of rht_is_a_null() to test for it.
>> 
>> Signed-off-by: NeilBrown 
>
> I disagree.  This is a fundamental requirement for the use of
> rhashtable in certain networking systems such as TCP/UDP.  So
> we know that there will be a use for this.

I can see no evidence that this is required for anything, as it isn't
use and I'm fairly sure that in it's current form - it cannot be used.

Based on my best guess about how you might intend to use it, I suspect
it would be simpler to store the address of the bucket head in the nuls
rather than the hash and a magic number.  This would make it just as
easy to detect when a search reaches the end of the wrong chain, which I
presume is the purpose.
I would find that useful myself - if the search would repeat when that
happened - as I could then use SLAB_TYPESAFE_BY_RCU.

Were we to take this approach, all the code I've removed here would
still need to be removed.

>
> As to the bug fix, please separate it out of the patch and resubmit.

I don't know how to do that.  I don't know what is safe to change
without "breaking" the nulls_base code because that code is undocumented and
unused, so unmaintainable.
In general the kernel has, I believe, a policy against keeping unused
interfaces.  While that isn't firm and universal, is seems to apply
particularly well to unusable interfaces.

Thanks,
NeilBrown

>
> Thanks,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

signature.asc
Description: PGP signature

Re: [PATCH 3/8] rhashtable: use cmpxchg() to protect ->future_tbl.

2018-05-05 Thread NeilBrown

On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> Rather than borrowing one of the bucket locks to
>> protect ->future_tbl updates, use cmpxchg().
>> This gives more freedom to change how bucket locking
>> is implemented.
>> 
>> Signed-off-by: NeilBrown 
>
> This looks nice.
>
>> -spin_unlock_bh(old_tbl->locks);
>> +rcu_assign_pointer(tmp, new_tbl);
>
> Do we need this barrier since cmpxchg is supposed to provide memory
> barrier semantics?

It's hard to find documentation even for what cmpxchg() is meant do, let
alone what barriers is provides, but there does seem to be something
hidden in Documentation/core-api/atomic_ops.rst which suggests full
barrier semantics if the comparison succeeds.  I'll replace the
rcu_assign_pointer with a comment saying why it isn't needed.

Thanks,
NeilBrown

>
>> +if (cmpxchg(&old_tbl->future_tbl, NULL, tmp) != NULL)
>> +return -EEXIST;
>
> Thanks,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

signature.asc
Description: PGP signature

Re: [PATCH 4/8] rhashtable: fix race in nested_table_alloc()

2018-05-05 Thread NeilBrown

On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> If two threads run nested_table_alloc() at the same time
>> they could both allocate a new table.
>> Best case is that one of them will never be freed, leaking memory.
>> Worst case is hat entry get stored there before it leaks,
>> and the are lost from the table.
>> 
>> So use cmpxchg to detect the race and free the unused table.
>> 
>> Fixes: da20420f83ea ("rhashtable: Add nested tables")
>> Cc: sta...@vger.kernel.org # 4.11+
>> Signed-off-by: NeilBrown 
>
> What about the spinlock that's meant to be held around this
> operation?

The spinlock protects 2 or more buckets.  The nested table contains at
least 512 buckets, maybe more.
It is quite possible for two insertions into 2 different buckets to both
get their spinlock and both try to instantiate the same nested table.

Thanks,
NeilBrown


signature.asc
Description: PGP signature

Re: [PATCH 1/8] rhashtable: silence RCU warning in rhashtable_test.

2018-05-05 Thread NeilBrown

On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> print_ht in rhashtable_test calls rht_dereference() with neither
>> RCU protection or the mutex.  This triggers an RCU warning.
>> So take the mutex to silence the warning.
>> 
>> Signed-off-by: NeilBrown 
>
> I don't think the mutex is actually needed but since we don't
> have rht_dereference_raw and this is just test code:

I agrees there is no functional need for the mutex.  It just needed to
silence the warning, so that other warnings are easier to see.

>
> Acked-by: Herbert Xu 

thanks,
NeilBrown

> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


signature.asc
Description: PGP signature

Re: [PATCH 6/8] rhashtable: further improve stability of rhashtable_walk

2018-05-05 Thread NeilBrown

On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> If the sequence:
>>obj = rhashtable_walk_next(iter);
>>rhashtable_walk_stop(iter);
>>rhashtable_remove_fast(ht, &obj->head, params);
>>rhashtable_walk_start(iter);
>> 
>>  races with another thread inserting or removing
>>  an object on the same hash chain, a subsequent
>>  rhashtable_walk_next() is not guaranteed to get the "next"
>>  object. It is possible that an object could be
>>  repeated, or missed.
>> 
>>  This can be made more reliable by keeping the objects in a hash chain
>>  sorted by memory address.  A subsequent rhashtable_walk_next()
>>  call can reliably find the correct position in the list, and thus
>>  find the 'next' object.
>> 
>>  It is not possible (certainly not so easy) to achieve this with an
>>  rhltable as keeping the hash chain in order is not so easy.  When the
>>  first object with a given key is removed, it is replaced in the chain
>>  with the next object with the same key, and the address of that
>>  object may not be correctly ordered.
>>  No current user of rhltable_walk_enter() calls
>>  rhashtable_walk_start() more than once, so no current code
>>  could benefit from a more reliable walk of rhltables.
>> 
>>  This patch only attempts to improve walks for rhashtables.
>>  - a new object is always inserted after the last object with a
>>smaller address, or at the start
>>  - when rhashtable_walk_start() is called, it records that 'p' is not
>>'safe', meaning that it cannot be dereferenced.  The revalidation
>>that was previously done here is moved to rhashtable_walk_next()
>>  - when rhashtable_walk_next() is called while p is not NULL and not
>>safe, it walks the chain looking for the first object with an
>>address greater than p and returns that.  If there is none, it moves
>>to the next hash chain.
>> 
>> Signed-off-by: NeilBrown 
>
> I'm a bit torn on this.  On the hand this is definitely an improvement
> over the status quo.  On the other this does not work on rhltable and
> we do have a way of fixing it for both rhashtable and rhltable.

Do we?  How could we fix it for both rhashtable and rhltable?

Thanks,
NeilBrown



signature.asc
Description: PGP signature

Re: [PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.

2018-05-05 Thread NeilBrown

On Sat, May 05 2018, Herbert Xu wrote:

> On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
>> rhashtable_try_insert() currently hold a lock on the bucket in
>> the first table, while also locking buckets in subsequent tables.
>> This is unnecessary and looks like a hold-over from some earlier
>> version of the implementation.
>> 
>> As insert and remove always lock a bucket in each table in turn, and
>> as insert only inserts in the final table, there cannot be any races
>> that are not covered by simply locking a bucket in each table in turn.
>> 
>> When an insert call reaches that last table it can be sure that there
>> is no match entry in any other table as it has searched them all, and
>> insertion never happens anywhere but in the last table.  The fact that
>> code tests for the existence of future_tbl while holding a lock on
>> the relevant bucket ensures that two threads inserting the same key
>> will make compatible decisions about which is the "last" table.
>> 
>> This simplifies the code and allows the ->rehash field to be
>> discarded.
>> We still need a way to ensure that a dead bucket_table is never
>> re-linked by rhashtable_walk_stop().  This can be achieved by
>> setting the ->size to 1.  This still allows lookup code to work (and
>> correctly not find anything) but can never happen on an active bucket
>> table (as the minimum size is 4).
>> 
>> Signed-off-by: NeilBrown 
>
> I'm not convinced this is safe.  There can be multiple tables
> in existence.  Unless you hold the lock on the first table, what
> is to stop two parallel inserters each from inserting into their
> "last" tables which may not be the same table?

The insert function must (and does) take the lock on the bucket before
testing if there is a "next" table.
If one inserter finds that it has locked the "last" table (because there
is no next) and successfully inserts, then the other inserter cannot
have locked that table yet, else it would have inserted.  When it does,
it will find what the first inserter inserted. 

Thanks,
NeilBrown

>
> Cheers,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


signature.asc
Description: PGP signature

Re: WARNING in kernfs_add_one

2018-05-05 Thread Greg KH

On Sat, May 05, 2018 at 10:43:45AM -0700, Eric Dumazet wrote:
> 
> 
> On 05/05/2018 09:40 AM, Greg KH wrote:
> > On Sat, May 05, 2018 at 08:47:02AM -0700, syzbot wrote:
> >> Hello,
> >>
> >> syzbot found the following crash on:
> >>
> >> HEAD commit:8fb11a9a8d51 net/ipv6: rename rt6_next to fib6_next
> >> git tree:   net-next
> >> console output: https://syzkaller.appspot.com/x/log.txt?x=14b2723780
> >> kernel config:  https://syzkaller.appspot.com/x/.config?x=c416c61f3cd96be
> >> dashboard link: 
> >> https://syzkaller.appspot.com/bug?extid=df47f81c226b31d89fb1
> >> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> >> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=172fb3e780
> >> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=16552e5780
> >>
> >> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> >> Reported-by: syzbot+df47f81c226b31d89...@syzkaller.appspotmail.com
> >>
> >> RBP: 7fff808f3e10 R08: 0002 R09: 7fff80003534
> >> R10:  R11: 0246 R12: 
> >> R13: 0006 R14:  R15: 
> >> [ cut here ]
> >> kernfs: ns required in 'ieee80211' for 'phy3'
> > 
> > That's interesting, this looks like a netfilter bug (adding netdev to
> > the report here.)
> 
> 
> I do not see anything netfilter related here.
> 
> More likely wireless territory

Ugh, that's what I get for writing emails before coffee in the
morning...

Yes, you are right, this looks like a wireless issue.

Now cc: linux-wireless.

> > Yes, we can "tone down" the kernfs warning to just be an error message
> > in the log, but there might be something worse going on here.
> > 
> > Network developers, any idea?  Rest of the callback chain is here:
> > 
> > 
> >> WARNING: CPU: 0 PID: 4538 at fs/kernfs/dir.c:759 kernfs_add_one+0x406/0x4d0
> >> fs/kernfs/dir.c:758
> >> Kernel panic - not syncing: panic_on_warn set ...
> >>
> >> CPU: 0 PID: 4538 Comm: syz-executor486 Not tainted 4.17.0-rc3+ #33
> >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> >> Google 01/01/2011
> >> Call Trace:
> >>  __dump_stack lib/dump_stack.c:77 [inline]
> >>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
> >>  panic+0x22f/0x4de kernel/panic.c:184
> >>  __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
> >>  report_bug+0x252/0x2d0 lib/bug.c:186
> >>  fixup_bug arch/x86/kernel/traps.c:178 [inline]
> >>  do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
> >>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
> >>  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
> >> RIP: 0010:kernfs_add_one+0x406/0x4d0 fs/kernfs/dir.c:758
> >> RSP: 0018:8801ca9eece0 EFLAGS: 00010286
> >> RAX: 002d RBX: 87d5cee0 RCX: 8160ba7d
> >> RDX:  RSI: 81610731 RDI: 8801ca9ee840
> >> RBP: 8801ca9eed20 R08: 8801d9538500 R09: 0006
> >> R10: 8801d9538500 R11:  R12: 8801ad1cb6c0
> >> R13: 885da640 R14: 0020 R15: 
> >>  kernfs_create_link+0x112/0x180 fs/kernfs/symlink.c:41
> >>  sysfs_do_create_link_sd.isra.2+0x90/0x130 fs/sysfs/symlink.c:43
> >>  sysfs_do_create_link fs/sysfs/symlink.c:79 [inline]
> >>  sysfs_create_link+0x65/0xc0 fs/sysfs/symlink.c:91
> >>  device_add_class_symlinks drivers/base/core.c:1612 [inline]
> >>  device_add+0x7a0/0x16d0 drivers/base/core.c:1810
> >>  wiphy_register+0x178a/0x2430 net/wireless/core.c:806
> >>  ieee80211_register_hw+0x13cd/0x35d0 net/mac80211/main.c:1047
> >>  mac80211_hwsim_new_radio+0x1d9b/0x3410
> >> drivers/net/wireless/mac80211_hwsim.c:2772
> >>  hwsim_new_radio_nl+0x7a7/0xa60 drivers/net/wireless/mac80211_hwsim.c:3246
> >>  genl_family_rcv_msg+0x889/0x1120 net/netlink/genetlink.c:599
> >>  genl_rcv_msg+0xc6/0x170 net/netlink/genetlink.c:624
> >>  netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
> >>  genl_rcv+0x28/0x40 net/netlink/genetlink.c:635
> >>  netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
> >>  netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
> >>  netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
> >>  sock_sendmsg_nosec net/socket.c:629 [inline]
> >>  sock_sendmsg+0xd5/0x120 net/socket.c:639
> >>  ___sys_sendmsg+0x805/0x940 net/socket.c:2117
> >>  __sys_sendmsg+0x115/0x270 net/socket.c:2155
> >>  __do_sys_sendmsg net/socket.c:2164 [inline]
> >>  __se_sys_sendmsg net/socket.c:2162 [inline]
> >>  __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2162
> >>  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> >>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >> RIP: 0033:0x4404c9
> >> RSP: 002b:7fff808f3e08 EFLAGS: 0246 ORIG_RAX: 002e
> >> RAX: ffda RBX:  RCX: 004404c9
> >> RDX:  RSI: 20b3dfc8 RDI: 0005
> >> RBP: 7fff808f3e10 R08: 0002 R09: 7fff80003534

[PATCH bpf-next v5 0/4] Hash support for sock

2018-05-05 Thread John Fastabend

In the original sockmap implementation we got away with using an
array similar to devmap. However, unlike devmap where an ifindex
has a nice 1:1 function into the map we have found some use cases
with sockets that need to be referenced using longer keys.

This series adds support for a sockhash map reusing as much of
the sockmap code as possible. I made the decision to add sockhash
specific helpers vs trying to generalize the existing helpers
because (a) they have sockmap in the name and (b) the keys are
different types. I prefer to be explicit here rather than play
type games or do something else tricky.

To test this we duplicate all the sockmap testing except swap out
the sockmap with a sockhash.

v2: fix file stats and add v2 tag
v3: move tool updates into test patch, move bpftool updates into
its own patch, and fixup the test patch stats to catch the
renamed file and provide only diffs +/- on that.
v4: Add documentation to UAPI bpf.h
v5: Add documentation to tools UAPI bpf.h

Just a note I pushed Dave's Acks through v4 into v5 due to small
size of change.

John Fastabend (4):
  bpf: sockmap, refactor sockmap routines to work with hashmap
  bpf: sockmap, add hash map support
  bpf: bpftool, support for sockhash
  bpf: selftest additions for SOCKHASH

 include/linux/bpf.h|   8 +
 include/linux/bpf_types.h  |   1 +
 include/linux/filter.h |   3 +-
 include/net/tcp.h  |   3 +-
 include/uapi/linux/bpf.h   |   6 +-
 kernel/bpf/core.c  |   1 +
 kernel/bpf/sockmap.c   | 638 ++---
 kernel/bpf/verifier.c  |  14 +-
 net/core/filter.c  |  89 ++-
 tools/bpf/bpftool/map.c|   1 +
 tools/include/uapi/linux/bpf.h |   6 +-
 tools/testing/selftests/bpf/Makefile   |   3 +-
 tools/testing/selftests/bpf/test_sockhash_kern.c   |   4 +
 tools/testing/selftests/bpf/test_sockmap.c |  27 +-
 .../{test_sockmap_kern.c => test_sockmap_kern.h}   |   6 +-
 15 files changed, 695 insertions(+), 115 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 rename tools/testing/selftests/bpf/{test_sockmap_kern.c => 
test_sockmap_kern.h} (98%)

-- 
1.9.1

[PATCH bpf-next v5 3/4] bpf: selftest additions for SOCKHASH

2018-05-05 Thread John Fastabend

This runs existing SOCKMAP tests with SOCKHASH map type. To do this
we push programs into include file and build two BPF programs. One
for SOCKHASH and one for SOCKMAP.

We then run the entire test suite with each type.

Signed-off-by: John Fastabend 
Acked-by: David S. Miller 
---
 tools/include/uapi/linux/bpf.h | 52 +-
 tools/testing/selftests/bpf/Makefile   |  2 +-
 tools/testing/selftests/bpf/test_sockhash_kern.c   |  4 ++
 tools/testing/selftests/bpf/test_sockmap.c | 27 ---
 .../{test_sockmap_kern.c => test_sockmap_kern.h}   | 10 ++---
 5 files changed, 81 insertions(+), 14 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 rename tools/testing/selftests/bpf/{test_sockmap_kern.c => 
test_sockmap_kern.h} (97%)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 83a95ae..f3d8d05 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_DEVMAP,
BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_CPUMAP,
+   BPF_MAP_TYPE_SOCKHASH,
 };
 
 enum bpf_prog_type {
@@ -1825,6 +1826,52 @@ struct bpf_stack_build_id {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
+ * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map 
*map, void *key, u64 flags)
+ * Description
+ * Add an entry to, or update a sockhash *map* referencing sockets.
+ * The *skops* is used as a new value for the entry associated to
+ * *key*. *flags* is one of:
+ *
+ * **BPF_NOEXIST**
+ * The entry for *key* must not exist in the map.
+ * **BPF_EXIST**
+ * The entry for *key* must already exist in the map.
+ * **BPF_ANY**
+ * No condition on the existence of the entry for *key*.
+ *
+ * If the *map* has eBPF programs (parser and verdict), those will
+ * be inherited by the socket being added. If the socket is
+ * already attached to eBPF programs, this results in an error.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, 
void *key, u64 flags)
+ * Description
+ * This helper is used in programs implementing policies at the
+ * socket level. If the message *msg* is allowed to pass (i.e. if
+ * the verdict eBPF program returns **SK_PASS**), redirect it to
+ * the socket referenced by *map* (of type
+ * **BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
+ * egress interfaces can be used for redirection. The
+ * **BPF_F_INGRESS** value in *flags* is used to make the
+ * distinction (ingress path is selected if the flag is present,
+ * egress path otherwise). This is the only flag supported for now.
+ * Return
+ * **SK_PASS** on success, or **SK_DROP** on error.
+ *
+ * int bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void 
*key, u64 flags)
+ * Description
+ * This helper is used in programs implementing policies at the
+ * skb socket level. If the sk_buff *skb* is allowed to pass (i.e.
+ * if the verdeict eBPF program returns **SK_PASS**), redirect it
+ * to the socket referenced by *map* (of type
+ * **BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
+ * egress interfaces can be used for redirection. The
+ * **BPF_F_INGRESS** value in *flags* is used to make the
+ * distinction (ingress path is selected if the flag is present,
+ * egress otherwise). This is the only flag supported for now.
+ * Return
+ * **SK_PASS** on success, or **SK_DROP** on error.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -1895,7 +1942,10 @@ struct bpf_stack_build_id {
FN(xdp_adjust_tail),\
FN(skb_get_xfrm_state), \
FN(get_stack),  \
-   FN(skb_load_bytes_relative),
+   FN(skb_load_bytes_relative),\
+   FN(sock_hash_update),   \
+   FN(msg_redirect_hash),  \
+   FN(sk_redirect_hash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 9d76218..28316f1 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -33,7 +33,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
s

[PATCH bpf-next v5 2/4] bpf: sockmap, add hash map support

2018-05-05 Thread John Fastabend

Sockmap is currently backed by an array and enforces keys to be
four bytes. This works well for many use cases and was originally
modeled after devmap which also uses four bytes keys. However,
this has become limiting in larger use cases where a hash would
be more appropriate. For example users may want to use the 5-tuple
of the socket as the lookup key.

To support this add hash support.

Signed-off-by: John Fastabend 
Acked-by: David S. Miller 
---
 include/linux/bpf.h   |   8 +
 include/linux/bpf_types.h |   1 +
 include/uapi/linux/bpf.h  |  52 -
 kernel/bpf/core.c |   1 +
 kernel/bpf/sockmap.c  | 494 --
 kernel/bpf/verifier.c |  14 +-
 net/core/filter.c |  58 ++
 7 files changed, 610 insertions(+), 18 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 321969d..6dbe1aa 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -668,6 +668,7 @@ static inline void bpf_map_offload_map_free(struct bpf_map 
*map)
 
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL) && 
defined(CONFIG_INET)
 struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
+struct sock  *__sock_hash_lookup_elem(struct bpf_map *map, void *key);
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
 #else
 static inline struct sock  *__sock_map_lookup_elem(struct bpf_map *map, u32 
key)
@@ -675,6 +676,12 @@ static inline struct sock  *__sock_map_lookup_elem(struct 
bpf_map *map, u32 key)
return NULL;
 }
 
+static inline struct sock  *__sock_hash_lookup_elem(struct bpf_map *map,
+   void *key)
+{
+   return NULL;
+}
+
 static inline int sock_map_prog(struct bpf_map *map,
struct bpf_prog *prog,
u32 type)
@@ -724,6 +731,7 @@ static inline void __xsk_map_flush(struct bpf_map *map)
 extern const struct bpf_func_proto bpf_get_stackid_proto;
 extern const struct bpf_func_proto bpf_get_stack_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
+extern const struct bpf_func_proto bpf_sock_hash_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index d7df1b32..b67f879 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -47,6 +47,7 @@
 BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 #if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_INET)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
 #if defined(CONFIG_XDP_SOCKETS)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 93d5a4e..98c0e8c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -117,6 +117,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_CPUMAP,
BPF_MAP_TYPE_XSKMAP,
+   BPF_MAP_TYPE_SOCKHASH,
 };
 
 enum bpf_prog_type {
@@ -1826,6 +1827,52 @@ struct bpf_stack_build_id {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
+ * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map 
*map, void *key, u64 flags)
+ * Description
+ * Add an entry to, or update a sockhash *map* referencing sockets.
+ * The *skops* is used as a new value for the entry associated to
+ * *key*. *flags* is one of:
+ *
+ * **BPF_NOEXIST**
+ * The entry for *key* must not exist in the map.
+ * **BPF_EXIST**
+ * The entry for *key* must already exist in the map.
+ * **BPF_ANY**
+ * No condition on the existence of the entry for *key*.
+ *
+ * If the *map* has eBPF programs (parser and verdict), those will
+ * be inherited by the socket being added. If the socket is
+ * already attached to eBPF programs, this results in an error.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, 
void *key, u64 flags)
+ * Description
+ * This helper is used in programs implementing policies at the
+ * socket level. If the message *msg* is allowed to pass (i.e. if
+ * the verdict eBPF program returns **SK_PASS**), redirect it to
+ * the socket referenced by *map* (of type
+ * **BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
+ * egress interfaces can be used for redirection. The
+ * **BPF_F_INGRESS** value in *flags* is used to make the
+ * distinction (ingress path is selected if the flag is present,
+ * egress path otherwise). This is the only flag supported for now.
+ *

[PATCH bpf-next v5 4/4] bpf: bpftool, support for sockhash

2018-05-05 Thread John Fastabend

This adds the SOCKHASH map type to bpftools so that we get correct
pretty printing.

Signed-off-by: John Fastabend 
Acked-by: David S. Miller 
---
 tools/bpf/bpftool/map.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index af6766e..097b1a5 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -66,6 +66,7 @@
[BPF_MAP_TYPE_DEVMAP]   = "devmap",
[BPF_MAP_TYPE_SOCKMAP]  = "sockmap",
[BPF_MAP_TYPE_CPUMAP]   = "cpumap",
+   [BPF_MAP_TYPE_SOCKHASH] = "sockhash",
 };
 
 static bool map_is_per_cpu(__u32 type)
-- 
1.9.1

[PATCH bpf-next v5 1/4] bpf: sockmap, refactor sockmap routines to work with hashmap

2018-05-05 Thread John Fastabend

This patch only refactors the existing sockmap code. This will allow
much of the psock initialization code path and bpf helper codes to
work for both sockmap bpf map types that are backed by an array, the
currently supported type, and the new hash backed bpf map type
sockhash.

Most the fallout comes from three changes,

  - Pushing bpf programs into an independent structure so we
can use it from the htab struct in the next patch.
  - Generalizing helpers to use void *key instead of the hardcoded
u32.
  - Instead of passing map/key through the metadata we now do
the lookup inline. This avoids storing the key in the metadata
which will be useful when keys can be longer than 4 bytes. We
rename the sk pointers to sk_redir at this point as well to
avoid any confusion between the current sk pointer and the
redirect pointer sk_redir.

Signed-off-by: John Fastabend 
Acked-by: David S. Miller 
---
 include/linux/filter.h |   3 +-
 include/net/tcp.h  |   3 +-
 kernel/bpf/sockmap.c   | 148 +
 net/core/filter.c  |  31 +++
 4 files changed, 98 insertions(+), 87 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index da7e165..9dbcb9d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -515,9 +515,8 @@ struct sk_msg_buff {
int sg_end;
struct scatterlist sg_data[MAX_SKB_FRAGS];
bool sg_copy[MAX_SKB_FRAGS];
-   __u32 key;
__u32 flags;
-   struct bpf_map *map;
+   struct sock *sk_redir;
struct sk_buff *skb;
struct list_head list;
 };
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e..089185a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -814,9 +814,8 @@ struct tcp_skb_cb {
 #endif
} header;   /* For incoming skbs */
struct {
-   __u32 key;
__u32 flags;
-   struct bpf_map *map;
+   struct sock *sk_redir;
void *data_end;
} bpf;
};
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 801273b..4eef5b1 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -48,14 +48,18 @@
 #define SOCK_CREATE_FLAG_MASK \
(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
 
-struct bpf_stab {
-   struct bpf_map map;
-   struct sock **sock_map;
+struct bpf_sock_progs {
struct bpf_prog *bpf_tx_msg;
struct bpf_prog *bpf_parse;
struct bpf_prog *bpf_verdict;
 };
 
+struct bpf_stab {
+   struct bpf_map map;
+   struct sock **sock_map;
+   struct bpf_sock_progs progs;
+};
+
 enum smap_psock_state {
SMAP_TX_RUNNING,
 };
@@ -460,7 +464,7 @@ static int free_curr_sg(struct sock *sk, struct sk_msg_buff 
*md)
 static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
 {
return ((_rc == SK_PASS) ?
-  (md->map ? __SK_REDIRECT : __SK_PASS) :
+  (md->sk_redir ? __SK_REDIRECT : __SK_PASS) :
   __SK_DROP);
 }
 
@@ -1091,7 +1095,7 @@ static int smap_verdict_func(struct smap_psock *psock, 
struct sk_buff *skb)
 * when we orphan the skb so that we don't have the possibility
 * to reference a stale map.
 */
-   TCP_SKB_CB(skb)->bpf.map = NULL;
+   TCP_SKB_CB(skb)->bpf.sk_redir = NULL;
skb->sk = psock->sock;
bpf_compute_data_pointers(skb);
preempt_disable();
@@ -1101,7 +1105,7 @@ static int smap_verdict_func(struct smap_psock *psock, 
struct sk_buff *skb)
 
/* Moving return codes from UAPI namespace into internal namespace */
return rc == SK_PASS ?
-   (TCP_SKB_CB(skb)->bpf.map ? __SK_REDIRECT : __SK_PASS) :
+   (TCP_SKB_CB(skb)->bpf.sk_redir ? __SK_REDIRECT : __SK_PASS) :
__SK_DROP;
 }
 
@@ -1371,7 +1375,6 @@ static int smap_init_sock(struct smap_psock *psock,
 }
 
 static void smap_init_progs(struct smap_psock *psock,
-   struct bpf_stab *stab,
struct bpf_prog *verdict,
struct bpf_prog *parse)
 {
@@ -1449,14 +1452,13 @@ static void smap_gc_work(struct work_struct *w)
kfree(psock);
 }
 
-static struct smap_psock *smap_init_psock(struct sock *sock,
- struct bpf_stab *stab)
+static struct smap_psock *smap_init_psock(struct sock *sock, int node)
 {
struct smap_psock *psock;
 
psock = kzalloc_node(sizeof(struct smap_psock),
 GFP_ATOMIC | __GFP_NOWARN,
-stab->map.numa_node);
+node);
if (!psock)
return ERR_PTR(-ENOMEM);
 
@@ -1661,40 +1663,26 @@ static int sock_map_delete_elem(struct bpf_map *map, 
void *key)
  *  - sock_map must use READ_ONCE and (cmp)xchg operations
  *  - BPF verdict/parse programs must use R

BUG: please report to d...@vger.kernel.org => prev = 0, last = 0 at net/dccp/ccids/lib/packet_history.c:LINE/tfrc_rx_his

2018-05-05 Thread syzbot


Hello,

syzbot found the following crash on:

HEAD commit:c1c07416cdd4 Merge tag 'kbuild-fixes-v4.17' of git://git.k..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=13d5de4780
kernel config:  https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27
dashboard link: https://syzkaller.appspot.com/bug?extid=99858724c0ba555a12ea
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=170afde780
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=141b4be780

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+99858724c0ba555a1...@syzkaller.appspotmail.com

random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
BUG: please report to d...@vger.kernel.org => prev = 0, last = 0 at  
net/dccp/ccids/lib/packet_history.c:425/tfrc_rx_hist_sample_rtt()

CPU: 0 PID: 4495 Comm: syz-executor551 Not tainted 4.17.0-rc3+ #34
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Call Trace:
 
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 tfrc_rx_hist_sample_rtt.cold.3+0x54/0x5c  
net/dccp/ccids/lib/packet_history.c:422

 ccid3_hc_rx_packet_recv+0x5c8/0xed0 net/dccp/ccids/ccid3.c:765
 ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline]
 dccp_deliver_input_to_ccids+0xf0/0x280 net/dccp/input.c:180
 dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378
 dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654
 sk_backlog_rcv include/net/sock.h:909 [inline]
 __sk_receive_skb+0x3a2/0xd60 net/core/sock.c:513
 dccp_v4_rcv+0x10e5/0x1f3f net/dccp/ipv4.c:875
 ip_local_deliver_finish+0x2e3/0xd80 net/ipv4/ip_input.c:215
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ip_local_deliver+0x1e1/0x720 net/ipv4/ip_input.c:256
 dst_input include/net/dst.h:450 [inline]
 ip_rcv_finish+0x81b/0x2200 net/ipv4/ip_input.c:396
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ip_rcv+0xb70/0x143d net/ipv4/ip_input.c:492
 __netif_receive_skb_core+0x26f5/0x3630 net/core/dev.c:4592
 __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4657
 process_backlog+0x219/0x760 net/core/dev.c:5337
 napi_poll net/core/dev.c:5735 [inline]
 net_rx_action+0x7b7/0x1930 net/core/dev.c:5801
 __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1046
 
 do_softirq.part.17+0x14d/0x190 kernel/softirq.c:329
 do_softirq arch/x86/include/asm/preempt.h:23 [inline]
 __local_bh_enable_ip+0x1ec/0x230 kernel/softirq.c:182
 local_bh_enable include/linux/bottom_half.h:32 [inline]
 rcu_read_unlock_bh include/linux/rcupdate.h:728 [inline]
 ip_finish_output2+0xab2/0x1840 net/ipv4/ip_output.c:231
 ip_finish_output+0x828/0xf80 net/ipv4/ip_output.c:317
 NF_HOOK_COND include/linux/netfilter.h:277 [inline]
 ip_output+0x21b/0x850 net/ipv4/ip_output.c:405
 dst_output include/net/dst.h:444 [inline]
 ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124
 ip_queue_xmit+0x9d7/0x1f70 net/ipv4/ip_output.c:504
 dccp_transmit_skb+0x999/0x12e0 net/dccp/output.c:142
 dccp_xmit_packet+0x250/0x790 net/dccp/output.c:281
 dccp_write_xmit+0x190/0x1f0 net/dccp/output.c:363
 dccp_sendmsg+0x8c7/0x1020 net/dccp/proto.c:818
 inet_sendmsg+0x19f/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 ___sys_sendmsg+0x525/0x940 net/socket.c:2117
 __sys_sendmmsg+0x240/0x6f0 net/socket.c:2212
 __do_sys_sendmmsg net/socket.c:2241 [inline]
 __se_sys_sendmmsg net/socket.c:2238 [inline]
 __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2238
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x445d09
RSP: 002b:7f3c7eff5d88 EFLAGS: 0293 ORIG_RAX: 0133
RAX: ffda RBX: 006dac40 RCX: 00445d09
RDX: 0001 RSI: 00


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged

into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.

Note: all commands must start from beginning of the line in the email body.

Re: [PATCH net] net/tls: Don't recursively call push_record during tls_write_space callbacks

2018-05-05 Thread Andre Tomt


On 01. mai 2018 22:05, Dave Watson wrote:

It is reported that in some cases, write_space may be called in
do_tcp_sendpages, such that we recursively invoke do_tcp_sendpages again:

[  660.468802]  ? do_tcp_sendpages+0x8d/0x580
[  660.468826]  ? tls_push_sg+0x74/0x130 [tls]
[  660.468852]  ? tls_push_record+0x24a/0x390 [tls]
[  660.468880]  ? tls_write_space+0x6a/0x80 [tls]
...

tls_push_sg already does a loop over all sending sg's, so ignore
any tls_write_space notifications until we are done sending.
We then have to call the previous write_space to wake up
poll() waiters after we are done with the send loop.

Reported-by: Andre Tomt 
Signed-off-by: Dave Watson 


Unfortunately it seems like this patch has a bug, while it fixed the 
kernel crashing it is causing some connections in my testbed to stall.


Making sure ctx->in_tcp_sendpages is also cleared before the return ret 
within the while(1) loop seems to fix it for me.



diff -Naurp a/net/tls/tls_main.c b/net/tls/tls_main.c
--- a/net/tls/tls_main.c2018-05-06 02:23:41.638597066 +0200
+++ b/net/tls/tls_main.c2018-05-06 01:59:14.378568139 +0200
@@ -135,6 +135,7 @@ retry:
offset -= sg->offset;
ctx->partially_sent_offset = offset;
ctx->partially_sent_record = (void *)sg;
+   ctx->in_tcp_sendpages = false;
return ret;
}

[PATCH net-next 0/4] qed*: Add support for new multi partitioning modes.

2018-05-05 Thread Sudarsana Reddy Kalluru

From: Sudarsana Reddy Kalluru 

The patch series simplifies the multi function (MF) mode implementation of
qed/qede drivers, and adds support for new MF modes.
 
Please consider applying it to net-next branch.

Sudarsana Reddy Kalluru (4):
  qed*: Refactor mf_mode to consist of bits.
  qed: Remove unused data member 'is_mf_default'.
  qed: Add support for multi function mode with 802.1ad tagging.
  qed: Add support for Unified Fabric Port.

 drivers/net/ethernet/qlogic/qed/qed.h |  61 +++-
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c|  14 ++-
 drivers/net/ethernet/qlogic/qed/qed_dev.c | 113 +++---
 drivers/net/ethernet/qlogic/qed/qed_fcoe.c|   3 +
 drivers/net/ethernet/qlogic/qed/qed_hsi.h |  28 ++
 drivers/net/ethernet/qlogic/qed/qed_ll2.c |  46 ++---
 drivers/net/ethernet/qlogic/qed/qed_main.c|   4 +-
 drivers/net/ethernet/qlogic/qed/qed_mcp.c |  78 +++
 drivers/net/ethernet/qlogic/qed/qed_mcp.h |   8 ++
 drivers/net/ethernet/qlogic/qed/qed_sp.h  |  12 ++-
 drivers/net/ethernet/qlogic/qed/qed_sp_commands.c |  76 ---
 drivers/net/ethernet/qlogic/qede/qede_main.c  |   4 +-
 drivers/scsi/qedf/qedf_fip.c  |   7 +-
 drivers/scsi/qedf/qedf_main.c |   2 +-
 drivers/scsi/qedi/qedi_iscsi.c|   2 +-
 include/linux/qed/qed_if.h|   3 +-
 include/linux/qed/qed_ll2_if.h|  10 +-
 17 files changed, 389 insertions(+), 82 deletions(-)

-- 
1.8.3.1

[PATCH net-next 1/4] qed*: Refactor mf_mode to consist of bits.

2018-05-05 Thread Sudarsana Reddy Kalluru

`mf_mode' field indicates the multi-partitioning mode the device is
configured to. This method doesn't scale very well, adding a new MF mode
requires going over all the existing conditions, and deciding whether those
are needed for the new mode or not.
The patch defines a set of bit-fields for modes which are derived according
to the mode info shared by the MFW and all the configuration would be made
according to those. To add a new mode, there would be a single place where
we'll need to go and choose which bits apply and which don't.

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed.h | 41 ---
 drivers/net/ethernet/qlogic/qed/qed_dev.c | 39 +++--
 drivers/net/ethernet/qlogic/qed/qed_ll2.c |  6 ++--
 drivers/net/ethernet/qlogic/qed/qed_main.c|  6 ++--
 drivers/net/ethernet/qlogic/qed/qed_sp.h  |  3 +-
 drivers/net/ethernet/qlogic/qed/qed_sp_commands.c | 16 +++--
 drivers/net/ethernet/qlogic/qede/qede_main.c  |  4 +--
 include/linux/qed/qed_if.h|  2 +-
 8 files changed, 71 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h 
b/drivers/net/ethernet/qlogic/qed/qed.h
index e07460a..c8f3507 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -439,6 +439,41 @@ struct qed_fw_data {
u32 init_ops_size;
 };
 
+enum qed_mf_mode_bit {
+   /* Supports PF-classification based on tag */
+   QED_MF_OVLAN_CLSS,
+
+   /* Supports PF-classification based on MAC */
+   QED_MF_LLH_MAC_CLSS,
+
+   /* Supports PF-classification based on protocol type */
+   QED_MF_LLH_PROTO_CLSS,
+
+   /* Requires a default PF to be set */
+   QED_MF_NEED_DEF_PF,
+
+   /* Allow LL2 to multicast/broadcast */
+   QED_MF_LL2_NON_UNICAST,
+
+   /* Allow Cross-PF [& child VFs] Tx-switching */
+   QED_MF_INTER_PF_SWITCH,
+
+   /* Unified Fabtic Port support enabled */
+   QED_MF_UFP_SPECIFIC,
+
+   /* Disable Accelerated Receive Flow Steering (aRFS) */
+   QED_MF_DISABLE_ARFS,
+
+   /* Use vlan for steering */
+   QED_MF_8021Q_TAGGING,
+
+   /* Use stag for steering */
+   QED_MF_8021AD_TAGGING,
+
+   /* Allow DSCP to TC mapping */
+   QED_MF_DSCP_TO_TC_MAP,
+};
+
 enum BAR_ID {
BAR_ID_0,   /* used for GRC */
BAR_ID_1/* Used for doorbells */
@@ -669,10 +704,8 @@ struct qed_dev {
u8  num_funcs_in_port;
 
u8  path_id;
-   enum qed_mf_modemf_mode;
-#define IS_MF_DEFAULT(_p_hwfn)  (((_p_hwfn)->cdev)->mf_mode == QED_MF_DEFAULT)
-#define IS_MF_SI(_p_hwfn)   (((_p_hwfn)->cdev)->mf_mode == QED_MF_NPAR)
-#define IS_MF_SD(_p_hwfn)   (((_p_hwfn)->cdev)->mf_mode == QED_MF_OVLAN)
+
+   unsigned long   mf_bits;
 
int pcie_width;
int pcie_speed;
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c 
b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index d2ad5e9..9b07d7f 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -1149,18 +1149,10 @@ static int qed_calc_hw_mode(struct qed_hwfn *p_hwfn)
return -EINVAL;
}
 
-   switch (p_hwfn->cdev->mf_mode) {
-   case QED_MF_DEFAULT:
-   case QED_MF_NPAR:
-   hw_mode |= 1 << MODE_MF_SI;
-   break;
-   case QED_MF_OVLAN:
+   if (test_bit(QED_MF_OVLAN_CLSS, &p_hwfn->cdev->mf_bits))
hw_mode |= 1 << MODE_MF_SD;
-   break;
-   default:
-   DP_NOTICE(p_hwfn, "Unsupported MF mode, init as DEFAULT\n");
+   else
hw_mode |= 1 << MODE_MF_SI;
-   }
 
hw_mode |= 1 << MODE_ASIC;
 
@@ -1557,7 +1549,6 @@ static int qed_hw_init_pf(struct qed_hwfn *p_hwfn,
 
/* send function start command */
rc = qed_sp_pf_start(p_hwfn, p_ptt, p_tunn,
-p_hwfn->cdev->mf_mode,
 allow_npar_tx_switch);
if (rc) {
DP_NOTICE(p_hwfn, "Function start ramrod failed\n");
@@ -2651,17 +2642,25 @@ static int qed_hw_get_nvm_info(struct qed_hwfn *p_hwfn, 
struct qed_ptt *p_ptt)
 
switch (mf_mode) {
case NVM_CFG1_GLOB_MF_MODE_MF_ALLOWED:
-   p_hwfn->cdev->mf_mode = QED_MF_OVLAN;
+   p_hwfn->cdev->mf_bits = BIT(QED_MF_OVLAN_CLSS);
break;
case NVM_CFG1_GLOB_MF_MODE_NPAR1_0:
-   p_hwfn->cdev->mf_mode = QED_MF_NPAR;
+   p_hwfn->cdev->mf_bits = BIT(QED_MF_LLH_MAC_CLSS) |
+   BIT(QED_MF_LLH_PROTO_CLSS) |
+

[PATCH net-next 3/4] qed: Add support for multi function mode with 802.1ad tagging.

2018-05-05 Thread Sudarsana Reddy Kalluru

The patch adds support for new Multi function mode wherein the traffic
classification is done based on the 802.1ad tagging and the outer vlan tag
provided by the management firmware.

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed_dev.c | 64 ---
 drivers/net/ethernet/qlogic/qed/qed_sp_commands.c |  5 ++
 2 files changed, 49 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c 
b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index 9b07d7f..95d00cb 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -1668,6 +1668,18 @@ int qed_hw_init(struct qed_dev *cdev, struct 
qed_hw_init_params *p_params)
if (rc)
return rc;
 
+   if (IS_PF(cdev) && test_bit(QED_MF_8021AD_TAGGING,
+   &cdev->mf_bits)) {
+   STORE_RT_REG(p_hwfn, PRS_REG_TAG_ETHERTYPE_0_RT_OFFSET,
+ETH_P_8021AD);
+   STORE_RT_REG(p_hwfn, NIG_REG_TAG_ETHERTYPE_0_RT_OFFSET,
+ETH_P_8021AD);
+   STORE_RT_REG(p_hwfn, PBF_REG_TAG_ETHERTYPE_0_RT_OFFSET,
+ETH_P_8021AD);
+   STORE_RT_REG(p_hwfn, DORQ_REG_TAG1_ETHERTYPE_RT_OFFSET,
+ETH_P_8021AD);
+   }
+
qed_fill_load_req_params(&load_req_params,
 p_params->p_drv_load_params);
rc = qed_mcp_load_req(p_hwfn, p_hwfn->p_main_ptt,
@@ -2630,39 +2642,51 @@ static int qed_hw_get_nvm_info(struct qed_hwfn *p_hwfn, 
struct qed_ptt *p_ptt)
   link->pause.autoneg,
   p_caps->default_eee, p_caps->eee_lpi_timer);
 
-   /* Read Multi-function information from shmem */
-   addr = MCP_REG_SCRATCH + nvm_cfg1_offset +
-  offsetof(struct nvm_cfg1, glob) +
-  offsetof(struct nvm_cfg1_glob, generic_cont0);
+   if (IS_LEAD_HWFN(p_hwfn)) {
+   struct qed_dev *cdev = p_hwfn->cdev;
 
-   generic_cont0 = qed_rd(p_hwfn, p_ptt, addr);
+   /* Read Multi-function information from shmem */
+   addr = MCP_REG_SCRATCH + nvm_cfg1_offset +
+  offsetof(struct nvm_cfg1, glob) +
+  offsetof(struct nvm_cfg1_glob, generic_cont0);
 
-   mf_mode = (generic_cont0 & NVM_CFG1_GLOB_MF_MODE_MASK) >>
- NVM_CFG1_GLOB_MF_MODE_OFFSET;
+   generic_cont0 = qed_rd(p_hwfn, p_ptt, addr);
 
-   switch (mf_mode) {
-   case NVM_CFG1_GLOB_MF_MODE_MF_ALLOWED:
-   p_hwfn->cdev->mf_bits = BIT(QED_MF_OVLAN_CLSS);
-   break;
-   case NVM_CFG1_GLOB_MF_MODE_NPAR1_0:
-   p_hwfn->cdev->mf_bits = BIT(QED_MF_LLH_MAC_CLSS) |
+   mf_mode = (generic_cont0 & NVM_CFG1_GLOB_MF_MODE_MASK) >>
+ NVM_CFG1_GLOB_MF_MODE_OFFSET;
+
+   switch (mf_mode) {
+   case NVM_CFG1_GLOB_MF_MODE_MF_ALLOWED:
+   cdev->mf_bits = BIT(QED_MF_OVLAN_CLSS);
+   break;
+   case NVM_CFG1_GLOB_MF_MODE_BD:
+   cdev->mf_bits = BIT(QED_MF_OVLAN_CLSS) |
+   BIT(QED_MF_LLH_PROTO_CLSS) |
+   BIT(QED_MF_8021AD_TAGGING);
+   break;
+   case NVM_CFG1_GLOB_MF_MODE_NPAR1_0:
+   cdev->mf_bits = BIT(QED_MF_LLH_MAC_CLSS) |
BIT(QED_MF_LLH_PROTO_CLSS) |
BIT(QED_MF_LL2_NON_UNICAST) |
BIT(QED_MF_INTER_PF_SWITCH);
-   break;
-   case NVM_CFG1_GLOB_MF_MODE_DEFAULT:
-   p_hwfn->cdev->mf_bits = BIT(QED_MF_LLH_MAC_CLSS) |
+   break;
+   case NVM_CFG1_GLOB_MF_MODE_DEFAULT:
+   cdev->mf_bits = BIT(QED_MF_LLH_MAC_CLSS) |
BIT(QED_MF_LLH_PROTO_CLSS) |
BIT(QED_MF_LL2_NON_UNICAST);
-   if (QED_IS_BB(p_hwfn->cdev))
-   p_hwfn->cdev->mf_bits |= BIT(QED_MF_NEED_DEF_PF);
-   break;
+   if (QED_IS_BB(p_hwfn->cdev))
+   cdev->mf_bits |= BIT(QED_MF_NEED_DEF_PF);
+   break;
+   }
+
+   DP_INFO(p_hwfn, "Multi function mode is 0x%lx\n",
+   cdev->mf_bits);
}
 
DP_INFO(p_hwfn, "Multi function mode is 0x%lx\n",
p_hwfn->cdev->mf_bits);
 
-   /* Read Multi-function information from shmem */
+   /* Read device capabilities information from shmem */
addr

[PATCH net-next 2/4] qed: Remove unused data member 'is_mf_default'.

2018-05-05 Thread Sudarsana Reddy Kalluru

The data member 'is_mf_default' is not used by the qed/qede drivers,
removing the same.

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed_main.c | 2 --
 include/linux/qed/qed_if.h | 1 -
 2 files changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_main.c 
b/drivers/net/ethernet/qlogic/qed/qed_main.c
index 307fe33..70bc563 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_main.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_main.c
@@ -264,8 +264,6 @@ int qed_fill_dev_info(struct qed_dev *cdev,
dev_info->pci_mem_end = cdev->pci_params.mem_end;
dev_info->pci_irq = cdev->pci_params.irq;
dev_info->rdma_supported = QED_IS_RDMA_PERSONALITY(p_hwfn);
-   dev_info->is_mf_default = !test_bit(QED_MF_LLH_MAC_CLSS,
-   &cdev->mf_bits);
dev_info->dev_type = cdev->type;
ether_addr_copy(dev_info->hw_mac, hw_info->hw_mac_addr);
 
diff --git a/include/linux/qed/qed_if.h b/include/linux/qed/qed_if.h
index 5dac561..907976f 100644
--- a/include/linux/qed/qed_if.h
+++ b/include/linux/qed/qed_if.h
@@ -339,7 +339,6 @@ struct qed_dev_info {
u8  num_hwfns;
 
u8  hw_mac[ETH_ALEN];
-   boolis_mf_default;
 
/* FW version */
u16 fw_major;
-- 
1.8.3.1

[PATCH net-next 4/4] qed: Add support for Unified Fabric Port.

2018-05-05 Thread Sudarsana Reddy Kalluru

This patch adds driver changes for supporting the Unified Fabric Port
(UFP). This is a new paritioning mode wherein MFW provides the set of
parameters to be used by the device such as traffic class, outer-vlan
tag value, priority type etc. Drivers receives this info via notifications
from mfw and configures the hardware accordingly.

Signed-off-by: Sudarsana Reddy Kalluru 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed.h | 20 ++
 drivers/net/ethernet/qlogic/qed/qed_dcbx.c| 14 +++-
 drivers/net/ethernet/qlogic/qed/qed_dev.c | 32 --
 drivers/net/ethernet/qlogic/qed/qed_fcoe.c|  3 +
 drivers/net/ethernet/qlogic/qed/qed_hsi.h | 28 
 drivers/net/ethernet/qlogic/qed/qed_ll2.c | 40 
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 78 +++
 drivers/net/ethernet/qlogic/qed/qed_mcp.h |  8 +++
 drivers/net/ethernet/qlogic/qed/qed_sp.h  |  9 +++
 drivers/net/ethernet/qlogic/qed/qed_sp_commands.c | 57 -
 drivers/scsi/qedf/qedf_fip.c  |  7 +-
 drivers/scsi/qedf/qedf_main.c |  2 +-
 drivers/scsi/qedi/qedi_iscsi.c|  2 +-
 include/linux/qed/qed_ll2_if.h| 10 ++-
 14 files changed, 283 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed.h 
b/drivers/net/ethernet/qlogic/qed/qed.h
index c8f3507..adcff49 100644
--- a/drivers/net/ethernet/qlogic/qed/qed.h
+++ b/drivers/net/ethernet/qlogic/qed/qed.h
@@ -474,6 +474,24 @@ enum qed_mf_mode_bit {
QED_MF_DSCP_TO_TC_MAP,
 };
 
+enum qed_ufp_mode {
+   QED_UFP_MODE_ETS,
+   QED_UFP_MODE_VNIC_BW,
+   QED_UFP_MODE_UNKNOWN
+};
+
+enum qed_ufp_pri_type {
+   QED_UFP_PRI_OS,
+   QED_UFP_PRI_VNIC,
+   QED_UFP_PRI_UNKNOWN
+};
+
+struct qed_ufp_info {
+   enum qed_ufp_pri_type pri_type;
+   enum qed_ufp_mode mode;
+   u8 tc;
+};
+
 enum BAR_ID {
BAR_ID_0,   /* used for GRC */
BAR_ID_1/* Used for doorbells */
@@ -582,6 +600,8 @@ struct qed_hwfn {
 
struct qed_dcbx_info*p_dcbx_info;
 
+   struct qed_ufp_info ufp_info;
+
struct qed_dmae_infodmae_info;
 
/* QM init */
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c 
b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
index 449777f..8f31406 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dcbx.c
@@ -274,8 +274,8 @@ static bool qed_dcbx_roce_v2_tlv(u32 app_info_bitmap, u16 
proto_id, bool ieee)
 u32 pri_tc_tbl, int count, u8 dcbx_version)
 {
enum dcbx_protocol_type type;
+   bool enable, ieee, eth_tlv;
u8 tc, priority_map;
-   bool enable, ieee;
u16 protocol_id;
int priority;
int i;
@@ -283,6 +283,7 @@ static bool qed_dcbx_roce_v2_tlv(u32 app_info_bitmap, u16 
proto_id, bool ieee)
DP_VERBOSE(p_hwfn, QED_MSG_DCB, "Num APP entries = %d\n", count);
 
ieee = (dcbx_version == DCBX_CONFIG_VERSION_IEEE);
+   eth_tlv = false;
/* Parse APP TLV */
for (i = 0; i < count; i++) {
protocol_id = QED_MFW_GET_FIELD(p_tbl[i].entry,
@@ -304,13 +305,22 @@ static bool qed_dcbx_roce_v2_tlv(u32 app_info_bitmap, u16 
proto_id, bool ieee)
 * indication, but we only got here if there was an
 * app tlv for the protocol, so dcbx must be enabled.
 */
-   enable = !(type == DCBX_PROTOCOL_ETH);
+   if (type == DCBX_PROTOCOL_ETH) {
+   enable = false;
+   eth_tlv = true;
+   } else {
+   enable = true;
+   }
 
qed_dcbx_update_app_info(p_data, p_hwfn, enable,
 priority, tc, type);
}
}
 
+   /* If Eth TLV is not detected, use UFP TC as default TC */
+   if (test_bit(QED_MF_UFP_SPECIFIC, &p_hwfn->cdev->mf_bits) && !eth_tlv)
+   p_data->arr[DCBX_PROTOCOL_ETH].tc = p_hwfn->ufp_info.tc;
+
/* Update ramrod protocol data and hw_info fields
 * with default info when corresponding APP TLV's are not detected.
 * The enabled field has a different logic for ethernet as only for
diff --git a/drivers/net/ethernet/qlogic/qed/qed_dev.c 
b/drivers/net/ethernet/qlogic/qed/qed_dev.c
index 95d00cb..5605289 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_dev.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_dev.c
@@ -1499,6 +1499,11 @@ static int qed_hw_init_pf(struct qed_hwfn *p_hwfn,
STORE_RT_REG(p_hwfn, NIG_REG_LLH_FUNC_TAG_EN_RT_OFFSET, 1);
STORE_RT_REG(p_hwfn, NIG_REG_LLH_FUNC_TAG_VALUE_RT_OFFSET,

Re: [Intel-wired-lan] [PATCH v2 1/1] drivers core: multi-threading device shutdown

2018-05-05 Thread kbuild test robot

Hi Pavel,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on driver-core/driver-core-testing]
[also build test WARNING on v4.17-rc3 next-20180504]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Pavel-Tatashin/multi-threading-device-shutdown/20180506-035150
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/base/core.c:2877:25: sparse: symbol 'device_root_tasks_done' was not 
>> declared. Should it be static?
   drivers/base/core.c:62:5: sparse: context imbalance in 
'device_links_read_lock' - wrong count at exit
   drivers/base/core.c:67:6: sparse: context imbalance in 
'device_links_read_unlock' - unexpected unlock

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation

[RFC PATCH] drivers core: device_root_tasks_done can be static

2018-05-05 Thread kbuild test robot


Fixes: 6d54ba128905 ("drivers core: multi-threading device shutdown")
Signed-off-by: Fengguang Wu 
---
 core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 110d797..666c163 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2874,7 +2874,7 @@ static int device_shutdown_child_task(void *data);
  */
 static atomic_tdevice_root_tasks_finished;
 static atomic_tdevice_root_tasks_started;
-struct completion  device_root_tasks_done;
+static struct completion   device_root_tasks_done;
 
 /**
  * Shutdown device tree with root started in dev. If dev has no children

Re: [PATCH 4/8] rhashtable: fix race in nested_table_alloc()

2018-05-05 Thread Herbert Xu

On Sun, May 06, 2018 at 07:48:20AM +1000, NeilBrown wrote:
>
> The spinlock protects 2 or more buckets.  The nested table contains at
> least 512 buckets, maybe more.
> It is quite possible for two insertions into 2 different buckets to both
> get their spinlock and both try to instantiate the same nested table.

I think you missed the fact that when we use nested tables the spin
lock table is limited to just a single page and hence corresponds
to the first level in the nested table.  Therefore it's always safe.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.

2018-05-05 Thread Herbert Xu

On Sun, May 06, 2018 at 08:00:49AM +1000, NeilBrown wrote:
>
> The insert function must (and does) take the lock on the bucket before
> testing if there is a "next" table.
> If one inserter finds that it has locked the "last" table (because there
> is no next) and successfully inserts, then the other inserter cannot
> have locked that table yet, else it would have inserted.  When it does,
> it will find what the first inserter inserted. 

If you release the lock to the first table then it may be deleted
by the resize thread.  Hence the other inserter may not have even
started from the same place.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

87 matches

Mail list logo