Re: [PATCH net] net/mlx4_en: ethtool, Remove unsupported SFP EEPROM high pages query

2019-05-20 Thread David Miller
From: Tariq Toukan 
Date: Mon, 20 May 2019 17:42:52 +0300

> From: Erez Alfasi 
> 
> Querying EEPROM high pages data for SFP module is currently
> not supported by our driver but is still tried, resulting in
> invalid FW queries.
> 
> Set the EEPROM ethtool data length to 256 for SFP module to
> limit the reading for page 0 only and prevent invalid FW queries.
> 
> Fixes: 7202da8b7f71 ("ethtool, net/mlx4_en: Cable info, 
> get_module_info/eeprom ethtool support")
> Signed-off-by: Erez Alfasi 
> Signed-off-by: Tariq Toukan 
> ---
> 
> Hi Dave, please queue for -stable.

Applied and queued up for -stable, thanks Tariq.


[PATCH net] net/mlx4_en: ethtool, Remove unsupported SFP EEPROM high pages query

2019-05-20 Thread Tariq Toukan
From: Erez Alfasi 

Querying EEPROM high pages data for SFP module is currently
not supported by our driver but is still tried, resulting in
invalid FW queries.

Set the EEPROM ethtool data length to 256 for SFP module to
limit the reading for page 0 only and prevent invalid FW queries.

Fixes: 7202da8b7f71 ("ethtool, net/mlx4_en: Cable info, get_module_info/eeprom 
ethtool support")
Signed-off-by: Erez Alfasi 
Signed-off-by: Tariq Toukan 
---

Hi Dave, please queue for -stable.

 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 4 +++-
 drivers/net/ethernet/mellanox/mlx4/port.c   | 5 -
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index d290f0787dfb..94c59939a8cf 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -2010,6 +2010,8 @@ static int mlx4_en_set_tunable(struct net_device *dev,
return ret;
 }
 
+#define MLX4_EEPROM_PAGE_LEN 256
+
 static int mlx4_en_get_module_info(struct net_device *dev,
   struct ethtool_modinfo *modinfo)
 {
@@ -2044,7 +2046,7 @@ static int mlx4_en_get_module_info(struct net_device *dev,
break;
case MLX4_MODULE_ID_SFP:
modinfo->type = ETH_MODULE_SFF_8472;
-   modinfo->eeprom_len = ETH_MODULE_SFF_8472_LEN;
+   modinfo->eeprom_len = MLX4_EEPROM_PAGE_LEN;
break;
default:
return -EINVAL;
diff --git a/drivers/net/ethernet/mellanox/mlx4/port.c 
b/drivers/net/ethernet/mellanox/mlx4/port.c
index 10fcc22f4590..ba6ac31a339d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/port.c
@@ -2077,11 +2077,6 @@ int mlx4_get_module_info(struct mlx4_dev *dev, u8 port,
size -= offset + size - I2C_PAGE_SIZE;
 
i2c_addr = I2C_ADDR_LOW;
-   if (offset >= I2C_PAGE_SIZE) {
-   /* Reset offset to high page */
-   i2c_addr = I2C_ADDR_HIGH;
-   offset -= I2C_PAGE_SIZE;
-   }
 
cable_info = (struct mlx4_cable_info *)inmad->data;
cable_info->dev_mem_address = cpu_to_be16(offset);
-- 
1.8.3.1



Re: [PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames

2019-02-06 Thread Saeed Mahameed
On Wed, Feb 6, 2019 at 12:17 PM Eric Dumazet  wrote:
>
> On Wed, Feb 6, 2019 at 12:15 PM Saeed Mahameed
>  wrote:
> >
> > Ok, just verified, csum complete (cqe->checksum) is provided and valid
> > for non-TCP/UDP ip packets, (on specific ConnectX3 HWs)
> > e.g. ICMP packets or IP fragments go through csum complete,  that
> > being said, looking at the code before my patch.
> > when cqe->checksum complete is not valid a IP non-TCP/UDP packet will
> > report csum NONE, which means my
> > TODO comment is valid even without my patch :).
> >
> > We can remove the TODO, although i am fine with it if it kept there,
> > since it is valid,
> > but we must add a future optimization task (to tariq's backlog ;)) for
> > IP non-TCP/UDP traffic to check for
> > csum unnecessary when csum complete is not an option.
> >
> >
>
> Thanks for double checking.
> You might add more details in the changelog for future generations ;)

Great, I will do that, we will post V2,

Thank you Eric.


Re: [PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames

2019-02-06 Thread Eric Dumazet
On Wed, Feb 6, 2019 at 12:15 PM Saeed Mahameed
 wrote:
>
> Ok, just verified, csum complete (cqe->checksum) is provided and valid
> for non-TCP/UDP ip packets, (on specific ConnectX3 HWs)
> e.g. ICMP packets or IP fragments go through csum complete,  that
> being said, looking at the code before my patch.
> when cqe->checksum complete is not valid a IP non-TCP/UDP packet will
> report csum NONE, which means my
> TODO comment is valid even without my patch :).
>
> We can remove the TODO, although i am fine with it if it kept there,
> since it is valid,
> but we must add a future optimization task (to tariq's backlog ;)) for
> IP non-TCP/UDP traffic to check for
> csum unnecessary when csum complete is not an option.
>
>

Thanks for double checking.
You might add more details in the changelog for future generations ;)


Re: [PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames

2019-02-06 Thread Saeed Mahameed
On Thu, Jan 31, 2019 at 4:06 PM Saeed Mahameed  wrote:
>
> On Thu, 2019-01-31 at 11:42 -0800, Eric Dumazet wrote:
> > On Thu, Jan 31, 2019 at 11:27 AM Saeed Mahameed 
> > wrote:
> > > Are you sure ? you are claiming that the hardware will skip csum
> > > complete i.e cqe->checksum will be 0x for padded short IP
> > > frames.
> > > i don't think this is the case, the whole bug is that the hw does
> > > provide a partial cqe->checksum (i.e doesn't included the padding
> > > bytes) even for short eth frames.
> >
> > If the padding is not included, then cqe->checksum is 0x for
> > correctly received frames.
> >
> > Otherwise, what would be cqe->checksum in this case ? A random value
> > ?
>
> the actual checksum of IP headers+IP payload, while ignoring the
> padding bytes, which is the bug, let me double check..
>
>

Ok, just verified, csum complete (cqe->checksum) is provided and valid
for non-TCP/UDP ip packets, (on specific ConnectX3 HWs)
e.g. ICMP packets or IP fragments go through csum complete,  that
being said, looking at the code before my patch.
when cqe->checksum complete is not valid a IP non-TCP/UDP packet will
report csum NONE, which means my
TODO comment is valid even without my patch :).

We can remove the TODO, although i am fine with it if it kept there,
since it is valid,
but we must add a future optimization task (to tariq's backlog ;)) for
IP non-TCP/UDP traffic to check for
csum unnecessary when csum complete is not an option.

Thanks,
Saeed.


Re: [PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames

2019-01-31 Thread Saeed Mahameed
On Thu, 2019-01-31 at 11:42 -0800, Eric Dumazet wrote:
> On Thu, Jan 31, 2019 at 11:27 AM Saeed Mahameed 
> wrote:
> > Are you sure ? you are claiming that the hardware will skip csum
> > complete i.e cqe->checksum will be 0x for padded short IP
> > frames.
> > i don't think this is the case, the whole bug is that the hw does
> > provide a partial cqe->checksum (i.e doesn't included the padding
> > bytes) even for short eth frames.
> 
> If the padding is not included, then cqe->checksum is 0x for
> correctly received frames.
> 
> Otherwise, what would be cqe->checksum in this case ? A random value
> ?

the actual checksum of IP headers+IP payload, while ignoring the
padding bytes, which is the bug, let me double check..




Re: [PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames

2019-01-31 Thread Eric Dumazet
On Thu, Jan 31, 2019 at 11:27 AM Saeed Mahameed  wrote:
>
> Are you sure ? you are claiming that the hardware will skip csum
> complete i.e cqe->checksum will be 0x for padded short IP frames.
> i don't think this is the case, the whole bug is that the hw does
> provide a partial cqe->checksum (i.e doesn't included the padding
> bytes) even for short eth frames.

If the padding is not included, then cqe->checksum is 0x for
correctly received frames.

Otherwise, what would be cqe->checksum in this case ? A random value ?


Re: [PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames

2019-01-31 Thread Saeed Mahameed
On Thu, 2019-01-31 at 11:07 -0800, Eric Dumazet wrote:
> On Thu, Jan 31, 2019 at 11:04 AM Saeed Mahameed 
> wrote:
> > On Thu, 2019-01-31 at 07:02 -0800, Eric Dumazet wrote:
> > > On Thu, Jan 31, 2019 at 5:04 AM Tariq Toukan  > > >
> > > wrote:
> > > > From: Saeed Mahameed 
> > > > 
> > > > When an ethernet frame is padded to meet the minimum ethernet
> > > > frame
> > > > size, the padding octets are not covered by the hardware
> > > > checksum.
> > > > Fortunately the padding octets are usually zero's, which don't
> > > > affect
> > > > checksum. However, it is not guaranteed. For example, switches
> > > > might
> > > > choose to make other use of these octets.
> > > > This repeatedly causes kernel hardware checksum fault.
> > > > 
> > > > Prior to the cited commit below, skb checksum was forced to be
> > > > CHECKSUM_NONE when padding is detected. After it, we need to
> > > > keep
> > > > skb->csum updated. However, fixing up CHECKSUM_COMPLETE
> > > > requires to
> > > > verify and parse IP headers, it does not worth the effort as
> > > > the
> > > > packets
> > > > are so small that CHECKSUM_COMPLETE has no significant
> > > > advantage.
> > > > 
> > > > Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and
> > > > CHECKSUM_COMPLETE
> > > > are friends")
> > > > Cc: Eric Dumazet 
> > > > Signed-off-by: Saeed Mahameed 
> > > > Signed-off-by: Tariq Toukan 
> > > > ---
> > > >  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 19
> > > > +-
> > > > -
> > > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > > > 
> > > > Hi Dave,
> > > > Please queue for -stable >= v4.18.
> > > > 
> > > > Thanks,
> > > > Tariq
> > > > 
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > > > b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > > > index 9a0881cb7f51..fc8a11444e1a 100644
> > > > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > > > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > > > @@ -617,6 +617,8 @@ static int get_fixed_ipv6_csum(__wsum
> > > > hw_checksum, struct sk_buff *skb,
> > > >  }
> > > >  #endif
> > > > 
> > > > +#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
> > > > +
> > > >  /* We reach this function only after checking that any of
> > > >   * the (IPv4 | IPv6) bits are set in cqe->status.
> > > >   */
> > > > @@ -624,9 +626,22 @@ static int check_csum(struct mlx4_cqe
> > > > *cqe,
> > > > struct sk_buff *skb, void *va,
> > > >   netdev_features_t dev_features)
> > > >  {
> > > > __wsum hw_checksum = 0;
> > > > +   void *hdr;
> > > > +
> > > > +   /* CQE csum doesn't cover padding octets in short
> > > > ethernet
> > > > +* frames. And the pad field is appended prior to
> > > > calculating
> > > > +* and appending the FCS field.
> > > > +*
> > > > +* Detecting these padded frames requires to verify and
> > > > parse
> > > > +* IP headers, so we simply force all those small
> > > > frames to
> > > > be
> > > > +* CHECKSUM_NONE even if they are not padded.
> > > > +* TODO: better if we report CHECKSUM_UNNECESSARY but
> > > > this
> > > > +* demands some refactroing.
> > > 
> > > This TODO: looks bogus to me.
> > > 
> > > mlx4 driver first tries to use CHECKSUM_UNNECESSARY.
> > > 
> > > if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
> > >   MLX4_CQE_STATUS
> > > _UDP
> > > )) &&
> > >  (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
> > >   cqe->checksum == cpu_to_be16(0x)) {
> > >  ...
> > >   ip_summed = CHECKSUM_UNNECESSARY;
> > >   ...
> > > }
> > > 
> > > Then if this attempt failed, it tries to use CHECKSUM_COMPLETE
> > > (and
> > > calls this check_sum() function)
> > > 
> > > So there is no refactoring to be done for mlx4 : short
> > > IPv[46]+{UDP|TCP} frames get CHECKSUM_UNNECESSARY already
> > 
> > not exactly, considering mlx4 weird condition to decide to do
> > CHECKSUM_UNNECESSARY:
> > 
> > if ((cqe csum fields are valid) && cqe->checksum ==
> > cpu_to_be16(0x))
> >  { do CHECKSUM_UNNECESSARY }
> > else { do CHECKSUM_COMPLETE }
> > 
> > CHECKSUM_UNNECESSARY will be skipped if csum complete field is
> > valid
> > i.e (cqe->checksum != 0x)
> > 
> > but if checksum complete is to be skipped due to short_frame we
> > want to
> > go back to CHECKSUM_UNNECESSARY, this is the refactoring i am
> > talking
> > about :).
> > 
> > 
> > I hope now it is clear..
> 
> Not really, since in case of short frame, cqe->checksum will be
> 0x, since mlx4 does not include the padding bytes in the checksum
> in the first place.
> 

Are you sure ? you are claiming that the hardware will skip csum
complete i.e cqe->checksum will be 0x for padded short IP frames.
i don't think this is the case, the whole bug is that the hw does
provide a partial cqe->checksum (i.e doesn't included the padding
bytes) even for short eth frames.


> (For native IPv4/IPV6 TCP/UDP frames

Re: [PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames

2019-01-31 Thread Eric Dumazet
On Thu, Jan 31, 2019 at 11:04 AM Saeed Mahameed  wrote:
>
> On Thu, 2019-01-31 at 07:02 -0800, Eric Dumazet wrote:
> > On Thu, Jan 31, 2019 at 5:04 AM Tariq Toukan 
> > wrote:
> > > From: Saeed Mahameed 
> > >
> > > When an ethernet frame is padded to meet the minimum ethernet frame
> > > size, the padding octets are not covered by the hardware checksum.
> > > Fortunately the padding octets are usually zero's, which don't
> > > affect
> > > checksum. However, it is not guaranteed. For example, switches
> > > might
> > > choose to make other use of these octets.
> > > This repeatedly causes kernel hardware checksum fault.
> > >
> > > Prior to the cited commit below, skb checksum was forced to be
> > > CHECKSUM_NONE when padding is detected. After it, we need to keep
> > > skb->csum updated. However, fixing up CHECKSUM_COMPLETE requires to
> > > verify and parse IP headers, it does not worth the effort as the
> > > packets
> > > are so small that CHECKSUM_COMPLETE has no significant advantage.
> > >
> > > Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE
> > > are friends")
> > > Cc: Eric Dumazet 
> > > Signed-off-by: Saeed Mahameed 
> > > Signed-off-by: Tariq Toukan 
> > > ---
> > >  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 19 +-
> > > -
> > >  1 file changed, 17 insertions(+), 2 deletions(-)
> > >
> > > Hi Dave,
> > > Please queue for -stable >= v4.18.
> > >
> > > Thanks,
> > > Tariq
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > > b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > > index 9a0881cb7f51..fc8a11444e1a 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > > @@ -617,6 +617,8 @@ static int get_fixed_ipv6_csum(__wsum
> > > hw_checksum, struct sk_buff *skb,
> > >  }
> > >  #endif
> > >
> > > +#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
> > > +
> > >  /* We reach this function only after checking that any of
> > >   * the (IPv4 | IPv6) bits are set in cqe->status.
> > >   */
> > > @@ -624,9 +626,22 @@ static int check_csum(struct mlx4_cqe *cqe,
> > > struct sk_buff *skb, void *va,
> > >   netdev_features_t dev_features)
> > >  {
> > > __wsum hw_checksum = 0;
> > > +   void *hdr;
> > > +
> > > +   /* CQE csum doesn't cover padding octets in short ethernet
> > > +* frames. And the pad field is appended prior to
> > > calculating
> > > +* and appending the FCS field.
> > > +*
> > > +* Detecting these padded frames requires to verify and
> > > parse
> > > +* IP headers, so we simply force all those small frames to
> > > be
> > > +* CHECKSUM_NONE even if they are not padded.
> > > +* TODO: better if we report CHECKSUM_UNNECESSARY but this
> > > +* demands some refactroing.
> >
> > This TODO: looks bogus to me.
> >
> > mlx4 driver first tries to use CHECKSUM_UNNECESSARY.
> >
> > if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
> >   MLX4_CQE_STATUS_UDP
> > )) &&
> >  (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
> >   cqe->checksum == cpu_to_be16(0x)) {
> >  ...
> >   ip_summed = CHECKSUM_UNNECESSARY;
> >   ...
> > }
> >
> > Then if this attempt failed, it tries to use CHECKSUM_COMPLETE (and
> > calls this check_sum() function)
> >
> > So there is no refactoring to be done for mlx4 : short
> > IPv[46]+{UDP|TCP} frames get CHECKSUM_UNNECESSARY already
>
> not exactly, considering mlx4 weird condition to decide to do
> CHECKSUM_UNNECESSARY:
>
> if ((cqe csum fields are valid) && cqe->checksum ==
> cpu_to_be16(0x))
>  { do CHECKSUM_UNNECESSARY }
> else { do CHECKSUM_COMPLETE }
>
> CHECKSUM_UNNECESSARY will be skipped if csum complete field is valid
> i.e (cqe->checksum != 0x)
>
> but if checksum complete is to be skipped due to short_frame we want to
> go back to CHECKSUM_UNNECESSARY, this is the refactoring i am talking
> about :).
>
>
> I hope now it is clear..

Not really, since in case of short frame, cqe->checksum will be
0x, since mlx4 does not include the padding bytes in the checksum
in the first place.

(For native IPv4/IPV6 TCP/UDP frames that is)

>
>
> >
> > > +*/
> > > +   if (short_frame(skb->len))
> > > +   return -EINVAL;
> > >
> > > -   void *hdr = (u8 *)va + sizeof(struct ethhdr);
> > > -
> > > +   hdr = (u8 *)va + sizeof(struct ethhdr);
> > > hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > >
> > > if (cqe->vlan_my_qpn &
> > > cpu_to_be32(MLX4_CQE_CVLAN_PRESENT_MASK) &&
> > > --
> > > 1.8.3.1
> > >


Re: [PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames

2019-01-31 Thread Saeed Mahameed
On Thu, 2019-01-31 at 07:02 -0800, Eric Dumazet wrote:
> On Thu, Jan 31, 2019 at 5:04 AM Tariq Toukan 
> wrote:
> > From: Saeed Mahameed 
> > 
> > When an ethernet frame is padded to meet the minimum ethernet frame
> > size, the padding octets are not covered by the hardware checksum.
> > Fortunately the padding octets are usually zero's, which don't
> > affect
> > checksum. However, it is not guaranteed. For example, switches
> > might
> > choose to make other use of these octets.
> > This repeatedly causes kernel hardware checksum fault.
> > 
> > Prior to the cited commit below, skb checksum was forced to be
> > CHECKSUM_NONE when padding is detected. After it, we need to keep
> > skb->csum updated. However, fixing up CHECKSUM_COMPLETE requires to
> > verify and parse IP headers, it does not worth the effort as the
> > packets
> > are so small that CHECKSUM_COMPLETE has no significant advantage.
> > 
> > Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE
> > are friends")
> > Cc: Eric Dumazet 
> > Signed-off-by: Saeed Mahameed 
> > Signed-off-by: Tariq Toukan 
> > ---
> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 19 +-
> > -
> >  1 file changed, 17 insertions(+), 2 deletions(-)
> > 
> > Hi Dave,
> > Please queue for -stable >= v4.18.
> > 
> > Thanks,
> > Tariq
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > index 9a0881cb7f51..fc8a11444e1a 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > @@ -617,6 +617,8 @@ static int get_fixed_ipv6_csum(__wsum
> > hw_checksum, struct sk_buff *skb,
> >  }
> >  #endif
> > 
> > +#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
> > +
> >  /* We reach this function only after checking that any of
> >   * the (IPv4 | IPv6) bits are set in cqe->status.
> >   */
> > @@ -624,9 +626,22 @@ static int check_csum(struct mlx4_cqe *cqe,
> > struct sk_buff *skb, void *va,
> >   netdev_features_t dev_features)
> >  {
> > __wsum hw_checksum = 0;
> > +   void *hdr;
> > +
> > +   /* CQE csum doesn't cover padding octets in short ethernet
> > +* frames. And the pad field is appended prior to
> > calculating
> > +* and appending the FCS field.
> > +*
> > +* Detecting these padded frames requires to verify and
> > parse
> > +* IP headers, so we simply force all those small frames to
> > be
> > +* CHECKSUM_NONE even if they are not padded.
> > +* TODO: better if we report CHECKSUM_UNNECESSARY but this
> > +* demands some refactroing.
> 
> This TODO: looks bogus to me.
> 
> mlx4 driver first tries to use CHECKSUM_UNNECESSARY.
> 
> if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
>   MLX4_CQE_STATUS_UDP
> )) &&
>  (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
>   cqe->checksum == cpu_to_be16(0x)) {
>  ...
>   ip_summed = CHECKSUM_UNNECESSARY;
>   ...
> }
> 
> Then if this attempt failed, it tries to use CHECKSUM_COMPLETE (and
> calls this check_sum() function)
> 
> So there is no refactoring to be done for mlx4 : short
> IPv[46]+{UDP|TCP} frames get CHECKSUM_UNNECESSARY already

not exactly, considering mlx4 weird condition to decide to do
CHECKSUM_UNNECESSARY:

if ((cqe csum fields are valid) && cqe->checksum ==
cpu_to_be16(0x)) 
 { do CHECKSUM_UNNECESSARY }
else { do CHECKSUM_COMPLETE }

CHECKSUM_UNNECESSARY will be skipped if csum complete field is valid 
i.e (cqe->checksum != 0x)

but if checksum complete is to be skipped due to short_frame we want to
go back to CHECKSUM_UNNECESSARY, this is the refactoring i am talking
about :).


I hope now it is clear.. 


> 
> > +*/
> > +   if (short_frame(skb->len))
> > +   return -EINVAL;
> > 
> > -   void *hdr = (u8 *)va + sizeof(struct ethhdr);
> > -
> > +   hdr = (u8 *)va + sizeof(struct ethhdr);
> > hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
> > 
> > if (cqe->vlan_my_qpn &
> > cpu_to_be32(MLX4_CQE_CVLAN_PRESENT_MASK) &&
> > --
> > 1.8.3.1
> > 


Re: [PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames

2019-01-31 Thread David Miller
From: Tariq Toukan 
Date: Thu, 31 Jan 2019 15:02:43 +0200

> From: Saeed Mahameed 
> 
> When an ethernet frame is padded to meet the minimum ethernet frame
> size, the padding octets are not covered by the hardware checksum.
> Fortunately the padding octets are usually zero's, which don't affect
> checksum. However, it is not guaranteed. For example, switches might
> choose to make other use of these octets.
> This repeatedly causes kernel hardware checksum fault.
> 
> Prior to the cited commit below, skb checksum was forced to be
> CHECKSUM_NONE when padding is detected. After it, we need to keep
> skb->csum updated. However, fixing up CHECKSUM_COMPLETE requires to
> verify and parse IP headers, it does not worth the effort as the packets
> are so small that CHECKSUM_COMPLETE has no significant advantage.
> 
> Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
> friends")
> Cc: Eric Dumazet 
> Signed-off-by: Saeed Mahameed 
> Signed-off-by: Tariq Toukan 
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> Hi Dave,
> Please queue for -stable >= v4.18.

Please look into Eric's feedback and update the comment as needed.

Thank you.


Re: [PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames

2019-01-31 Thread Eric Dumazet
On Thu, Jan 31, 2019 at 5:04 AM Tariq Toukan  wrote:
>
> From: Saeed Mahameed 
>
> When an ethernet frame is padded to meet the minimum ethernet frame
> size, the padding octets are not covered by the hardware checksum.
> Fortunately the padding octets are usually zero's, which don't affect
> checksum. However, it is not guaranteed. For example, switches might
> choose to make other use of these octets.
> This repeatedly causes kernel hardware checksum fault.
>
> Prior to the cited commit below, skb checksum was forced to be
> CHECKSUM_NONE when padding is detected. After it, we need to keep
> skb->csum updated. However, fixing up CHECKSUM_COMPLETE requires to
> verify and parse IP headers, it does not worth the effort as the packets
> are so small that CHECKSUM_COMPLETE has no significant advantage.
>
> Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are 
> friends")
> Cc: Eric Dumazet 
> Signed-off-by: Saeed Mahameed 
> Signed-off-by: Tariq Toukan 
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
>
> Hi Dave,
> Please queue for -stable >= v4.18.
>
> Thanks,
> Tariq
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 9a0881cb7f51..fc8a11444e1a 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -617,6 +617,8 @@ static int get_fixed_ipv6_csum(__wsum hw_checksum, struct 
> sk_buff *skb,
>  }
>  #endif
>
> +#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
> +
>  /* We reach this function only after checking that any of
>   * the (IPv4 | IPv6) bits are set in cqe->status.
>   */
> @@ -624,9 +626,22 @@ static int check_csum(struct mlx4_cqe *cqe, struct 
> sk_buff *skb, void *va,
>   netdev_features_t dev_features)
>  {
> __wsum hw_checksum = 0;
> +   void *hdr;
> +
> +   /* CQE csum doesn't cover padding octets in short ethernet
> +* frames. And the pad field is appended prior to calculating
> +* and appending the FCS field.
> +*
> +* Detecting these padded frames requires to verify and parse
> +* IP headers, so we simply force all those small frames to be
> +* CHECKSUM_NONE even if they are not padded.
> +* TODO: better if we report CHECKSUM_UNNECESSARY but this
> +* demands some refactroing.

This TODO: looks bogus to me.

mlx4 driver first tries to use CHECKSUM_UNNECESSARY.

if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
  MLX4_CQE_STATUS_UDP)) &&
 (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
  cqe->checksum == cpu_to_be16(0x)) {
 ...
  ip_summed = CHECKSUM_UNNECESSARY;
  ...
}

Then if this attempt failed, it tries to use CHECKSUM_COMPLETE (and
calls this check_sum() function)

So there is no refactoring to be done for mlx4 : short
IPv[46]+{UDP|TCP} frames get CHECKSUM_UNNECESSARY already

> +*/
> +   if (short_frame(skb->len))
> +   return -EINVAL;
>
> -   void *hdr = (u8 *)va + sizeof(struct ethhdr);
> -
> +   hdr = (u8 *)va + sizeof(struct ethhdr);
> hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
>
> if (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_CVLAN_PRESENT_MASK) &&
> --
> 1.8.3.1
>


[PATCH net] net/mlx4_en: Force CHECKSUM_NONE for short ethernet frames

2019-01-31 Thread Tariq Toukan
From: Saeed Mahameed 

When an ethernet frame is padded to meet the minimum ethernet frame
size, the padding octets are not covered by the hardware checksum.
Fortunately the padding octets are usually zero's, which don't affect
checksum. However, it is not guaranteed. For example, switches might
choose to make other use of these octets.
This repeatedly causes kernel hardware checksum fault.

Prior to the cited commit below, skb checksum was forced to be
CHECKSUM_NONE when padding is detected. After it, we need to keep
skb->csum updated. However, fixing up CHECKSUM_COMPLETE requires to
verify and parse IP headers, it does not worth the effort as the packets
are so small that CHECKSUM_COMPLETE has no significant advantage.

Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
Cc: Eric Dumazet 
Signed-off-by: Saeed Mahameed 
Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

Hi Dave,
Please queue for -stable >= v4.18.

Thanks,
Tariq

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 9a0881cb7f51..fc8a11444e1a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -617,6 +617,8 @@ static int get_fixed_ipv6_csum(__wsum hw_checksum, struct 
sk_buff *skb,
 }
 #endif
 
+#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
+
 /* We reach this function only after checking that any of
  * the (IPv4 | IPv6) bits are set in cqe->status.
  */
@@ -624,9 +626,22 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff 
*skb, void *va,
  netdev_features_t dev_features)
 {
__wsum hw_checksum = 0;
+   void *hdr;
+
+   /* CQE csum doesn't cover padding octets in short ethernet
+* frames. And the pad field is appended prior to calculating
+* and appending the FCS field.
+*
+* Detecting these padded frames requires to verify and parse
+* IP headers, so we simply force all those small frames to be
+* CHECKSUM_NONE even if they are not padded.
+* TODO: better if we report CHECKSUM_UNNECESSARY but this
+* demands some refactroing.
+*/
+   if (short_frame(skb->len))
+   return -EINVAL;
 
-   void *hdr = (u8 *)va + sizeof(struct ethhdr);
-
+   hdr = (u8 *)va + sizeof(struct ethhdr);
hw_checksum = csum_unfold((__force __sum16)cqe->checksum);
 
if (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_CVLAN_PRESENT_MASK) &&
-- 
1.8.3.1



Re: [PATCH net] net/mlx4_en: add a missing include

2018-10-31 Thread Abdul Haleem
On Tue, 2018-10-30 at 00:18 -0700, Eric Dumazet wrote:
> Abdul Haleem reported a build error on ppc :
> 
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: `struct
> iphdr` declared inside parameter list [enabled by default]
>struct iphdr *iph)
>   ^
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is
> only this definition or declaration, which is probably not what you want
> [enabled by default]
> drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function
> get_fixed_ipv4_csum:
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing
> pointer to incomplete type
>   __u8 ipproto = iph->protocol;
> ^
> 
> Fixes: 55469bc6b577 ("drivers: net: remove  inclusion when 
> not needed")
> Signed-off-by: Eric Dumazet 
> Reported-by: Abdul Haleem 
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 
> 5a6d0919533d6e0e619927abd753c5d07ed95dac..db00bf1c23f5ad31d64652ddc8bee32e2e7534c8
>  100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -43,6 +43,7 @@
>  #include 
>  #include 
> 
> +#include 
>  #if IS_ENABLED(CONFIG_IPV6)
>  #include 
>  #endif

Tested-by: Abdul Haleem 
-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre





Re: [PATCH net] net/mlx4_en: add a missing include

2018-10-31 Thread Tariq Toukan


On 30/10/2018 8:19 PM, David Miller wrote:
> From: Eric Dumazet 
> Date: Tue, 30 Oct 2018 00:18:12 -0700
> 
>> Abdul Haleem reported a build error on ppc :
>>
>> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: `struct
>> iphdr` declared inside parameter list [enabled by default]
>> struct iphdr *iph)
>>^
>> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is
>> only this definition or declaration, which is probably not what you want
>> [enabled by default]
>> drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function
>> get_fixed_ipv4_csum:
>> drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing
>> pointer to incomplete type
>>__u8 ipproto = iph->protocol;
>>  ^
>>
>> Fixes: 55469bc6b577 ("drivers: net: remove  inclusion when 
>> not needed")
>> Signed-off-by: Eric Dumazet 
>> Reported-by: Abdul Haleem 
> 
> Applied, thanks Eric.
> 

Thanks for the report, Abdul Haleem.
Thanks for your patch, Eric.

Regards,
Tariq


Re: [PATCH net] net/mlx4_en: add a missing include

2018-10-30 Thread David Miller
From: Eric Dumazet 
Date: Tue, 30 Oct 2018 00:18:12 -0700

> Abdul Haleem reported a build error on ppc :
> 
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: `struct
> iphdr` declared inside parameter list [enabled by default]
>struct iphdr *iph)
>   ^
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is
> only this definition or declaration, which is probably not what you want
> [enabled by default]
> drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function
> get_fixed_ipv4_csum:
> drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing
> pointer to incomplete type
>   __u8 ipproto = iph->protocol;
> ^
> 
> Fixes: 55469bc6b577 ("drivers: net: remove  inclusion when 
> not needed")
> Signed-off-by: Eric Dumazet 
> Reported-by: Abdul Haleem 

Applied, thanks Eric.


[PATCH net] net/mlx4_en: add a missing include

2018-10-30 Thread Eric Dumazet
Abdul Haleem reported a build error on ppc :

drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: `struct
iphdr` declared inside parameter list [enabled by default]
   struct iphdr *iph)
  ^
drivers/net/ethernet/mellanox/mlx4/en_rx.c:582:18: warning: its scope is
only this definition or declaration, which is probably not what you want
[enabled by default]
drivers/net/ethernet/mellanox/mlx4/en_rx.c: In function
get_fixed_ipv4_csum:
drivers/net/ethernet/mellanox/mlx4/en_rx.c:586:20: error: dereferencing
pointer to incomplete type
  __u8 ipproto = iph->protocol;
^

Fixes: 55469bc6b577 ("drivers: net: remove  inclusion when not 
needed")
Signed-off-by: Eric Dumazet 
Reported-by: Abdul Haleem 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 
5a6d0919533d6e0e619927abd753c5d07ed95dac..db00bf1c23f5ad31d64652ddc8bee32e2e7534c8
 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 
+#include 
 #if IS_ENABLED(CONFIG_IPV6)
 #include 
 #endif
-- 
2.19.1.568.g152ad8e336-goog



Re: [PATCH net] net/mlx4_en: Don't reuse RX page when XDP is set

2018-07-16 Thread David Miller
From: Tariq Toukan 
Date: Sun, 15 Jul 2018 13:54:39 +0300

> From: Saeed Mahameed 
> 
> When a new rx packet arrives, the rx path will decide whether to reuse
> the remainder of the page or not according to one of the below conditions:
> 1. frag_info->frag_stride == PAGE_SIZE / 2
> 2. frags->page_offset + frag_info->frag_size > PAGE_SIZE;
> 
> The first condition is no met for when XDP is set.
> For XDP, page_offset is always set to priv->rx_headroom which is
> XDP_PACKET_HEADROOM and frag_info->frag_size is around mtu size + some
> padding, still the 2nd release condition will hold since
> XDP_PACKET_HEADROOM + 1536 < PAGE_SIZE, as a result the page will not
> be released and will be _wrongly_ reused for next free rx descriptor.
> 
> In XDP there is an assumption to have a page per packet and reuse can
> break such assumption and might cause packet data corruptions.
> 
> Fix this by adding an extra condition (!priv->rx_headroom) to the 2nd
> case to avoid page reuse when XDP is set, since rx_headroom is set to 0
> for non XDP setup and set to XDP_PACKET_HEADROOM for XDP setup.
> 
> No additional cache line is required for the new condition.
> 
> Fixes: 34db548bfb95 ("mlx4: add page recycling in receive path")
> Signed-off-by: Saeed Mahameed 
> Signed-off-by: Tariq Toukan 
> Suggested-by: Martin KaFai Lau 

Applied and queued up for -stable.


[PATCH net] net/mlx4_en: Don't reuse RX page when XDP is set

2018-07-15 Thread Tariq Toukan
From: Saeed Mahameed 

When a new rx packet arrives, the rx path will decide whether to reuse
the remainder of the page or not according to one of the below conditions:
1. frag_info->frag_stride == PAGE_SIZE / 2
2. frags->page_offset + frag_info->frag_size > PAGE_SIZE;

The first condition is no met for when XDP is set.
For XDP, page_offset is always set to priv->rx_headroom which is
XDP_PACKET_HEADROOM and frag_info->frag_size is around mtu size + some
padding, still the 2nd release condition will hold since
XDP_PACKET_HEADROOM + 1536 < PAGE_SIZE, as a result the page will not
be released and will be _wrongly_ reused for next free rx descriptor.

In XDP there is an assumption to have a page per packet and reuse can
break such assumption and might cause packet data corruptions.

Fix this by adding an extra condition (!priv->rx_headroom) to the 2nd
case to avoid page reuse when XDP is set, since rx_headroom is set to 0
for non XDP setup and set to XDP_PACKET_HEADROOM for XDP setup.

No additional cache line is required for the new condition.

Fixes: 34db548bfb95 ("mlx4: add page recycling in receive path")
Signed-off-by: Saeed Mahameed 
Signed-off-by: Tariq Toukan 
Suggested-by: Martin KaFai Lau 
CC: Eric Dumazet 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 9f54ccbddea7..3360f7b9ee73 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -474,10 +474,10 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv 
*priv,
 {
const struct mlx4_en_frag_info *frag_info = priv->frag_info;
unsigned int truesize = 0;
+   bool release = true;
int nr, frag_size;
struct page *page;
dma_addr_t dma;
-   bool release;
 
/* Collect used fragments while replacing them in the HW descriptors */
for (nr = 0;; frags++) {
@@ -500,7 +500,11 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv 
*priv,
release = page_count(page) != 1 ||
  page_is_pfmemalloc(page) ||
  page_to_nid(page) != numa_mem_id();
-   } else {
+   } else if (!priv->rx_headroom) {
+   /* rx_headroom for non XDP setup is always 0.
+* When XDP is set, the above condition will
+* guarantee page is always released.
+*/
u32 sz_align = ALIGN(frag_size, SMP_CACHE_BYTES);
 
frags->page_offset += sz_align;
-- 
1.8.3.1



Re: [PATCH net] net/mlx4_en: Verify coalescing parameters are in range

2018-05-10 Thread David Miller
From: Tariq Toukan 
Date: Wed,  9 May 2018 18:35:13 +0300

> From: Moshe Shemesh 
> 
> Add check of coalescing parameters received through ethtool are within
> range of values supported by the HW.
> Driver gets the coalescing rx/tx-usecs and rx/tx-frames as set by the
> users through ethtool. The ethtool support up to 32 bit value for each.
> However, mlx4 modify cq limits the coalescing time parameter and
> coalescing frames parameters to 16 bits.
> Return out of range error if user tries to set these parameters to
> higher values.
> Change type of sample-interval and adaptive_rx_coal parameters in mlx4
> driver to u32 as the ethtool holds them as u32 and these parameters are
> not limited due to mlx4 HW.
> 
> Fixes: c27a02cd94d6 ('mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC')
> Signed-off-by: Moshe Shemesh 
> Signed-off-by: Tariq Toukan 

Applied and queued up for -stable.


[PATCH net] net/mlx4_en: Verify coalescing parameters are in range

2018-05-09 Thread Tariq Toukan
From: Moshe Shemesh 

Add check of coalescing parameters received through ethtool are within
range of values supported by the HW.
Driver gets the coalescing rx/tx-usecs and rx/tx-frames as set by the
users through ethtool. The ethtool support up to 32 bit value for each.
However, mlx4 modify cq limits the coalescing time parameter and
coalescing frames parameters to 16 bits.
Return out of range error if user tries to set these parameters to
higher values.
Change type of sample-interval and adaptive_rx_coal parameters in mlx4
driver to u32 as the ethtool holds them as u32 and these parameters are
not limited due to mlx4 HW.

Fixes: c27a02cd94d6 ('mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC')
Signed-off-by: Moshe Shemesh 
Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 16 
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h|  7 +--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index a30a2e95d13f..f11b45001cad 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -1027,6 +1027,22 @@ static int mlx4_en_set_coalesce(struct net_device *dev,
if (!coal->tx_max_coalesced_frames_irq)
return -EINVAL;
 
+   if (coal->tx_coalesce_usecs > MLX4_EN_MAX_COAL_TIME ||
+   coal->rx_coalesce_usecs > MLX4_EN_MAX_COAL_TIME ||
+   coal->rx_coalesce_usecs_low > MLX4_EN_MAX_COAL_TIME ||
+   coal->rx_coalesce_usecs_high > MLX4_EN_MAX_COAL_TIME) {
+   netdev_info(dev, "%s: maximum coalesce time supported is %d 
usecs\n",
+   __func__, MLX4_EN_MAX_COAL_TIME);
+   return -ERANGE;
+   }
+
+   if (coal->tx_max_coalesced_frames > MLX4_EN_MAX_COAL_PKTS ||
+   coal->rx_max_coalesced_frames > MLX4_EN_MAX_COAL_PKTS) {
+   netdev_info(dev, "%s: maximum coalesced frames supported is 
%d\n",
+   __func__, MLX4_EN_MAX_COAL_PKTS);
+   return -ERANGE;
+   }
+
priv->rx_frames = (coal->rx_max_coalesced_frames ==
   MLX4_EN_AUTO_CONF) ?
MLX4_EN_RX_COAL_TARGET :
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h 
b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index f7c81133594f..ace6545f82e6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -132,6 +132,9 @@
 #define MLX4_EN_TX_COAL_PKTS   16
 #define MLX4_EN_TX_COAL_TIME   0x10
 
+#define MLX4_EN_MAX_COAL_PKTS  U16_MAX
+#define MLX4_EN_MAX_COAL_TIME  U16_MAX
+
 #define MLX4_EN_RX_RATE_LOW40
 #define MLX4_EN_RX_COAL_TIME_LOW   0
 #define MLX4_EN_RX_RATE_HIGH   45
@@ -552,8 +555,8 @@ struct mlx4_en_priv {
u16 rx_usecs_low;
u32 pkt_rate_high;
u16 rx_usecs_high;
-   u16 sample_interval;
-   u16 adaptive_rx_coal;
+   u32 sample_interval;
+   u32 adaptive_rx_coal;
u32 msg_enable;
u32 loopback_ok;
u32 validate_loopback;
-- 
1.8.3.1



Re: [PATCH net] net/mlx4_en: fix overflow in mlx4_en_init_timestamp()

2017-02-26 Thread David Miller
From: Eric Dumazet 
Date: Thu, 23 Feb 2017 15:22:43 -0800

> From: Eric Dumazet 
> 
> The cited commit makes a great job of finding optimal shift/multiplier
> values assuming a 10 seconds wrap around, but forgot to change the
> overflow_period computation.
> 
> It overflows in cyclecounter_cyc2ns(), and the final result is 804 ms,
> which is silly.
> 
> Lets simply use 5 seconds, no need to recompute this, given how it is
> supposed to work.
> 
> Later, we will use a timer instead of a work queue, since the new RX
> allocation schem will no longer need mlx4_en_recover_from_oom() and the
> service_task firing every 250 ms.
> 
> Fixes: 31c128b66e5b ("net/mlx4_en: Choose time-stamping shift value according 
> to HW frequency")
> Signed-off-by: Eric Dumazet 

Applied, thanks.


Re: [PATCH net] net/mlx4_en: fix overflow in mlx4_en_init_timestamp()

2017-02-26 Thread Tariq Toukan



On 24/02/2017 1:22 AM, Eric Dumazet wrote:

From: Eric Dumazet 

The cited commit makes a great job of finding optimal shift/multiplier
values assuming a 10 seconds wrap around, but forgot to change the
overflow_period computation.

It overflows in cyclecounter_cyc2ns(), and the final result is 804 ms,
which is silly.

Lets simply use 5 seconds, no need to recompute this, given how it is
supposed to work.

Later, we will use a timer instead of a work queue, since the new RX
allocation schem will no longer need mlx4_en_recover_from_oom() and the
service_task firing every 250 ms.

Fixes: 31c128b66e5b ("net/mlx4_en: Choose time-stamping shift value according to HW 
frequency")
Signed-off-by: Eric Dumazet 
Cc: Tariq Toukan 
Cc: Eugenia Emantayev 
---
 drivers/net/ethernet/mellanox/mlx4/en_clock.c |   18 +++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |1
 2 files changed, 8 insertions(+), 11 deletions(-)



Reviewed-by: Tariq Toukan 

Thanks for your patch.



Re: [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker

2017-02-26 Thread Eric Dumazet
On Sun, 2017-02-26 at 09:40 -0800, Eric Dumazet wrote:
> NAPI_STATE_SCHED
> 
> Actually we could use an additional bit for that, that the driver would
> set even if NAPI_STATE_SCHED could not be grabbed.

Just to be clear :

Drivers would require no change, this would be done in
existing helpers.





Re: [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker

2017-02-26 Thread Eric Dumazet
On Sun, 2017-02-26 at 09:34 -0800, Eric Dumazet wrote:

> I do not believe this bug is mlx4 specific.
> 
> Anything doing the following while hard irq were not masked :
> 
> local_bh_disable();
> napi_reschedule(&priv->rx_cq[ring]->napi);
> local_bh_enable();
> 
> Like in mlx4_en_recover_from_oom()
> 
> Can trigger the issue really.
> 
> Unfortunately I do not see how core layer can handle this.
> Only the driver hard irq could possibly know that it could not grab
> NAPI_STATE_SCHED

Actually we could use an additional bit for that, that the driver would
set even if NAPI_STATE_SCHED could not be grabbed.

Let me try something.






Re: [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker

2017-02-26 Thread Eric Dumazet
On Sun, 2017-02-26 at 18:32 +0200, Saeed Mahameed wrote:
> On Sat, Feb 25, 2017 at 4:22 PM, Eric Dumazet  wrote:
> > From: Eric Dumazet 
> >
> > While playing with hardware timestamping of RX packets, I found
> > that some packets were received by TCP stack with a ~200 ms delay...
> >
> > Since the timestamp was provided by the NIC, and my probe was added
> > in tcp_v4_rcv() while in BH handler, I was confident it was not
> > a sender issue, or a drop in the network.
> >
> > This would happen with a very low probability, but hurting RPC
> > workloads.
> >
> > I could track this down to the cx-3 (mlx4 driver) card that was
> > apparently sending an IRQ before we would arm it.
> >
> 
> Hi Eric,
> 
> This is highly unlikely, the hardware should not do that, and if this
> is really the case
> we need to hunt down the root cause and not work around it.

Well, I definitely see the interrupt coming while the napi bit is not
available.


> 
> > A NAPI driver normally arms the IRQ after the napi_complete_done(),
> > after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> > it.
> >
> > This patch adds a new rx_irq_miss field that is incremented every time
> > the hard IRQ handler could not grab NAPI_STATE_SCHED.
> >
> > Then, mlx4_en_poll_rx_cq() is able to detect that rx_irq was incremented
> > and attempts to read more packets, if it can re-acquire NAPI_STATE_SCHED
> 
> Are you sure this is not some kind of race condition with the busy
> polling mechanism
> Introduced in ("net/mlx4_en: use napi_complete_done() return value") ?
> Maybe the busy polling thread when it detects that it wants to yield,
> it arms the CQ too early (when napi is not ready)?
> 

Good question.

No busy polling in my tests.

I have triggers by using on a 2x10 Gbit host

(bonding of two 10Gbit mlx4 ports)


ethtool -C eth1 adaptive-rx on rx-usecs-low 0
ethtool -C eth2 adaptive-rx on rx-usecs-low 0
./super_netperf 9 -H lpaa6 -t TCP_RR -l 1 -- -r 100,100 &


4 rx and tx queues per NIC, fq packet scheduler on them.

TCP timestamps are on. (this might be important to get the last packet
of a given size. In my case 912 bytes, with a PSH flag)


> Anyway Tariq and I would like to further investigate the fired IRQ
> while CQ is not armed.  It smells
> like  a bug in the driver/napi level, it is not a HW expected behavior.
> 
> Any pointers on how to reproduce ?  how often would the "rx_irq_miss"
> counter advance on a linerate RX load ?


About 1000 times per second on my hosts, receiving about 1.2 Mpps.
But most of these misses are not an issue because next packet is
arriving maybe less than 10 usec later.

Note that the bug is hard to notice, because TCP would fast retransmit,
or the next packet was coming soon enough.

You have to be unlucky enough that the RX queue that missed the NAPI
schedule receive no more packets before the 200 ms RTO timer, and the
packet stuck in the RX ring is the last packet of the RPC.


I believe part of the problem is that NAPI_STATE_SCHED can be grabbed by
a process while hard irqs were not disabled.

I do not believe this bug is mlx4 specific.

Anything doing the following while hard irq were not masked :

local_bh_disable();
napi_reschedule(&priv->rx_cq[ring]->napi);
local_bh_enable();

Like in mlx4_en_recover_from_oom()

Can trigger the issue really.

Unfortunately I do not see how core layer can handle this.
Only the driver hard irq could possibly know that it could not grab
NAPI_STATE_SCHED






Re: [PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker

2017-02-26 Thread Saeed Mahameed
On Sat, Feb 25, 2017 at 4:22 PM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> While playing with hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
>
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
>
> This would happen with a very low probability, but hurting RPC
> workloads.
>
> I could track this down to the cx-3 (mlx4 driver) card that was
> apparently sending an IRQ before we would arm it.
>

Hi Eric,

This is highly unlikely, the hardware should not do that, and if this
is really the case
we need to hunt down the root cause and not work around it.

> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
>
> This patch adds a new rx_irq_miss field that is incremented every time
> the hard IRQ handler could not grab NAPI_STATE_SCHED.
>
> Then, mlx4_en_poll_rx_cq() is able to detect that rx_irq was incremented
> and attempts to read more packets, if it can re-acquire NAPI_STATE_SCHED

Are you sure this is not some kind of race condition with the busy
polling mechanism
Introduced in ("net/mlx4_en: use napi_complete_done() return value") ?
Maybe the busy polling thread when it detects that it wants to yield,
it arms the CQ too early (when napi is not ready)?

Anyway Tariq and I would like to further investigate the fired IRQ
while CQ is not armed.  It smells
like  a bug in the driver/napi level, it is not a HW expected behavior.

Any pointers on how to reproduce ?  how often would the "rx_irq_miss"
counter advance on a linerate RX load ?

>
> Note that this work around would probably not work if the IRQ is spread
> over many cpus, since it assume the hard irq and softirq are handled by
> the same cpu. This kind of setup is buggy anyway because of reordering
> issues.
>
> Signed-off-by: Eric Dumazet 
> Cc: Tariq Toukan 
> Cc: Saeed Mahameed 
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |   32 +
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |1
>  2 files changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 
> 867292880c07a15124a0cf099d1fcda09926548e..7c262dc6a9971ca99a890bc3cff49289f0ef43fc
>  100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1132,10 +1132,14 @@ void mlx4_en_rx_irq(struct mlx4_cq *mcq)
> struct mlx4_en_cq *cq = container_of(mcq, struct mlx4_en_cq, mcq);
> struct mlx4_en_priv *priv = netdev_priv(cq->dev);
>
> -   if (likely(priv->port_up))
> -   napi_schedule_irqoff(&cq->napi);
> -   else
> +   if (likely(priv->port_up)) {
> +   if (napi_schedule_prep(&cq->napi))
> +   __napi_schedule_irqoff(&cq->napi);
> +   else
> +   cq->rx_irq_missed++;
> +   } else {
> mlx4_en_arm_cq(priv, cq);
> +   }
>  }
>
>  /* Rx CQ polling - called by NAPI */
> @@ -1144,9 +1148,12 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int 
> budget)
> struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
> struct net_device *dev = cq->dev;
> struct mlx4_en_priv *priv = netdev_priv(dev);
> -   int done;
> +   int done = 0;
> +   u32 rx_irq_missed;
>
> -   done = mlx4_en_process_rx_cq(dev, cq, budget);
> +again:
> +   rx_irq_missed = READ_ONCE(cq->rx_irq_missed);
> +   done += mlx4_en_process_rx_cq(dev, cq, budget - done);
>
> /* If we used up all the quota - we're probably not done yet... */
> if (done == budget) {
> @@ -1171,10 +1178,21 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int 
> budget)
>  */
> if (done)
> done--;
> +   if (napi_complete_done(napi, done))
> +   mlx4_en_arm_cq(priv, cq);
> +   return done;
> }
> -   /* Done for now */
> -   if (napi_complete_done(napi, done))
> +   if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed))
> +   goto again;
> +
> +   if (napi_complete_done(napi, done)) {
> mlx4_en_arm_cq(priv, cq);
> +
> +   /* We might have received an interrupt too soon */
> +   if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed) &&
> +   napi_reschedule(napi))
> +   goto again;
> +   }
> return done;
>  }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h 
> b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 
> 4941b692e9479bc7800e92f9bfb13baf3ff7d0d4..030f305fcca7fa4b3b9a15be3fe11e572d665003
>  100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_

Re: [PATCH net] net/mlx4_en: fix overflow in mlx4_en_init_timestamp()

2017-02-25 Thread Or Gerlitz
On Fri, Feb 24, 2017 at 6:21 PM, David Miller  wrote:
> From: Eric Dumazet 
> Date: Thu, 23 Feb 2017

> Tariq please review.

Dave,

Just to re-iterate what we wrote here couple of time, the IL WW is
Sun-Thu on GMT+2 hours and hence this patch was sent when the
developers/maintainer are into weekend mode.

Or.


[PATCH net] net/mlx4_en: reception NAPI/IRQ race breaker

2017-02-25 Thread Eric Dumazet
From: Eric Dumazet 

While playing with hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but hurting RPC
workloads.

I could track this down to the cx-3 (mlx4 driver) card that was
apparently sending an IRQ before we would arm it.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

This patch adds a new rx_irq_miss field that is incremented every time
the hard IRQ handler could not grab NAPI_STATE_SCHED.

Then, mlx4_en_poll_rx_cq() is able to detect that rx_irq was incremented
and attempts to read more packets, if it can re-acquire NAPI_STATE_SCHED

Note that this work around would probably not work if the IRQ is spread
over many cpus, since it assume the hard irq and softirq are handled by
the same cpu. This kind of setup is buggy anyway because of reordering
issues.

Signed-off-by: Eric Dumazet 
Cc: Tariq Toukan 
Cc: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |   32 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |1 
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 
867292880c07a15124a0cf099d1fcda09926548e..7c262dc6a9971ca99a890bc3cff49289f0ef43fc
 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1132,10 +1132,14 @@ void mlx4_en_rx_irq(struct mlx4_cq *mcq)
struct mlx4_en_cq *cq = container_of(mcq, struct mlx4_en_cq, mcq);
struct mlx4_en_priv *priv = netdev_priv(cq->dev);
 
-   if (likely(priv->port_up))
-   napi_schedule_irqoff(&cq->napi);
-   else
+   if (likely(priv->port_up)) {
+   if (napi_schedule_prep(&cq->napi))
+   __napi_schedule_irqoff(&cq->napi);
+   else
+   cq->rx_irq_missed++;
+   } else {
mlx4_en_arm_cq(priv, cq);
+   }
 }
 
 /* Rx CQ polling - called by NAPI */
@@ -1144,9 +1148,12 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int 
budget)
struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
struct net_device *dev = cq->dev;
struct mlx4_en_priv *priv = netdev_priv(dev);
-   int done;
+   int done = 0;
+   u32 rx_irq_missed;
 
-   done = mlx4_en_process_rx_cq(dev, cq, budget);
+again:
+   rx_irq_missed = READ_ONCE(cq->rx_irq_missed);
+   done += mlx4_en_process_rx_cq(dev, cq, budget - done);
 
/* If we used up all the quota - we're probably not done yet... */
if (done == budget) {
@@ -1171,10 +1178,21 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int 
budget)
 */
if (done)
done--;
+   if (napi_complete_done(napi, done))
+   mlx4_en_arm_cq(priv, cq);
+   return done;
}
-   /* Done for now */
-   if (napi_complete_done(napi, done))
+   if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed))
+   goto again;
+
+   if (napi_complete_done(napi, done)) {
mlx4_en_arm_cq(priv, cq);
+
+   /* We might have received an interrupt too soon */
+   if (unlikely(READ_ONCE(cq->rx_irq_missed) != rx_irq_missed) &&
+   napi_reschedule(napi))
+   goto again;
+   }
return done;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h 
b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 
4941b692e9479bc7800e92f9bfb13baf3ff7d0d4..030f305fcca7fa4b3b9a15be3fe11e572d665003
 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -369,6 +369,7 @@ struct mlx4_en_cq {
int ring;
struct net_device  *dev;
struct napi_struct  napi;
+   u32 rx_irq_missed;
int size;
int buf_size;
int vector;




Re: [PATCH net] net/mlx4_en: fix overflow in mlx4_en_init_timestamp()

2017-02-24 Thread David Miller
From: Eric Dumazet 
Date: Thu, 23 Feb 2017 15:22:43 -0800

> From: Eric Dumazet 
> 
> The cited commit makes a great job of finding optimal shift/multiplier
> values assuming a 10 seconds wrap around, but forgot to change the
> overflow_period computation.
> 
> It overflows in cyclecounter_cyc2ns(), and the final result is 804 ms,
> which is silly.
> 
> Lets simply use 5 seconds, no need to recompute this, given how it is
> supposed to work.
> 
> Later, we will use a timer instead of a work queue, since the new RX
> allocation schem will no longer need mlx4_en_recover_from_oom() and the
> service_task firing every 250 ms.
> 
> Fixes: 31c128b66e5b ("net/mlx4_en: Choose time-stamping shift value according 
> to HW frequency")
> Signed-off-by: Eric Dumazet 

Tariq please review.


[PATCH net] net/mlx4_en: fix overflow in mlx4_en_init_timestamp()

2017-02-23 Thread Eric Dumazet
From: Eric Dumazet 

The cited commit makes a great job of finding optimal shift/multiplier
values assuming a 10 seconds wrap around, but forgot to change the
overflow_period computation.

It overflows in cyclecounter_cyc2ns(), and the final result is 804 ms,
which is silly.

Lets simply use 5 seconds, no need to recompute this, given how it is
supposed to work.

Later, we will use a timer instead of a work queue, since the new RX
allocation schem will no longer need mlx4_en_recover_from_oom() and the
service_task firing every 250 ms.

Fixes: 31c128b66e5b ("net/mlx4_en: Choose time-stamping shift value according 
to HW frequency")
Signed-off-by: Eric Dumazet 
Cc: Tariq Toukan 
Cc: Eugenia Emantayev 
---
 drivers/net/ethernet/mellanox/mlx4/en_clock.c |   18 +++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |1 
 2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_clock.c 
b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
index 
e7b81a305469e64b97f68bc0e2bcb064b78f08fe..024788549c2569a13d3a07ebbde718cccf980a26
 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_clock.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
@@ -89,10 +89,17 @@ void mlx4_en_remove_timestamp(struct mlx4_en_dev *mdev)
}
 }
 
+#define MLX4_EN_WRAP_AROUND_SEC10UL
+/* By scheduling the overflow check every 5 seconds, we have a reasonably
+ * good chance we wont miss a wrap around.
+ * TOTO: Use a timer instead of a work queue to increase the guarantee.
+ */
+#define MLX4_EN_OVERFLOW_PERIOD (MLX4_EN_WRAP_AROUND_SEC * HZ / 2)
+
 void mlx4_en_ptp_overflow_check(struct mlx4_en_dev *mdev)
 {
bool timeout = time_is_before_jiffies(mdev->last_overflow_check +
- mdev->overflow_period);
+ MLX4_EN_OVERFLOW_PERIOD);
unsigned long flags;
 
if (timeout) {
@@ -237,7 +244,6 @@ static const struct ptp_clock_info mlx4_en_ptp_clock_info = 
{
.enable = mlx4_en_phc_enable,
 };
 
-#define MLX4_EN_WRAP_AROUND_SEC10ULL
 
 /* This function calculates the max shift that enables the user range
  * of MLX4_EN_WRAP_AROUND_SEC values in the cycles register.
@@ -258,7 +264,6 @@ void mlx4_en_init_timestamp(struct mlx4_en_dev *mdev)
 {
struct mlx4_dev *dev = mdev->dev;
unsigned long flags;
-   u64 ns, zero = 0;
 
/* mlx4_en_init_timestamp is called for each netdev.
 * mdev->ptp_clock is common for all ports, skip initialization if
@@ -282,13 +287,6 @@ void mlx4_en_init_timestamp(struct mlx4_en_dev *mdev)
 ktime_to_ns(ktime_get_real()));
write_sequnlock_irqrestore(&mdev->clock_lock, flags);
 
-   /* Calculate period in seconds to call the overflow watchdog - to make
-* sure counter is checked at least once every wrap around.
-*/
-   ns = cyclecounter_cyc2ns(&mdev->cycles, mdev->cycles.mask, zero, &zero);
-   do_div(ns, NSEC_PER_SEC / 2 / HZ);
-   mdev->overflow_period = ns;
-
/* Configure the PHC */
mdev->ptp_clock_info = mlx4_en_ptp_clock_info;
snprintf(mdev->ptp_clock_info.name, 16, "mlx4 ptp");
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h 
b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 
4941b692e9479bc7800e92f9bfb13baf3ff7d0d4..3629ce11a68b9dec5c1659539bdc6f2c4e35
 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -430,7 +430,6 @@ struct mlx4_en_dev {
seqlock_t   clock_lock;
struct timecounter  clock;
unsigned long   last_overflow_check;
-   unsigned long   overflow_period;
struct ptp_clock*ptp_clock;
struct ptp_clock_info   ptp_clock_info;
struct notifier_block   nb;





Re: [PATCH net] net/mlx4_en: Fix user prio field in XDP forward

2016-12-22 Thread David Miller
From: Tariq Toukan 
Date: Thu, 22 Dec 2016 14:32:58 +0200

> The user prio field is wrong (and overflows) in the XDP forward
> flow.
> This is a result of a bad value for num_tx_rings_p_up, which should
> account all XDP TX rings, as they operate for the same user prio.
> 
> Signed-off-by: Tariq Toukan 
> Reported-by: Martin KaFai Lau 

Applied.


[PATCH net] net/mlx4_en: Fix user prio field in XDP forward

2016-12-22 Thread Tariq Toukan
The user prio field is wrong (and overflows) in the XDP forward
flow.
This is a result of a bad value for num_tx_rings_p_up, which should
account all XDP TX rings, as they operate for the same user prio.

Signed-off-by: Tariq Toukan 
Reported-by: Martin KaFai Lau 
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index bcd955339058..edbe200ac2fa 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1638,7 +1638,8 @@ int mlx4_en_start_port(struct net_device *dev)
 
/* Configure tx cq's and rings */
for (t = 0 ; t < MLX4_EN_NUM_TX_TYPES; t++) {
-   u8 num_tx_rings_p_up = t == TX ? priv->num_tx_rings_p_up : 1;
+   u8 num_tx_rings_p_up = t == TX ?
+   priv->num_tx_rings_p_up : priv->tx_ring_num[t];
 
for (i = 0; i < priv->tx_ring_num[t]; i++) {
/* Configure cq */
-- 
1.8.3.1



Re: [PATCH net] net/mlx4_en: Free netdev resources under state lock

2016-11-23 Thread David Miller
From: Tariq Toukan 
Date: Tue, 22 Nov 2016 16:20:39 +0200

> Make sure mlx4_en_free_resources is called under the netdev state lock.
> This is needed since RCU dereference of XDP prog should be protected.
> 
> Fixes: 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock")
> Signed-off-by: Tariq Toukan 
> Reported-by: Sagi Grimberg 

Applied.


[PATCH net] net/mlx4_en: Free netdev resources under state lock

2016-11-22 Thread Tariq Toukan
Make sure mlx4_en_free_resources is called under the netdev state lock.
This is needed since RCU dereference of XDP prog should be protected.

Fixes: 326fe02d1ed6 ("net/mlx4_en: protect ring->xdp_prog with rcu_read_lock")
Signed-off-by: Tariq Toukan 
Reported-by: Sagi Grimberg 
CC: Brenden Blanco 
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 3a47e83d3e07..a60f635da78b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -129,6 +129,9 @@ static enum mlx4_net_trans_rule_id 
mlx4_ip_proto_to_trans_rule_id(u8 ip_proto)
}
 };
 
+/* Must not acquire state_lock, as its corresponding work_sync
+ * is done under it.
+ */
 static void mlx4_en_filter_work(struct work_struct *work)
 {
struct mlx4_en_filter *filter = container_of(work,
@@ -2189,13 +2192,13 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
mutex_lock(&mdev->state_lock);
mdev->pndev[priv->port] = NULL;
mdev->upper[priv->port] = NULL;
-   mutex_unlock(&mdev->state_lock);
 
 #ifdef CONFIG_RFS_ACCEL
mlx4_en_cleanup_filters(priv);
 #endif
 
mlx4_en_free_resources(priv);
+   mutex_unlock(&mdev->state_lock);
 
kfree(priv->tx_ring);
kfree(priv->tx_cq);
-- 
1.8.3.1



Re: [PATCH net] net/mlx4_en: fixup xdp tx irq to match rx

2016-10-14 Thread David Miller
From: Brenden Blanco 
Date: Thu, 13 Oct 2016 13:13:11 -0700

> In cases where the number of tx rings is not a multiple of the number of
> rx rings, the tx completion event will be handled on a different core
> from the transmit and population of the ring. Races on the ring will
> lead to a double-free of the page, and possibly other corruption.
> 
> The rings are initialized by default with a valid multiple of rings,
> based on the number of cpus, therefore an invalid configuration requires
> ethtool to change the ring layout. For instance 'ethtool -L eth0 rx 9 tx
> 8' will cause packets received on rx0, and XDP_TX'd to tx48, to be
> completed on cpu3 (48 % 9 == 3).
> 
> Resolve this discrepancy by shifting the irq for the xdp tx queues to
> start again from 0, modulo rx_ring_num.
> 
> Fixes: 9ecc2d86171a ("net/mlx4_en: add xdp forwarding and data write support")
> Reported-by: Jesper Dangaard Brouer 
> Signed-off-by: Brenden Blanco 

Applied and queued up for -stable, thanks.


[PATCH net] net/mlx4_en: fixup xdp tx irq to match rx

2016-10-13 Thread Brenden Blanco
In cases where the number of tx rings is not a multiple of the number of
rx rings, the tx completion event will be handled on a different core
from the transmit and population of the ring. Races on the ring will
lead to a double-free of the page, and possibly other corruption.

The rings are initialized by default with a valid multiple of rings,
based on the number of cpus, therefore an invalid configuration requires
ethtool to change the ring layout. For instance 'ethtool -L eth0 rx 9 tx
8' will cause packets received on rx0, and XDP_TX'd to tx48, to be
completed on cpu3 (48 % 9 == 3).

Resolve this discrepancy by shifting the irq for the xdp tx queues to
start again from 0, modulo rx_ring_num.

Fixes: 9ecc2d86171a ("net/mlx4_en: add xdp forwarding and data write support")
Reported-by: Jesper Dangaard Brouer 
Signed-off-by: Brenden Blanco 
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_cq.c 
b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
index 132cea6..e3be7e4 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -127,7 +127,15 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct 
mlx4_en_cq *cq,
/* For TX we use the same irq per
ring we assigned for the RX*/
struct mlx4_en_cq *rx_cq;
-
+   int xdp_index;
+
+   /* The xdp tx irq must align with the rx ring that forwards to
+* it, so reindex these from 0. This should only happen when
+* tx_ring_num is not a multiple of rx_ring_num.
+*/
+   xdp_index = (priv->xdp_ring_num - priv->tx_ring_num) + cq_idx;
+   if (xdp_index >= 0)
+   cq_idx = xdp_index;
cq_idx = cq_idx % priv->rx_ring_num;
rx_cq = priv->rx_cq[cq_idx];
cq->vector = rx_cq->vector;
-- 
2.9.3



Re: [PATCH net] net/mlx4_en: initialize cmd.context_lock spinlock earlier

2016-06-15 Thread David Miller
From: Eric Dumazet 
Date: Mon, 13 Jun 2016 07:50:25 -0700

> From: Eric Dumazet 
> 
> Maciej Żenczykowski reported lockdep warning a spinlock
> was not registered before being held in mlx4_cmd_wake_completions()
> 
> cmd.context_lock initialization is not at the right place.
> 
> 1) mlx4_cmd_use_events() can be called multiple times.
>Calling spin_lock_init() on a live spinlock can lead
>to hangs.
> 
> 2) mlx4_cmd_wake_completions() can be called while lock
>has not been initialized.
>Lockdep complains, and current logic is not race prone.
> 
> It seems better to move the initialization earlier in
> mlx4_load_one()
> 
> Signed-off-by: Eric Dumazet 
> Reported-by: Maciej Żenczykowski 

Applied.


[PATCH net] net/mlx4_en: initialize cmd.context_lock spinlock earlier

2016-06-13 Thread Eric Dumazet
From: Eric Dumazet 

Maciej Żenczykowski reported lockdep warning a spinlock
was not registered before being held in mlx4_cmd_wake_completions()

cmd.context_lock initialization is not at the right place.

1) mlx4_cmd_use_events() can be called multiple times.
   Calling spin_lock_init() on a live spinlock can lead
   to hangs.

2) mlx4_cmd_wake_completions() can be called while lock
   has not been initialized.
   Lockdep complains, and current logic is not race prone.

It seems better to move the initialization earlier in
mlx4_load_one()

Signed-off-by: Eric Dumazet 
Reported-by: Maciej Żenczykowski 
Cc: Eugenia Emantayev 
Cc: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c  |1 -
 drivers/net/ethernet/mellanox/mlx4/main.c |1 +
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index e94ca1c3fc7c..f04a423ff79d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -2597,7 +2597,6 @@ int mlx4_cmd_use_events(struct mlx4_dev *dev)
priv->cmd.free_head = 0;
 
sema_init(&priv->cmd.event_sem, priv->cmd.max_cmds);
-   spin_lock_init(&priv->cmd.context_lock);
 
for (priv->cmd.token_mask = 1;
 priv->cmd.token_mask < priv->cmd.max_cmds;
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index 12c77a70abdb..372ebfa880f5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -3222,6 +3222,7 @@ static int mlx4_load_one(struct pci_dev *pdev, int 
pci_dev_data,
 
INIT_LIST_HEAD(&priv->pgdir_list);
mutex_init(&priv->pgdir_mutex);
+   spin_lock_init(&priv->cmd.context_lock);
 
INIT_LIST_HEAD(&priv->bf_list);
mutex_init(&priv->bf_mutex);




Re: [PATCH net] net/mlx4_en: Fix endianness bug in IPV6 csum calculation

2016-05-05 Thread David Miller
From: Tariq Toukan 
Date: Wed,  4 May 2016 15:00:33 +0300

> From: Daniel Jurgens 
> 
> Use htons instead of unconditionally byte swapping nexthdr.  On a little
> endian systems shifting the byte is correct behavior, but it results in
> incorrect csums on big endian architectures.
> 
> Fixes: f8c6455bb04b ('net/mlx4_en: Extend checksum offloading by CHECKSUM 
> COMPLETE')
> Signed-off-by: Daniel Jurgens 
> Reviewed-by: Carol Soto 
> Tested-by: Carol Soto 
> Signed-off-by: Tariq Toukan 

Applied and queued up for -stable, thanks.


[PATCH net] net/mlx4_en: Fix endianness bug in IPV6 csum calculation

2016-05-04 Thread Tariq Toukan
From: Daniel Jurgens 

Use htons instead of unconditionally byte swapping nexthdr.  On a little
endian systems shifting the byte is correct behavior, but it results in
incorrect csums on big endian architectures.

Fixes: f8c6455bb04b ('net/mlx4_en: Extend checksum offloading by CHECKSUM 
COMPLETE')
Signed-off-by: Daniel Jurgens 
Reviewed-by: Carol Soto 
Tested-by: Carol Soto 
Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index b723e3b..ca3a384 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -707,7 +707,7 @@ static int get_fixed_ipv6_csum(__wsum hw_checksum, struct 
sk_buff *skb,
 
if (ipv6h->nexthdr == IPPROTO_FRAGMENT || ipv6h->nexthdr == 
IPPROTO_HOPOPTS)
return -1;
-   hw_checksum = csum_add(hw_checksum, (__force __wsum)(ipv6h->nexthdr << 
8));
+   hw_checksum = csum_add(hw_checksum, (__force 
__wsum)htons(ipv6h->nexthdr));
 
csum_pseudo_hdr = csum_partial(&ipv6h->saddr,
   sizeof(ipv6h->saddr) + 
sizeof(ipv6h->daddr), 0);
-- 
1.8.3.1



Re: [PATCH net] net/mlx4_en: fix spurious timestamping callbacks

2016-04-25 Thread David Miller
From: Eric Dumazet 
Date: Sat, 23 Apr 2016 11:35:46 -0700

> From: Eric Dumazet 
> 
> When multiple skb are TX-completed in a row, we might incorrectly keep
> a timestamp of a prior skb and cause extra work.
> 
> Fixes: ec693d47010e8 ("net/mlx4_en: Add HW timestamping (TS) support")
> Signed-off-by: Eric Dumazet 
> Willem de Bruijn 

Applied and queued up for -stable.


Re: [PATCH net] net/mlx4_en: fix spurious timestamping callbacks

2016-04-24 Thread Eran Ben Elisha
On Sat, Apr 23, 2016 at 9:35 PM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> When multiple skb are TX-completed in a row, we might incorrectly keep
> a timestamp of a prior skb and cause extra work.
>
> Fixes: ec693d47010e8 ("net/mlx4_en: Add HW timestamping (TS) support")
> Signed-off-by: Eric Dumazet 
> Willem de Bruijn 

Other than the commit message issue raised by Or,

Reviewed-by: Eran Ben Elisha 


Re: [PATCH net] net/mlx4_en: fix spurious timestamping callbacks

2016-04-23 Thread Eric Dumazet
On Sat, 2016-04-23 at 23:23 +0300, Or Gerlitz wrote:
> On Sat, Apr 23, 2016 at 9:35 PM, Eric Dumazet  wrote:
> > From: Eric Dumazet 
> >
> > When multiple skb are TX-completed in a row, we might incorrectly keep
> > a timestamp of a prior skb and cause extra work.
> >
> > Fixes: ec693d47010e8 ("net/mlx4_en: Add HW timestamping (TS) support")
> > Signed-off-by: Eric Dumazet 
> > Willem de Bruijn 
> 
> Eric, was that supposed to be Reported-by: or Signed-off-by: ...

This was meant to be.

Cc: Willem de Bruijn 

Willem was CC on the original bug at Google (Google-Bug-Id: 28356763 )

Thanks.




Re: [PATCH net] net/mlx4_en: fix spurious timestamping callbacks

2016-04-23 Thread Or Gerlitz
On Sat, Apr 23, 2016 at 9:35 PM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> When multiple skb are TX-completed in a row, we might incorrectly keep
> a timestamp of a prior skb and cause extra work.
>
> Fixes: ec693d47010e8 ("net/mlx4_en: Add HW timestamping (TS) support")
> Signed-off-by: Eric Dumazet 
> Willem de Bruijn 

Eric, was that supposed to be Reported-by: or Signed-off-by: ...

FWIW, Dave, I wanted to use the opportunity and comment that this week
many of us gonna be on Passover vacation, so response rate should be
lower than usual


>  drivers/net/ethernet/mellanox/mlx4/en_tx.c |6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c 
> b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index c0d7b7296236..a386f047c1af 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -405,7 +405,6 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
> u32 packets = 0;
> u32 bytes = 0;
> int factor = priv->cqe_factor;
> -   u64 timestamp = 0;
> int done = 0;
> int budget = priv->tx_work_limit;
> u32 last_nr_txbb;
> @@ -445,9 +444,12 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
> new_index = be16_to_cpu(cqe->wqe_index) & size_mask;
>
> do {
> +   u64 timestamp = 0;
> +
> txbbs_skipped += last_nr_txbb;
> ring_index = (ring_index + last_nr_txbb) & size_mask;
> -   if (ring->tx_info[ring_index].ts_requested)
> +
> +   if (unlikely(ring->tx_info[ring_index].ts_requested))
> timestamp = mlx4_en_get_cqe_ts(cqe);
>
> /* free next descriptor */
>
>


[PATCH net] net/mlx4_en: fix spurious timestamping callbacks

2016-04-23 Thread Eric Dumazet
From: Eric Dumazet 

When multiple skb are TX-completed in a row, we might incorrectly keep
a timestamp of a prior skb and cause extra work.

Fixes: ec693d47010e8 ("net/mlx4_en: Add HW timestamping (TS) support")
Signed-off-by: Eric Dumazet 
Willem de Bruijn 
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index c0d7b7296236..a386f047c1af 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -405,7 +405,6 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
u32 packets = 0;
u32 bytes = 0;
int factor = priv->cqe_factor;
-   u64 timestamp = 0;
int done = 0;
int budget = priv->tx_work_limit;
u32 last_nr_txbb;
@@ -445,9 +444,12 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
new_index = be16_to_cpu(cqe->wqe_index) & size_mask;
 
do {
+   u64 timestamp = 0;
+
txbbs_skipped += last_nr_txbb;
ring_index = (ring_index + last_nr_txbb) & size_mask;
-   if (ring->tx_info[ring_index].ts_requested)
+
+   if (unlikely(ring->tx_info[ring_index].ts_requested))
timestamp = mlx4_en_get_cqe_ts(cqe);
 
/* free next descriptor */




Re: [PATCH net] net/mlx4_en:

2015-09-15 Thread Eric Dumazet

Arg, patch title was meant to be

net/mlx4_en: really allow to change RSS key


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net] net/mlx4_en:

2015-09-15 Thread Eric Dumazet
From: Eric Dumazet 

When changing rss key, we do not want to overwrite user provided key
by the one provided by netdev_rss_key_fill(), which is the host random
key generated at boot time.

Fixes: 947cbb0ac242 ("net/mlx4_en: Support for configurable RSS hash function")
Signed-off-by: Eric Dumazet 
Cc: Eyal Perry 
CC: Amir Vadai 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 4402a1e48c9b..0ce6ffe73ca8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1268,8 +1268,6 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
rss_context->hash_fn = MLX4_RSS_HASH_TOP;
memcpy(rss_context->rss_key, priv->rss_key,
   MLX4_EN_RSS_KEY_SIZE);
-   netdev_rss_key_fill(rss_context->rss_key,
-   MLX4_EN_RSS_KEY_SIZE);
} else {
en_err(priv, "Unknown RSS hash function requested\n");
err = -EINVAL;


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net] net/mlx4_en: Schedule napi when RX buffers allocation fails

2015-04-30 Thread Amir Vadai
From: Ido Shamay 

When system is out of memory, refilling of RX buffers fails while
the driver continue to pass the received packets to the kernel stack.
At some point, when all RX buffers deplete, driver may fall into a
sleep, and not recover when memory for new RX buffers is once again
availible. This is because hardware does not have valid descriptors,
so no interrupt will be generated for the driver to return to work
in napi context. Fix it by schedule the napi poll function from
stats_task delayed workqueue, as long as the allocations fail.

Signed-off-by: Ido Shamay 
Signed-off-by: Amir Vadai 
---
Hi Dave,

This is a bug that our driver had from day one. Please pull it into -stable.

Thanks,
Amir

 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  1 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 26 --
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 +
 3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 0f1afc0..a23a96a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1467,6 +1467,7 @@ static void mlx4_en_service_task(struct work_struct *work)
if (mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_TS)
mlx4_en_ptp_overflow_check(mdev);
 
+   mlx4_en_recover_from_oom(priv);
queue_delayed_work(mdev->workqueue, &priv->service_task,
   SERVICE_TASK_DELAY);
}
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 4fdd3c3..2a77a6b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -244,6 +244,12 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv 
*priv,
return mlx4_en_alloc_frags(priv, rx_desc, frags, ring->page_alloc, gfp);
 }
 
+static inline bool mlx4_en_is_ring_empty(struct mlx4_en_rx_ring *ring)
+{
+   BUG_ON((u32)(ring->prod - ring->cons) > ring->actual_size);
+   return ring->prod == ring->cons;
+}
+
 static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring *ring)
 {
*ring->wqres.db.db = cpu_to_be32(ring->prod & 0x);
@@ -315,8 +321,7 @@ static void mlx4_en_free_rx_buf(struct mlx4_en_priv *priv,
   ring->cons, ring->prod);
 
/* Unmap and free Rx buffers */
-   BUG_ON((u32) (ring->prod - ring->cons) > ring->actual_size);
-   while (ring->cons != ring->prod) {
+   while (!mlx4_en_is_ring_empty(ring)) {
index = ring->cons & ring->size_mask;
en_dbg(DRV, priv, "Processing descriptor:%d\n", index);
mlx4_en_free_rx_desc(priv, ring, index);
@@ -491,6 +496,23 @@ err_allocator:
return err;
 }
 
+/* We recover from out of memory by scheduling our napi poll
+ * function (mlx4_en_process_cq), which tries to allocate
+ * all missing RX buffers (call to mlx4_en_refill_rx_buffers).
+ */
+void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
+{
+   int ring;
+
+   if (!priv->port_up)
+   return;
+
+   for (ring = 0; ring < priv->rx_ring_num; ring++) {
+   if (mlx4_en_is_ring_empty(priv->rx_ring[ring]))
+   napi_reschedule(&priv->rx_cq[ring]->napi);
+   }
+}
+
 void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
 struct mlx4_en_rx_ring **pring,
 u32 size, u16 stride)
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h 
b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 9de3021..d021f07 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -774,6 +774,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
struct mlx4_en_tx_ring *ring);
 void mlx4_en_set_num_rx_rings(struct mlx4_en_dev *mdev);
+void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv);
 int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
   struct mlx4_en_rx_ring **pring,
   u32 size, u16 stride, int node);
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html