Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-06-08 Thread Lino Sanfilippo

Hi Shuyu,

On 05.06.2016 16:02, Shuyu Wei wrote:
> Hi Lino,
> Sorry for my late reply. I retested the previous patch, and it did have
> the same issue. However, the chance of it getting stuck seems lower
> (just my instinct, no evidence).
> 

Thank you for the feedback. It is hard to guess what is still going wrong.
But if you - as I assume - don't see this issue with the original code (as it
is in net-next), there is still something wrong with the changes we made. So
we should probably leave the code as it is for now.

Regards,
Lino  



Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-06-05 Thread Shuyu Wei
On Mon, May 30, 2016 at 11:41:22PM +0200, Lino Sanfilippo wrote:
> 
> Did you see the same issues with the patch before (the one that, as you wrote,
> survived a whole night of stress testing)?
> 
> Lino

Hi Lino,
Sorry for my late reply. I retested the previous patch, and it did have
the same issue. However, the chance of it getting stuck seems lower
(just my instinct, no evidence).


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-30 Thread Lino Sanfilippo
Hi Shuyu,

On 28.05.2016 08:43, Shuyu Wei wrote:
> 
> After some stress testing, it worked well most of the time.
> But there is a chance that it may get stuck when I use 2 nc processes
> to send TCP packets at full speed. Only the arrival of a new rx packet
> can trigger it to run again. This happens only once every several
> hours. No problem in UDP mode. I'm not sure if it's related to the tx
> code in the driver.
> 

This sounds strange. One reason I could imagine for such an issue is that
occasionally tx completion interrupts get lost: in this case skbs may not be
freed (or be freed too late), which may result in a TCP stream getting stuck,
because each skb belongs to a socket and there is a limit for the allowed
number of skbs per socket. TCP would only proceed if the number of pending
skbs falls under this limit again, but since the tx completion irq is lost,
the only thing that could trigger the tx completion handler is another irq,
e.g. for a received packet.
At least this could explain what you observed.
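For context, skbs in this driver are only freed from the NAPI poll handler,
and NAPI is only scheduled from the interrupt handler. A much simplified
sketch of that path (function names as in emac_main.c; bodies abridged and
hypothetical in detail):

	static irqreturn_t arc_emac_intr(int irq, void *dev_instance)
	{
		struct net_device *ndev = dev_instance;
		struct arc_emac_priv *priv = netdev_priv(ndev);

		/* Any interrupt cause (RX or TX completion) funnels into NAPI */
		napi_schedule(&priv->napi);
		return IRQ_HANDLED;
	}

	static int arc_emac_poll(struct napi_struct *napi, int budget)
	{
		struct net_device *ndev = napi->dev;

		/* Pending tx skbs are only freed here; if the TX-completion
		 * interrupt is lost, this work waits for the next interrupt,
		 * e.g. for a received packet.
		 */
		arc_emac_tx_clean(ndev);

		/* RX processing and napi_complete() omitted in this sketch */
		return 0;
	}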

Did you see the same issues with the patch before (the one that, as you wrote,
survived a whole night of stress testing)?

Lino


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-28 Thread Shuyu Wei
On Wed, May 25, 2016 at 01:56:20AM +0200, Lino Sanfilippo wrote:
> Francois, Shuyu,
> 
> this is the patch with the discussed changes.
> 
> Shuyu it would be great if you could test this one. If it passes
> and there are no further objections I will resend it as a regular patch
> (including commit message, etc.) to the mailing list.
> 
> 
> diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
> index a3a9392..ec656b3 100644
> --- a/drivers/net/ethernet/arc/emac_main.c
> +++ b/drivers/net/ethernet/arc/emac_main.c
> @@ -153,18 +153,29 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>  {
>   struct arc_emac_priv *priv = netdev_priv(ndev);
>   struct net_device_stats *stats = &ndev->stats;
> + unsigned int curr = priv->txbd_curr;
>   unsigned int i;
>  
> + /* Make sure buffers and txbd_curr are consistent */
> + smp_rmb();
> +
>   for (i = 0; i < TX_BD_NUM; i++) {
>   unsigned int *txbd_dirty = &priv->txbd_dirty;
>   struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
>   struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
> - struct sk_buff *skb = tx_buff->skb;
> - unsigned int info = le32_to_cpu(txbd->info);
> + unsigned int info;
> + struct sk_buff *skb;
>  
> - if ((info & FOR_EMAC) || !txbd->data || !skb)
> + if (*txbd_dirty == curr)
>   break;
>  
> + info = le32_to_cpu(txbd->info);
> +
> + if (info & FOR_EMAC)
> + break;
> +
> + skb = tx_buff->skb;
> +
>   if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
>   stats->tx_errors++;
>   stats->tx_dropped++;
> @@ -195,8 +206,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>   *txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
>   }
>  
> - /* Ensure that txbd_dirty is visible to tx() before checking
> -  * for queue stopped.
> + /* Ensure that txbd_dirty is visible to tx() and we see the most recent
> +  * value for txbd_curr.
>*/
>   smp_mb();
>  
> @@ -680,27 +691,24 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
>   dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
>  
>   priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
> -
> - /* Make sure pointer to data buffer is set */
> - wmb();
> + priv->tx_buff[*txbd_curr].skb = skb;
>  
>   skb_tx_timestamp(skb);
>  
>   *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
>  
> - /* Make sure info word is set */
> + /* 1. Make sure that with respect to tx_clean everything is set up
> +  * properly before we advance txbd_curr.
> +  * 2. Make sure writes to DMA descriptors are completed before we inform
> +  * the hardware.
> +  */
>   wmb();
>  
> - priv->tx_buff[*txbd_curr].skb = skb;
> -
>   /* Increment index to point to the next BD */
>   *txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
>  
> - /* Ensure that tx_clean() sees the new txbd_curr before
> -  * checking the queue status. This prevents an unneeded wake
> -  * of the queue in tx_clean().
> -  */
> - smp_mb();
> + /* Ensure tx_clean() sees the updated value of txbd_curr */
> + smp_wmb();
>  
>   if (!arc_emac_tx_avail(priv)) {
>   netif_stop_queue(ndev);

After some stress testing, it worked well most of the time.
But there is a chance that it may get stuck when I use 2 nc processes
to send TCP packets at full speed. Only the arrival of a new rx packet
can trigger it to run again. This happens only once every several
hours. No problem in UDP mode. I'm not sure if it's related to the tx
code in the driver.


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-24 Thread Lino Sanfilippo
Francois, Shuyu,

this is the patch with the discussed changes.

Shuyu it would be great if you could test this one. If it passes
and there are no further objections I will resend it as a regular patch
(including commit message, etc.) to the mailing list.


diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
index a3a9392..ec656b3 100644
--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -153,18 +153,29 @@ static void arc_emac_tx_clean(struct net_device *ndev)
 {
struct arc_emac_priv *priv = netdev_priv(ndev);
	struct net_device_stats *stats = &ndev->stats;
+   unsigned int curr = priv->txbd_curr;
unsigned int i;
 
+   /* Make sure buffers and txbd_curr are consistent */
+   smp_rmb();
+
for (i = 0; i < TX_BD_NUM; i++) {
		unsigned int *txbd_dirty = &priv->txbd_dirty;
		struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
		struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
-   struct sk_buff *skb = tx_buff->skb;
-   unsigned int info = le32_to_cpu(txbd->info);
+   unsigned int info;
+   struct sk_buff *skb;
 
-   if ((info & FOR_EMAC) || !txbd->data || !skb)
+   if (*txbd_dirty == curr)
break;
 
+   info = le32_to_cpu(txbd->info);
+
+   if (info & FOR_EMAC)
+   break;
+
+   skb = tx_buff->skb;
+
if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
stats->tx_errors++;
stats->tx_dropped++;
@@ -195,8 +206,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
*txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
}
 
-   /* Ensure that txbd_dirty is visible to tx() before checking
-* for queue stopped.
+   /* Ensure that txbd_dirty is visible to tx() and we see the most recent
+* value for txbd_curr.
 */
smp_mb();
 
@@ -680,27 +691,24 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
	dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
 
priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
-
-   /* Make sure pointer to data buffer is set */
-   wmb();
+   priv->tx_buff[*txbd_curr].skb = skb;
 
skb_tx_timestamp(skb);
 
*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
 
-   /* Make sure info word is set */
+   /* 1. Make sure that with respect to tx_clean everything is set up
+* properly before we advance txbd_curr.
+* 2. Make sure writes to DMA descriptors are completed before we inform
+* the hardware.
+*/
wmb();
 
-   priv->tx_buff[*txbd_curr].skb = skb;
-
/* Increment index to point to the next BD */
*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
 
-   /* Ensure that tx_clean() sees the new txbd_curr before
-* checking the queue status. This prevents an unneeded wake
-* of the queue in tx_clean().
-*/
-   smp_mb();
+   /* Ensure tx_clean() sees the updated value of txbd_curr */
+   smp_wmb();
 
if (!arc_emac_tx_avail(priv)) {
netif_stop_queue(ndev);
-- 
1.9.1






Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-24 Thread Francois Romieu
Lino Sanfilippo  :
[...]
> I don't agree here. A dma_wmb would have an effect on "data" and "info", yes,
> but it would have absolutely no effect on skb_tx_timestamp(), since there
> is no dma access involved here. In fact skb_tx_timestamp() could probably
> even be reordered to appear after the dma_wmb.
>
> Anyway, there is the wmb() directly after the assignment to "info". _This_
> barrier should ensure that skb_tx_timestamp() (along with a flush of data
> and info to DMA) is executed before "txbd_curr" is advanced.
> This means that the corresponding skb can't be freed prematurely by tx_clean().

The concern here is about sending adequate PTP payload on the network.

skb_tx_timestamp() is used for network clock synchronization. Some extra
information must be transmitted. Be it through direct payload change
or through indirect control, it _is_ related to dma.

Several (most ?) skb_tx_timestamp() misuses blur the picture: CPU vs device
sync is of course way below the radar when the driver crashes because of
plain use-after-free skb_tx_timestamp() :o/

-- 
Ueimor


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-23 Thread Lino Sanfilippo
On 23.05.2016 13:36, Shuyu Wei wrote:
> On Sun, May 22, 2016 at 01:30:27PM +0200, Lino Sanfilippo wrote:
>> 

> 
> Hi Lino, 
> This patch worked after a whole night of stress testing.
> 

That's great! I will nevertheless make the changes discussed with Francois,
and hopefully we will have a final solution soon.

Regards,
Lino




Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-23 Thread Lino Sanfilippo
On 23.05.2016 00:36, Francois Romieu wrote:
> Lino Sanfilippo  :
> [...]
>> --- a/drivers/net/ethernet/arc/emac_main.c
>> +++ b/drivers/net/ethernet/arc/emac_main.c
>> @@ -159,12 +159,22 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>>  unsigned int *txbd_dirty = &priv->txbd_dirty;
>>  struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
>>  struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
>> -struct sk_buff *skb = tx_buff->skb;
>>  unsigned int info = le32_to_cpu(txbd->info);
>> +struct sk_buff *skb;
>>  
> 
> Insert a smp_rmb() here to close one window for an outdated txbd_dirty value
> (the "arc_emac_tx_clean wrote txbd_curr and issued smp_wmb" one).
> 
> Actually, insert smp_rmb() at the start of arc_emac_tx_clean() as it
> does not need to be performed within the loop, where it would penalize it.

I agree, we should place the barrier before the loop. I also think it is
indeed more appropriate to use the SMP versions for both barriers.

> 
> Given the implicit smp barriers in the non-driver code, I consider
> "arc_emac_tx_clean on one CPU does not read latest txbd_dirty value written
> by previous arc_emac_tx_clean on different CPU" as utter onanism but
> issuing smp_rmb() at the start of arc_emac_tx_clean() nails it as well.
> 
>> -if ((info & FOR_EMAC) || !txbd->data || !skb)
>> +if (*txbd_dirty == priv->txbd_curr)
>>  break;
> 
> Ok, it's just the "while (priv->txbd_dirty != priv->txbd_curr) {" loop
> in disguise.

I can't deny that :)

> 
>>  
>> +/* Make sure curr pointer is consistent with info */
>> +rmb();
>> +
>> +info = le32_to_cpu(txbd->info);
>> +
>> +if (info & FOR_EMAC)
>> +break;
> 
> With proper ordering + barrier in arc_emac_tx() you can relax it to smp_rmb().

Ok.

>> +
>> +skb = tx_buff->skb;
>> +
>>  if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
>>  stats->tx_errors++;
>>  stats->tx_dropped++;
>> @@ -195,8 +205,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>>  *txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
>>  }
>>  
>> -/* Ensure that txbd_dirty is visible to tx() before checking
>> - * for queue stopped.
>> +/* Ensure that txbd_dirty is visible to tx() and we see the most recent
>> + * value for txbd_curr.
>>   */
>>  smp_mb();
>>  
>> @@ -680,35 +690,29 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
>>  dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
>>  
>>  priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
>> -
>> -/* Make sure pointer to data buffer is set */
>> -wmb();
>> +priv->tx_buff[*txbd_curr].skb = skb;
>>  
>>  skb_tx_timestamp(skb);
>>  
>>  *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
> 
> No.
> 
> You need dma_wmb() after skb_tx_timestamp() to commit skb_tx_timestamp() [*]
> and data = cpu_to_le32(addr).

I don't agree here. A dma_wmb would have an effect on "data" and "info", yes,
but it would have absolutely no effect on skb_tx_timestamp(), since there is
no dma access involved here. In fact skb_tx_timestamp() could probably even
be reordered to appear after the dma_wmb.
Anyway, there is the wmb() directly after the assignment to "info". _This_
barrier should ensure that skb_tx_timestamp() (along with a flush of data and
info to DMA) is executed before "txbd_curr" is advanced.
This means that the corresponding skb can't be freed prematurely by tx_clean().

 
> 
> [*] I doubt anyone want a dma_sync_single_...() here.
> 
>>  
>> -/* Make sure info word is set */
>> +/* 1. Make sure that with respect to tx_clean everything is set up
>> + * properly before we advance txbd_curr.
>> + * 2. Make sure writes to DMA descriptors are completed before we inform
>> + * the hardware.
>> + */
>>  wmb();
> 
> Yes, either wmb() or smp_wmb() + dma_wmb().
> 

I really prefer one generic barrier over combos of 2 or more special barriers.

>>  
>> -priv->tx_buff[*txbd_curr].skb = skb;
>> -
>>  /* Increment index to point to the next BD */
>>  *txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
>>  
>> -/* Ensure that tx_clean() sees the new txbd_curr before
>> - * checking the queue status. This prevents an unneeded wake
>> - * of the queue in tx_clean().
>> +/* Ensure we see the most recent value of txbd_dirty and tx_clean() sees
>> + * the updated value of txbd_curr.
>>   */
>>  smp_mb();
> 
> Nit: s/the most/a/
> 
> "a" as in "arc_emac_tx_clean() _is_ racing with arc_emac_tx"
> 
>>  
>> -if (!arc_emac_tx_avail(priv)) {
>> +if (!arc_emac_tx_avail(priv))
>>  netif_stop_queue(ndev);
>> -/* Refresh tx_dirty */
>> -smp_mb();
>> -if (arc_emac_tx_avail(priv))
>> -

Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-23 Thread Shuyu Wei
On Sun, May 22, 2016 at 01:30:27PM +0200, Lino Sanfilippo wrote:
> 
> Thanks for testing. However that extra check for skb not being NULL should
> not be necessary if the code were correct. The changes I suggested were all
> about having skb and info consistent with txbd_curr.
> But I just realized that there is still a big flaw in the last changes.
> While tx() looks correct now (we first set up the descriptor and assign the
> skb and _then_ advance txbd_curr), tx_clean still is not:
> 
> We _first_ have to read txbd_curr and _then_ read the corresponding
> descriptor and its skb. (The last patch implemented just the reverse - and
> thus wrong - order: first get skb and descriptor, and then read txbd_curr.)
> 
> So the patch below hopefully handles tx_clean correctly as well. Could you
> please do once more a test with this one?

Hi Lino, 
This patch worked after a whole night of stress testing.

> > 
> > After further testing, my patch to barrier timestamp() didn't work.
> > Just like the original code in the tree, the emac still got stuck under
> > high load, even if I changed the smp_wmb() to dma_wmb(). So the original
> > code does have a race somewhere.
> 
> So to make this clear: with the current code in net-next you still see a 
> problem (lockup), right?
Yes, I mean the mainline kernel, which should be the same as net-next.


> > ... and why Francois' fix worked. Please be patient with me :-).
> 
> So which fix(es) exactly work for you and solve your lockup issue?
I mean the patch below, starting this thread.

diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
index a3a9392..df3dfef 100644
--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -153,9 +153,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
 {
struct arc_emac_priv *priv = netdev_priv(ndev);
	struct net_device_stats *stats = &ndev->stats;
-   unsigned int i;
 
-   for (i = 0; i < TX_BD_NUM; i++) {
+   while (priv->txbd_dirty != priv->txbd_curr) {
		unsigned int *txbd_dirty = &priv->txbd_dirty;
		struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
		struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
@@ -685,13 +684,15 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
wmb();
 
skb_tx_timestamp(skb);
+   priv->tx_buff[*txbd_curr].skb = skb;
+
+   dma_wmb();
 
*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
 
/* Make sure info word is set */
wmb();
 
-   priv->tx_buff[*txbd_curr].skb = skb;
 
/* Increment index to point to the next BD */
*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;




Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-22 Thread Francois Romieu
Lino Sanfilippo  :
[...]
> --- a/drivers/net/ethernet/arc/emac_main.c
> +++ b/drivers/net/ethernet/arc/emac_main.c
> @@ -159,12 +159,22 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>   unsigned int *txbd_dirty = &priv->txbd_dirty;
>   struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
>   struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
> - struct sk_buff *skb = tx_buff->skb;
>   unsigned int info = le32_to_cpu(txbd->info);
> + struct sk_buff *skb;
>  

Insert a smp_rmb() here to close one window for an outdated txbd_dirty value
(the "arc_emac_tx_clean wrote txbd_curr and issued smp_wmb" one).

Actually, insert smp_rmb() at the start of arc_emac_tx_clean() as it
does not need to be performed within the loop, where it would penalize it.

Given the implicit smp barriers in the non-driver code, I consider
"arc_emac_tx_clean on one CPU does not read latest txbd_dirty value written
by previous arc_emac_tx_clean on different CPU" as utter onanism but
issuing smp_rmb() at the start of arc_emac_tx_clean() nails it as well.
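
Putting both remarks together, the read side would look like the sketch
below (illustrative only; it matches the shape of the patch Lino posts
later in the thread):

	/* 1. snapshot the producer index */
	unsigned int curr = priv->txbd_curr;

	/* 2. pairs with the write barrier before the txbd_curr update
	 * in tx(): everything published before that update is visible
	 */
	smp_rmb();

	/* 3. only walk slots that were published via txbd_curr */
	while (priv->txbd_dirty != curr) {
		struct arc_emac_bd *txbd = &priv->txbd[priv->txbd_dirty];

		if (le32_to_cpu(txbd->info) & FOR_EMAC)
			break;	/* device still owns this descriptor */

		/* info, data and skb are consistent here: unmap, free
		 * the skb, then advance the consumer index
		 */
		priv->txbd_dirty = (priv->txbd_dirty + 1) % TX_BD_NUM;
	}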

> - if ((info & FOR_EMAC) || !txbd->data || !skb)
> + if (*txbd_dirty == priv->txbd_curr)
>   break;

Ok, it's just the "while (priv->txbd_dirty != priv->txbd_curr) {" loop
in disguise.

>  
> + /* Make sure curr pointer is consistent with info */
> + rmb();
> +
> + info = le32_to_cpu(txbd->info);
> +
> + if (info & FOR_EMAC)
> + break;

With proper ordering + barrier in arc_emac_tx() you can relax it to smp_rmb().

> +
> + skb = tx_buff->skb;
> +
>   if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
>   stats->tx_errors++;
>   stats->tx_dropped++;
> @@ -195,8 +205,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>   *txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
>   }
>  
> - /* Ensure that txbd_dirty is visible to tx() before checking
> -  * for queue stopped.
> + /* Ensure that txbd_dirty is visible to tx() and we see the most recent
> +  * value for txbd_curr.
>*/
>   smp_mb();
>  
> @@ -680,35 +690,29 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
>   dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
>  
>   priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
> -
> - /* Make sure pointer to data buffer is set */
> - wmb();
> + priv->tx_buff[*txbd_curr].skb = skb;
>  
>   skb_tx_timestamp(skb);
>  
>   *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);

No.

You need dma_wmb() after skb_tx_timestamp() to commit skb_tx_timestamp() [*]
and data = cpu_to_le32(addr).

[*] I doubt anyone want a dma_sync_single_...() here.

>  
> - /* Make sure info word is set */
> + /* 1. Make sure that with respect to tx_clean everything is set up
> +  * properly before we advance txbd_curr.
> +  * 2. Make sure writes to DMA descriptors are completed before we inform
> +  * the hardware.
> +  */
>   wmb();

Yes, either wmb() or smp_wmb() + dma_wmb().

>  
> - priv->tx_buff[*txbd_curr].skb = skb;
> -
>   /* Increment index to point to the next BD */
>   *txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
>  
> - /* Ensure that tx_clean() sees the new txbd_curr before
> -  * checking the queue status. This prevents an unneeded wake
> -  * of the queue in tx_clean().
> + /* Ensure we see the most recent value of txbd_dirty and tx_clean() sees
> +  * the updated value of txbd_curr.
>*/
>   smp_mb();

Nit: s/the most/a/

"a" as in "arc_emac_tx_clean() _is_ racing with arc_emac_tx"

>  
> - if (!arc_emac_tx_avail(priv)) {
> + if (!arc_emac_tx_avail(priv))
>   netif_stop_queue(ndev);
> - /* Refresh tx_dirty */
> - smp_mb();
> - if (arc_emac_tx_avail(priv))
> - netif_start_queue(ndev);
> - }

No.

I may sound like an old record but the revalidation part must be kept.

txbd_dirty may change in the arc_emac_tx_avail.. netif_stop_queue window
(the race requires a different CPU as arc_emac_tx() runs in locally
disabled BH context).

-- 
Ueimor


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-22 Thread Francois Romieu
Lino Sanfilippo  :
> On 21.05.2016 21:47, Francois Romieu wrote:
> > Shuyu Wei  :
> > [...]
> >> diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
> >> index a3a9392..c2447b0 100644
> >> --- a/drivers/net/ethernet/arc/emac_main.c
> >> +++ b/drivers/net/ethernet/arc/emac_main.c
> >> @@ -686,6 +686,9 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
> >>  
> >> skb_tx_timestamp(skb);
> >>  
> >> +   /* Make sure timestamp is set */
> >> +   smp_wmb();
> > 
> > Should be dma_wmb() (see davem's message).
> > 
> > It's completely unrelated to SMP.
> > 
> 
> It's also completely unrelated to dma, so I doubt that this is what davem meant.

It's related to dma: nobody wants the device to perform dma from memory
while the CPU is writing timestamp.

The device must enforce a commit of any network buffer memory write before
releasing control, i.e. before writing *info.

See 'git log -p --grep=dma_[rw]mb': it appears several times.

> As far as I understood he was referring to the dma descriptor.

If the wmb() between *info = ... and *txbd_curr = ... is replaced by a
dma_wmb(), the device will see a consistent descriptor when it performs
a DMA read after the CPU wrote into the mailbox (arc_reg_set(..., TXPL_MASK)).

Ok, I agree on this one.

However, it doesn't help with the (SMP) requirement that no CPU sees
the txbd_curr write before *info is written by the local CPU. Afaiui
dma_wmb() is too weak for this part. If we don't want wmb() here,
it will have to be dma_wmb() + smp_wmb().
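
Spelled out, that variant of arc_emac_tx() would look like this (a sketch,
not a tested patch):

	*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);

	/* Device: descriptor writes must be ordered so the EMAC sees a
	 * consistent descriptor once it is kicked via the mailbox write
	 * (arc_reg_set(..., TXPL_MASK)).
	 */
	dma_wmb();

	/* Other CPUs: tx_clean() must not observe the txbd_curr update
	 * below before the descriptor and skb writes above; dma_wmb()
	 * alone is too weak for that.
	 */
	smp_wmb();

	/* Increment index to point to the next BD */
	*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;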

-- 
Ueimor


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-22 Thread Lino Sanfilippo
On 22.05.2016 11:17, Shuyu Wei wrote:

> Hi Lino,
> 
> I tested this patch; it still panicked under stress.
> Just downloading 2 large files simultaneously with wget made it fail.
> 
> Looks like the problem comes from the if statement in tx_clean().
> I changed your patch to use 
> 
> -   if (info & FOR_EMAC)
> +   if ((info & FOR_EMAC) || !txbd->data || !skb)
> 
> and it worked. 

Thanks for testing. However that extra check for skb not being NULL should
not be necessary if the code were correct. The changes I suggested were all
about having skb and info consistent with txbd_curr.
But I just realized that there is still a big flaw in the last changes. While
tx() looks correct now (we first set up the descriptor and assign the skb and
_then_ advance txbd_curr), tx_clean still is not:

We _first_ have to read txbd_curr and _then_ read the corresponding
descriptor and its skb. (The last patch implemented just the reverse - and
thus wrong - order: first get skb and descriptor, and then read txbd_curr.)

So the patch below hopefully handles tx_clean correctly as well. Could you
please do once more a test with this one?


> 
> After further testing, my patch to barrier timestamp() didn't work.
> Just like the original code in the tree, the emac still got stuck under
> high load, even if I changed the smp_wmb() to dma_wmb(). So the original
> code does have a race somewhere.

So to make this clear: with the current code in net-next you still see a problem (lockup), right?

> 
> I'm new to kernel development, and still trying to understand how memory
> barriers work

It's an interesting topic and that's what I am trying to understand, too :)


> ... and why Francois' fix worked. Please be patient with me :-).

So which fix(es) exactly work for you and solve your lockup issue?


--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -159,12 +159,22 @@ static void arc_emac_tx_clean(struct net_device *ndev)
		unsigned int *txbd_dirty = &priv->txbd_dirty;
		struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
		struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
-   struct sk_buff *skb = tx_buff->skb;
unsigned int info = le32_to_cpu(txbd->info);
+   struct sk_buff *skb;
 
-   if ((info & FOR_EMAC) || !txbd->data || !skb)
+   if (*txbd_dirty == priv->txbd_curr)
break;
 
+   /* Make sure curr pointer is consistent with info */
+   rmb();
+
+   info = le32_to_cpu(txbd->info);
+
+   if (info & FOR_EMAC)
+   break;
+
+   skb = tx_buff->skb;
+
if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
stats->tx_errors++;
stats->tx_dropped++;
@@ -195,8 +205,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
*txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
}
 
-   /* Ensure that txbd_dirty is visible to tx() before checking
-* for queue stopped.
+   /* Ensure that txbd_dirty is visible to tx() and we see the most recent
+* value for txbd_curr.
 */
smp_mb();
 
@@ -680,35 +690,29 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
	dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
 
priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
-
-   /* Make sure pointer to data buffer is set */
-   wmb();
+   priv->tx_buff[*txbd_curr].skb = skb;
 
skb_tx_timestamp(skb);
 
*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
 
-   /* Make sure info word is set */
+   /* 1. Make sure that with respect to tx_clean everything is set up
+* properly before we advance txbd_curr.
+* 2. Make sure writes to DMA descriptors are completed before we inform
+* the hardware.
+*/
wmb();
 
-   priv->tx_buff[*txbd_curr].skb = skb;
-
/* Increment index to point to the next BD */
*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
 
-   /* Ensure that tx_clean() sees the new txbd_curr before
-* checking the queue status. This prevents an unneeded wake
-* of the queue in tx_clean().
+   /* Ensure we see the most recent value of txbd_dirty and tx_clean() sees
+* the updated value of txbd_curr.
 */
smp_mb();
 
-   if (!arc_emac_tx_avail(priv)) {
+   if (!arc_emac_tx_avail(priv))
netif_stop_queue(ndev);
-   /* Refresh tx_dirty */
-   smp_mb();
-   if (arc_emac_tx_avail(priv))
-   netif_start_queue(ndev);
-   }
 
arc_reg_set(priv, R_STATUS, TXPL_MASK);
 
-- 
1.9.1





Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-22 Thread Shuyu Wei
On Sun, May 22, 2016 at 12:58:55AM +0200, Lino Sanfilippo wrote:
> --- a/drivers/net/ethernet/arc/emac_main.c
> +++ b/drivers/net/ethernet/arc/emac_main.c
> @@ -162,7 +162,13 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>   struct sk_buff *skb = tx_buff->skb;
>   unsigned int info = le32_to_cpu(txbd->info);
>  
> - if ((info & FOR_EMAC) || !txbd->data || !skb)
> + if (info & FOR_EMAC)
> + break;
> +
> + /* Make sure curr pointer is consistent with info */
> + rmb();
> +
> + if (*txbd_dirty == priv->txbd_curr)
>   break;
>  
>   if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
> @@ -195,8 +201,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>   *txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
>   }
>  
> - /* Ensure that txbd_dirty is visible to tx() before checking
> -  * for queue stopped.
> + /* Ensure that txbd_dirty is visible to tx() and we see the most recent
> +  * value for txbd_curr.
>*/
>   smp_mb();
>  
> @@ -680,35 +686,29 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
>   dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
>  
>   priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
> -
> - /* Make sure pointer to data buffer is set */
> - wmb();
> + priv->tx_buff[*txbd_curr].skb = skb;
>  
>   skb_tx_timestamp(skb);
>  
>   *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
>  
> - /* Make sure info word is set */
> + /* 1. Make sure that with respect to tx_clean everything is set up
> +  * properly before we advance txbd_curr.
> +  * 2. Make sure writes to DMA descriptors are completed before we inform
> +  * the hardware.
> +  */
>   wmb();
>  
> - priv->tx_buff[*txbd_curr].skb = skb;
> -
>   /* Increment index to point to the next BD */
>   *txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
>  
> - /* Ensure that tx_clean() sees the new txbd_curr before
> -  * checking the queue status. This prevents an unneeded wake
> -  * of the queue in tx_clean().
> + /* Ensure we see the most recent value of txbd_dirty and tx_clean() sees
> +  * the updated value of txbd_curr.
>*/
>   smp_mb();
>  
> - if (!arc_emac_tx_avail(priv)) {
> + if (!arc_emac_tx_avail(priv))
>   netif_stop_queue(ndev);
> - /* Refresh tx_dirty */
> - smp_mb();
> - if (arc_emac_tx_avail(priv))
> - netif_start_queue(ndev);
> - }
>  
>   arc_reg_set(priv, R_STATUS, TXPL_MASK);
>  

Hi Lino,

I tested this patch; it still panicked under stress.
Just downloading 2 large files simultaneously with wget made it fail.

Looks like the problem comes from the if statement in tx_clean().
I changed your patch to use 

-   if (info & FOR_EMAC)
+   if ((info & FOR_EMAC) || !txbd->data || !skb)

and it worked. 

After further testing, my patch to barrier timestamp() didn't work.
Just like the original code in the tree, the emac still got stuck under
high load, even if I changed the smp_wmb() to dma_wmb(). So the original
code does have a race somewhere.

I'm new to kernel development, and still trying to understand how memory
barriers work and why Francois' fix worked. Please be patient with me :-).

---

[  138.501355] Unable to handle kernel NULL pointer dereference at virtual address 00a8
[  138.509482] pgd = c0004000
[  138.512200] [00a8] *pgd=
[  138.515850] Internal error: Oops: 5 [#1] SMP ARM
[  138.520476] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.0+ #98
[  138.526477] Hardware name: Rockchip (Device Tree)
[  138.531180] task: c0804e00 ti: c080 task.ti: c080
[  138.536588] PC is at __dev_kfree_skb_irq+0xc/0x8c
[  138.541295] LR is at arc_emac_poll+0x94/0x594
[  138.545653] pc : []lr : []psr: 60050113
[  138.545653] sp : c0801d88  ip : c0801da0  fp : c0801d9c
[  138.557118] r10: ee0a  r9 : ee0a1000  r8 : 01f8
[  138.562340] r7 :   r6 : 00fc  r5 : f08d0400  r4 : 03f0
[  138.568861] r3 : 0001  r2 : 0042  r1 : 0001  r0 : 
[  138.575385] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[  138.582515] Control: 10c5387d  Table: 8da4004a  DAC: 0051
[  138.588256] Process swapper/0 (pid: 0, stack limit = 0xc0800210)
[  138.594258] Stack: (0xc0801d88 to 0xc0802000)
[  138.598617] 1d80:   03f0 f08d0400 c0801df4 c0801da0 
c03c5e64 c046f590
[  138.606794] 1da0:  c0171050 c0801ddc ee0a04a8 c016504c 0028 
c0806614 f08d05f8
[  138.614969] 1dc0: ee0a02a8 0080 f0803100 ee0a04a8 c03c5dd0 0028 
012c 623d
[  138.623144] 1de0: c0802100 c0801e18 c0801e54 c0801df8 c0471c04 c03c5ddc 
c0801eb4 c06e5ae8
[  138.631318] 1e00: c08029f4 c08029f4 c083d821 2e86f000 c07486c0 eefb76c0 
c0801e18 c0801e18
[  

Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-21 Thread Lino Sanfilippo
On 21.05.2016 21:47, Francois Romieu wrote:
> Shuyu Wei  :
> [...]
>> diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
>> index a3a9392..c2447b0 100644
>> --- a/drivers/net/ethernet/arc/emac_main.c
>> +++ b/drivers/net/ethernet/arc/emac_main.c
>> @@ -686,6 +686,9 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
>>  
>> skb_tx_timestamp(skb);
>>  
>> +   /* Make sure timestamp is set */
>> +   smp_wmb();
> 
> Should be dma_wmb() (see davem's message).
> 
> It's completely unrelated to SMP.
> 

It's also completely unrelated to dma, so I doubt that this is what davem meant.
As far as I understood he was referring to the dma descriptor.

Lino


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-21 Thread Lino Sanfilippo

On 21.05.2016 18:09, Shuyu Wei wrote:
> Looks like I got it wrong in the first place.
> 
> priv->tx_buff is not for the device, so there's no need to move it.
> The race has been fixed by commit c278c253f3d9, I forgot to check
> it out. That's my fault.
> 
> I do find another problem. We need to use a barrier to make sure
> skb_tx_timestamp() is called before setting the FOR_EMAC flag.
> 

Shuyu,

Could you please test the patch below? I implemented a new approach
in which tx_clean uses txbd_curr to determine if there are more 
descriptors to check or if the loop can be left. 
Memory barriers on both sides (xmit and clean) should ensure that 
SKB and info are only accessed if they are valid.
I also hope that skb_tx_timestamp is not an issue any more.


BTW: according to get_maintainer, Alexander Kochetkov should be CCed for
modifications in this area, so that's what I do hereby.

--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -162,7 +162,13 @@ static void arc_emac_tx_clean(struct net_device *ndev)
struct sk_buff *skb = tx_buff->skb;
unsigned int info = le32_to_cpu(txbd->info);
 
-   if ((info & FOR_EMAC) || !txbd->data || !skb)
+   if (info & FOR_EMAC)
+   break;
+
+   /* Make sure curr pointer is consistent with info */
+   rmb();
+
+   if (*txbd_dirty == priv->txbd_curr)
break;
 
if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
@@ -195,8 +201,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
*txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
}
 
-   /* Ensure that txbd_dirty is visible to tx() before checking
-* for queue stopped.
+   /* Ensure that txbd_dirty is visible to tx() and we see the most recent
+* value for txbd_curr.
 */
smp_mb();
 
@@ -680,35 +686,29 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
	dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
 
priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
-
-   /* Make sure pointer to data buffer is set */
-   wmb();
+   priv->tx_buff[*txbd_curr].skb = skb;
 
skb_tx_timestamp(skb);
 
*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
 
-   /* Make sure info word is set */
+   /* 1. Make sure that with respect to tx_clean everything is set up
+* properly before we advance txbd_curr.
+* 2. Make sure writes to DMA descriptors are completed before we inform
+* the hardware.
+*/
wmb();
 
-   priv->tx_buff[*txbd_curr].skb = skb;
-
/* Increment index to point to the next BD */
*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
 
-   /* Ensure that tx_clean() sees the new txbd_curr before
-* checking the queue status. This prevents an unneeded wake
-* of the queue in tx_clean().
+   /* Ensure we see the most recent value of txbd_dirty and tx_clean() sees
+* the updated value of txbd_curr.
 */
smp_mb();
 
-   if (!arc_emac_tx_avail(priv)) {
+   if (!arc_emac_tx_avail(priv))
netif_stop_queue(ndev);
-   /* Refresh tx_dirty */
-   smp_mb();
-   if (arc_emac_tx_avail(priv))
-   netif_start_queue(ndev);
-   }
 
arc_reg_set(priv, R_STATUS, TXPL_MASK);
 
-- 
1.9.1



Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-21 Thread Francois Romieu
Shuyu Wei  :
[...]
> diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
> index a3a9392..c2447b0 100644
> --- a/drivers/net/ethernet/arc/emac_main.c
> +++ b/drivers/net/ethernet/arc/emac_main.c
> @@ -686,6 +686,9 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
>  
> skb_tx_timestamp(skb);
>  
> +   /* Make sure timestamp is set */
> +   smp_wmb();

Should be dma_wmb() (see davem's message).

It's completely unrelated to SMP.

-- 
Ueimor


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-21 Thread Shuyu Wei
Looks like I got it wrong in the first place.

priv->tx_buff is not for the device, so there's no need to move it.
The race has been fixed by commit c278c253f3d9, I forgot to check
it out. That's my fault.

I do find another problem. We need to use a barrier to make sure
skb_tx_timestamp() is called before setting the FOR_EMAC flag.

According to the comment(include/linux/skbuff.h):
>/**
> * skb_tx_timestamp() - Driver hook for transmit timestamping
> *
> * Ethernet MAC Drivers should call this function in their hard_xmit()
> * function immediately before giving the sk_buff to the MAC hardware.
> *
> * Specifically, one should make absolutely sure that this function is
> * called before TX completion of this packet can trigger.  Otherwise
> * the packet could potentially already be freed.
> *
> * @skb: A socket buffer.
> */

---

diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
index a3a9392..c2447b0 100644
--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -686,6 +686,9 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
 
skb_tx_timestamp(skb);
 
+   /* Make sure timestamp is set */
+   smp_wmb();
+
*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
 
/* Make sure info word is set */



Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-21 Thread Shuyu Wei
On Thu, May 19, 2016 at 11:15:56PM +0200, Lino Sanfilippo wrote:
>  drivers/net/ethernet/arc/emac_main.c | 33 +
>  1 file changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
> index a3a9392..b986499 100644
> --- a/drivers/net/ethernet/arc/emac_main.c
> +++ b/drivers/net/ethernet/arc/emac_main.c
> @@ -162,8 +162,13 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>   struct sk_buff *skb = tx_buff->skb;
>   unsigned int info = le32_to_cpu(txbd->info);
>  
> - if ((info & FOR_EMAC) || !txbd->data || !skb)
> + if (info & FOR_EMAC) {
> + /* Make sure we see consistent values for info, skb
> +  * and data.
> +  */
> + rmb();
>   break;
> + }
>  
>   if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
>   stats->tx_errors++;
> @@ -680,35 +685,31 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
>   dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
>  
>   priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
> -
> - /* Make sure pointer to data buffer is set */
> - wmb();
> + priv->tx_buff[*txbd_curr].skb = skb;
>  
>   skb_tx_timestamp(skb);
>  
> - *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
> -
> - /* Make sure info word is set */
> + /* Make sure info is set after data and skb with respect to
> +  * tx_clean().
> +  */
>   wmb();
>  
> - priv->tx_buff[*txbd_curr].skb = skb;
> + *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
>  
>   /* Increment index to point to the next BD */
>   *txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
>  
> - /* Ensure that tx_clean() sees the new txbd_curr before
> + /* 1.Ensure that tx_clean() sees the new txbd_curr before
>* checking the queue status. This prevents an unneeded wake
>* of the queue in tx_clean().
> +  * 2.Ensure that all values are written to RAM and to DMA
> +  * before hardware is informed.
> +  * 3.Ensure we see the most recent value for tx_dirty.
>*/
> - smp_mb();
> + mb();
>  
> - if (!arc_emac_tx_avail(priv)) {
> + if (!arc_emac_tx_avail(priv))
>   netif_stop_queue(ndev);
> - /* Refresh tx_dirty */
> - smp_mb();
> - if (arc_emac_tx_avail(priv))
> - netif_start_queue(ndev);
> - }
>  
>   arc_reg_set(priv, R_STATUS, TXPL_MASK);
> 
> 
Hi Lino,
I applied your patch and tested it on my devices; it went straight to a
panic. So there must be something wrong.


[   13.084412] rockchip_emac 10204000.ethernet eth0: Link is Up - 100Mbps/Full 
- flow control rx/tx
[   13.093360] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   13.105915] Unable to handle kernel NULL pointer dereference at virtual address 00a8
[   13.114173] pgd = c0004000
[   13.117069] [00a8] *pgd=
[   13.120778] Internal error: Oops: 5 [#1] SMP ARM
[   13.125404] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6.0+ #88
[   13.131404] Hardware name: Rockchip (Device Tree)
[   13.136108] task: c0804e00 ti: c080 task.ti: c080
[   13.141515] PC is at __dev_kfree_skb_irq+0xc/0x8c
[   13.146223] LR is at arc_emac_poll+0x94/0x584
[   13.150581] pc : []lr : []psr: 60080113
[   13.150581] sp : c0801d88  ip : c0801da0  fp : c0801d9c
[   13.162046] r10: ee092000  r9 : ee093000  r8 : 0008
[   13.167268] r7 :   r6 : 0004  r5 : f08d0400  r4 : 0010
[   13.173789] r3 : 0001  r2 :   r1 : 0001  r0 : 
[   13.180312] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[   13.187441] Control: 10c5387d  Table: 8d24c04a  DAC: 0051
[   13.193184] Process swapper/0 (pid: 0, stack limit = 0xc0800210)
[   13.199187] Stack: (0xc0801d88 to 0xc0802000)
[   13.203548] 1d80:   0010 f08d0400 c0801df4 c0801da0 
c03c5e64 c046f588
[   13.211723] 1da0:  ee0cc154  ee0924a8 c0801dcc 0028 
c0806614 f08d0408
[   13.219897] 1dc0: ee0922a8 007f  ee0924a8 c03c5dd0 0028 
012c fffee77c
[   13.228071] 1de0: c0802100 c0801e18 c0801e54 c0801df8 c0471bfc c03c5ddc 
c0801e3c c06e5ae8
[   13.236245] 1e00: c08029f4 c08029f4 c083d821 2e86f000 c07486c0 eefb76c0 
c0801e18 c0801e18
[   13.244419] 1e20: c0801e20 c0801e20 ee841900  c080208c c080 
0100 0003
[   13.252594] 1e40: c0802080 4003 c0801eb4 c0801e58 c012810c c04719fc 
0001 ee808000
[   13.260768] 1e60: c08024a8 0020 c0802100 fffee77b 000a c06017ec 
c083f940 c07432f8
[   13.268941] 1e80: c0802080 c0801e58 c0801eb4 c07463cc   
0001 ee808000
[   13.277115] 1ea0: c08024a8 eefe01c0 c0801ec4 c0801eb8 c01284f8 c0127ff4 
c0801eec c0801ec8
[   

Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-19 Thread Francois Romieu
Lino Sanfilippo  :
[...]
> 2. requires a smp_wmb(), while 3. requires a rmb(). AFAIK the mb() implies
> all we need: the dma_wmb() for 1., the smp_wmb() for 2. and the rmb() for 3.

A revalidation of tx_dirty is still needed (see below) despite the rmb()
for 3. The rmb() for 3. is close to useless.

> >> 2. On multi processor systems: ensures that txbd_curr is updated (this is 
> >> paired
> >> with the smp_mb() at the end of tx_clean).
> > 
> > Smells like using barrier side-effects to control smp coherency. It isn't
> > the recommended style.
> > 
> 
> As I wrote above, the mb() implies the smp_wmb() so this is a regular pairing
> of smp_wmb() in xmit and smp_mb() in tx_clean, nothing uncommon.

Since you are quoting Documentation/memory-barriers.txt:
[...]
CPU MEMORY BARRIERS
---
[...]
Mandatory barriers should not be used to control SMP effects, since mandatory
barriers impose unnecessary overhead on both SMP and UP systems.

> >> -  if ((info & FOR_EMAC) || !txbd->data || !skb)
> >> +  if (info & FOR_EMAC) {
> >> +  /* Make sure we see consistent values for info, skb
> >> +   * and data.
> >> +   */
> >> +  smp_rmb();
> >>break;
> >> +  }
> > 
> > ?
> > 
> > smp_rmb should appear before the variables you want coherency for.
> 
> I don't think so. Please take a look into the memory barriers documentation.

> There is an example that is pretty close to the situation that we have in 
> this driver:
> 
> http://lxr.free-electrons.com/source/Documentation/memory-barriers.txt#L1819
> 
> In that example the barrier is also _between_ the variables that are to be
> ordered, not before. 

Err, which barrier?

- dma_rmb() ?
  The device (obviously) set the 'status' member of the descriptor.
  dma_rmb() ensures that device-initiated DMA is complete for the 'data'
  member as well.

- dma_wmb() ?
  It ensures that the updated 'data' member will be set before any
  DMA resulting from the change of descriptor ownership takes place.

- wmb() ?
  It ensures that the previous write to descriptor (coherent memory)
  completes before write is posted to I/O mailbox.

None of these is "pretty close" to the "smp_rmb() before return" situation.

> >> -  skb_tx_timestamp(skb);
> >> +  /* Make sure info is set after data and skb with respect to
> >> +   * other tx_clean().
> >> +   */
> >> +  smp_wmb();
> >>  
> >>*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
> > 
> > Afaik smp_wmb() does not imply wmb(). So priv->txbd[*txbd_curr].data and
> > *info (aka priv->txbd[*txbd_curr].info) are not necessarily written in
> > an orderly manner.
> 
> Right, as I already wrote above I changed the smp barriers to mandatory ones.
> 
> > 
> >>  
> >> -  /* Make sure info word is set */
> >> -  wmb();
> >> -
> >> -  priv->tx_buff[*txbd_curr].skb = skb;
> >> -
> >>/* Increment index to point to the next BD */
> >>*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
> > 
> > With this change it's possible that tx_clean() reads new value for
> > tx_curr and old value (0) for *info.
> 
> Even if this could happen, what is the problem? I can't see an issue
> that results from such a scenario.

tx_clean() misunderstands the 0 in *info as a descriptor updated by the
device. tx_clean() thus kfrees the skb before the device DMAed from it.

[...]
> > Xmit thread| Clean thread
> > 
> > mb();
> > 
> > arc_emac_tx_avail() test with old
> > tx_dirty - tx_clean has not issued
> > any mb yet - and new tx_curr
> > 
> >  smp_mb();
> > 
> >  if (netif_queue_stopped(ndev) && ...
> >  netif_wake_queue(ndev);
> > 
> > netif_stop_queue()
> > 
> > -> queue stopped.
> > 
> 
> Again, the mb() we have now implies the smb_mb() we had before. So nothing
> changed except that we can be sure to see the new value for tx_dirty at
> our first attempt.

Nothing changed except that you removed the revalidation part!

The smp_mb() we had before wasn't about seeing tx_dirty in the xmit thread
but about balancing the (read) barrier in the cleaning thread so that the
latter stood a chance to see the new (tx thread updated) tx_curr.

Consider the two lines below:

if (!arc_emac_tx_avail(priv))
netif_stop_queue(ndev);

Nothing prevents a complete run of the cleaning thread between these
two lines. It may or it may not happen but there is one thing sure:
mb() before arc_emac_tx_avail() can't tell the future.
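
For reference, the revalidation sequence being defended here is exactly what
the patch removes; in the current driver it reads (comment slightly expanded
here):

	if (!arc_emac_tx_avail(priv)) {
		netif_stop_queue(ndev);

		/* Refresh tx_dirty: pairs with the smp_mb() in
		 * tx_clean(), so a cleanup that ran between the check
		 * above and the stop is not missed.
		 */
		smp_mb();

		if (arc_emac_tx_avail(priv))
			netif_start_queue(ndev);
	}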

[...]
> New patch is below. 

The arc_emac_tx_clean() change is wrong.

tx_dirty revalidation is still needed in arc_emac_tx after netif_stop_queue.

A barrier is still missing in arc_emac_tx between descriptor release and
tx_curr increase.

-- 
Ueimor


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-19 Thread Lino Sanfilippo
Hi Francois,

On 19.05.2016 00:55, Francois Romieu wrote:

>> The smp_wmb() in tx function combined with the smp_rmb() in tx_clean ensures
>> that the CPU running tx_clean sees consistent values for info, data and skb 
>> (thus no need to check for validity of all three values any more).
>> The mb() fulfills several tasks:
>> 1. makes sure that DMA writes to descriptor are completed before the HW is
>> informed.
> 
> "DMA writes" == "CPU writes" ?
> 

I admit that phrasing was unfortunate. What I meant was that we have to make
sure that the writes to the descriptors, which reside in DMA memory, have
completed before we inform the hardware. This is because the write to the
mmio (arc_reg_set(R_STATUS, TXPL_MASK)) may otherwise overtake the writes to
the descriptors in DMA memory, and thus the hardware may find an incompletely
set up descriptor.
You can find a barrier for this reason in quite a lot of drivers. Normally a
dma_wmb() would be sufficient, e.g. here

http://lxr.free-electrons.com/source/drivers/net/ethernet/marvell/sky2.c#L1139

or here

http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgb/ixgb_main.c#L1468

but we have to use the mb() instead, because we also have to solve issues 2.
and 3.:

2. requires a smp_wmb(), while 3. requires a rmb(). AFAIK the mb() implies
all we need: the dma_wmb() for 1., the smp_wmb() for 2. and the rmb() for 3.

>> 2. On multi processor systems: ensures that txbd_curr is updated (this is 
>> paired
>> with the smp_mb() at the end of tx_clean).
> 
> Smells like using barrier side-effects to control smp coherency. It isn't
> the recommended style.
> 

As I wrote above, the mb() implies the smp_wmb() so this is a regular pairing
of smp_wmb() in xmit and smp_mb() in tx_clean, nothing uncommon.

>> -if ((info & FOR_EMAC) || !txbd->data || !skb)
>> +if (info & FOR_EMAC) {
>> +/* Make sure we see consistent values for info, skb
>> + * and data.
>> + */
>> +smp_rmb();
>>  break;
>> +}
> 
> ?
> 
> smp_rmb should appear before the variables you want coherency for.

I don't think so. Please take a look into the memory barriers documentation.
There is an example that is pretty close to the situation that we have in
this driver:

http://lxr.free-electrons.com/source/Documentation/memory-barriers.txt#L1819

In that example the barrier is also _between_ the variables that are to be
ordered, not before.
Also in the example a dma_rmb() is sufficient, because in the sample code
only reads to DMA memory have to be ordered. In our case we want to order
between DMA (info) and RAM ("data" and the skb).
However smp barriers are not sufficient (as I used them in the former patch),
since we do not only want to sync CPU-CPU but also CPU-DMA. Please see the
new patch I attached, in which mandatory barriers are used instead.


>> -skb_tx_timestamp(skb);
>> +/* Make sure info is set after data and skb with respect to
>> + * other tx_clean().
>> + */
>> +smp_wmb();
>>  
>>  *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
> 
> Afaik smp_wmb() does not imply wmb(). So priv->txbd[*txbd_curr].data and
> *info (aka priv->txbd[*txbd_curr].info) are not necessarily written in
> an orderly manner.

Right, as I already wrote above I changed the smp barriers to mandatory ones.

> 
>>  
>> -/* Make sure info word is set */
>> -wmb();
>> -
>> -priv->tx_buff[*txbd_curr].skb = skb;
>> -
>>  /* Increment index to point to the next BD */
>>  *txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
> 
> With this change it's possible that tx_clean() reads new value for
> tx_curr and old value (0) for *info.

Even if this could happen, what is the problem? I can't see an issue
that results from such a scenario.

>>   * of the queue in tx_clean().
>> + * 2.Ensure that all values are written to RAM and to DMA
>> + * before hardware is informed.
> 
> (I am not sure what "DMA" is supposed to mean here.)
> 

This is about the DMA/mmio race I described above.

>> + * 3.Ensure we see the most recent value for tx_dirty.
>>   */
>> -smp_mb();
>> +mb();
>>  
>> -if (!arc_emac_tx_avail(priv)) {
>> +if (!arc_emac_tx_avail(priv))
>>  netif_stop_queue(ndev);
>> -/* Refresh tx_dirty */
>> -smp_mb();
>> -if (arc_emac_tx_avail(priv))
>> -netif_start_queue(ndev);
>> -}
> 
> Xmit thread| Clean thread
> 
> mb();
> 
> arc_emac_tx_avail() test with old
> tx_dirty - tx_clean has not issued
> any mb yet - and new tx_curr
> 
>  smp_mb();
> 
>  if (netif_queue_stopped(ndev) && ...
>  netif_wake_queue(ndev);
> 
> netif_stop_queue()
> 
> -> queue stopped.
> 

Again, the mb() we have now implies 

Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-18 Thread Francois Romieu
Lino Sanfilippo  :
[...]
> what about the (only compile tested) code below?

I may have misunderstood some parts but it nonetheless seems broken.

> The smp_wmb() in tx function combined with the smp_rmb() in tx_clean ensures
> that the CPU running tx_clean sees consistent values for info, data and skb 
> (thus no need to check for validity of all three values any more).
> The mb() fulfills several tasks:
> 1. makes sure that DMA writes to descriptor are completed before the HW is
> informed.

"DMA writes" == "CPU writes" ?

> 2. On multi processor systems: ensures that txbd_curr is updated (this is 
> paired
> with the smp_mb() at the end of tx_clean).

Smells like using barrier side-effects to control smp coherency. It isn't
the recommended style.

> 3. Ensure we see the most recent value for tx_dirty. With this we do not have 
> to
> recheck after we stopped the tx queue.
> 
> 
> --- a/drivers/net/ethernet/arc/emac_main.c
> +++ b/drivers/net/ethernet/arc/emac_main.c
> @@ -162,8 +162,13 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>   struct sk_buff *skb = tx_buff->skb;
>   unsigned int info = le32_to_cpu(txbd->info);
>  
> - if ((info & FOR_EMAC) || !txbd->data || !skb)
> + if (info & FOR_EMAC) {
> + /* Make sure we see consistent values for info, skb
> +  * and data.
> +  */
> + smp_rmb();
>   break;
> + }

?

smp_rmb should appear before the variables you want coherency for.

>  
>   if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
>   stats->tx_errors++;
> @@ -679,36 +684,33 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
>   dma_unmap_addr_set(&priv->tx_buff[*txbd_curr], addr, addr);
>   dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
>  
> - priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
>  
> - /* Make sure pointer to data buffer is set */
> - wmb();
> + priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
> + priv->tx_buff[*txbd_curr].skb = skb;
>  
> - skb_tx_timestamp(skb);
> + /* Make sure info is set after data and skb with respect to
> +  * other tx_clean().
> +  */
> + smp_wmb();
>  
>   *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);

Afaik smp_wmb() does not imply wmb(). So priv->txbd[*txbd_curr].data and
*info (aka priv->txbd[*txbd_curr].info) are not necessarily written in
an orderly manner.

>  
> - /* Make sure info word is set */
> - wmb();
> -
> - priv->tx_buff[*txbd_curr].skb = skb;
> -
>   /* Increment index to point to the next BD */
>   *txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;

With this change it's possible that tx_clean() reads new value for
tx_curr and old value (0) for *info.

>  
> - /* Ensure that tx_clean() sees the new txbd_curr before
> + /* 1.Ensure that tx_clean() sees the new txbd_curr before
>* checking the queue status. This prevents an unneeded wake
>* of the queue in tx_clean().
> +  * 2.Ensure that all values are written to RAM and to DMA
> +  * before hardware is informed.

(I am not sure what "DMA" is supposed to mean here.)

> +  * 3.Ensure we see the most recent value for tx_dirty.
>*/
> - smp_mb();
> + mb();
>  
> - if (!arc_emac_tx_avail(priv)) {
> + if (!arc_emac_tx_avail(priv))
>   netif_stop_queue(ndev);
> - /* Refresh tx_dirty */
> - smp_mb();
> - if (arc_emac_tx_avail(priv))
> - netif_start_queue(ndev);
> - }

Xmit thread| Clean thread

mb();

arc_emac_tx_avail() test with old
tx_dirty - tx_clean has not issued
any mb yet - and new tx_curr

 smp_mb();

 if (netif_queue_stopped(ndev) && ...
 netif_wake_queue(ndev);

netif_stop_queue()

-> queue stopped.

You can't remove the revalidation step.

arc_emac_tx_avail() is essentially pessimistic. Even if arc_emac_tx_avail()
was "right", there would be a tx_clean window between arc_emac_tx_avail()
and netif_stop_queue().

> +
> + skb_tx_timestamp(skb);

You don't want to issue skb_tx_timestamp after releasing control of the
descriptor (*info = ...): skb may be long gone.
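
A hypothetical interleaving that shows the problem (tx_clean() running on
another CPU):

	CPU0: arc_emac_tx()              CPU1: arc_emac_tx_clean()
	-------------------              -------------------------
	*info = FOR_EMAC | ...;
	*txbd_curr = ...;
	                                 (device sends, clears FOR_EMAC)
	                                 info = le32_to_cpu(txbd->info);
	                                 /* FOR_EMAC clear: assume done */
	                                 dev_kfree_skb_irq(skb);
	skb_tx_timestamp(skb);           /* use-after-free */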

-- 
Ueimor


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-18 Thread Lino Sanfilippo
On 18.05.2016 02:01, Francois Romieu wrote:

> The smp_wmb() and wmb() could be made side-by-side once *info is
> updated but I don't see the adequate idiom to improve the smp_wmb + wmb
> combo. :o/
> 
>> And the wmb() looks like it should be a dma_wmb().
> 
> I see two points against it:
> - it could be too late for skb_tx_timestamp().
> - arc_emac_tx_clean must not see an index update before the device
>   got a chance to acquire the descriptor. arc_emac_tx_clean can't
>   tell the difference between an about-to-be-released descriptor
>   and a returned-from-device one.
> 

Hi,

what about the (only compile tested) code below?

The smp_wmb() in tx function combined with the smp_rmb() in tx_clean ensures
that the CPU running tx_clean sees consistent values for info, data and skb 
(thus no need to check for validity of all three values any more).

The mb() fulfills several tasks:
1. makes sure that DMA writes to descriptor are completed before the HW is
informed.
2. On multi processor systems: ensures that txbd_curr is updated (this is paired
with the smp_mb() at the end of tx_clean).
3. Ensure we see the most recent value for tx_dirty. With this we do not have to
recheck after we stopped the tx queue.


--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -162,8 +162,13 @@ static void arc_emac_tx_clean(struct net_device *ndev)
struct sk_buff *skb = tx_buff->skb;
unsigned int info = le32_to_cpu(txbd->info);
 
-   if ((info & FOR_EMAC) || !txbd->data || !skb)
+   if (info & FOR_EMAC) {
+   /* Make sure we see consistent values for info, skb
+* and data.
+*/
+   smp_rmb();
break;
+   }
 
if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
stats->tx_errors++;
@@ -679,36 +684,33 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
	dma_unmap_addr_set(&priv->tx_buff[*txbd_curr], addr, addr);
	dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
 
-   priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
 
-   /* Make sure pointer to data buffer is set */
-   wmb();
+   priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
+   priv->tx_buff[*txbd_curr].skb = skb;
 
-   skb_tx_timestamp(skb);
+   /* Make sure info is set after data and skb with respect to
+* other tx_clean().
+*/
+   smp_wmb();
 
*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
 
-   /* Make sure info word is set */
-   wmb();
-
-   priv->tx_buff[*txbd_curr].skb = skb;
-
/* Increment index to point to the next BD */
*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
 
-   /* Ensure that tx_clean() sees the new txbd_curr before
+   /* 1.Ensure that tx_clean() sees the new txbd_curr before
 * checking the queue status. This prevents an unneeded wake
 * of the queue in tx_clean().
+* 2.Ensure that all values are written to RAM and to DMA
+* before hardware is informed.
+* 3.Ensure we see the most recent value for tx_dirty.
 */
-   smp_mb();
+   mb();
 
-   if (!arc_emac_tx_avail(priv)) {
+   if (!arc_emac_tx_avail(priv))
netif_stop_queue(ndev);
-   /* Refresh tx_dirty */
-   smp_mb();
-   if (arc_emac_tx_avail(priv))
-   netif_start_queue(ndev);
-   }
+
+   skb_tx_timestamp(skb);
 
arc_reg_set(priv, R_STATUS, TXPL_MASK);
 
-- 
2.7.0

Regards,
Lino


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-17 Thread Francois Romieu
David Miller  :
> From: Shuyu Wei 
> Date: Tue, 17 May 2016 23:25:20 +0800
> 
> > diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
> > index a3a9392..df3dfef 100644
> > --- a/drivers/net/ethernet/arc/emac_main.c
> > +++ b/drivers/net/ethernet/arc/emac_main.c
> > @@ -153,9 +153,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
> >  {
> > struct arc_emac_priv *priv = netdev_priv(ndev);
> > 	struct net_device_stats *stats = &ndev->stats;
> > -   unsigned int i;
> >  
> > -   for (i = 0; i < TX_BD_NUM; i++) {
> > +   while (priv->txbd_dirty != priv->txbd_curr) {
> > 		unsigned int *txbd_dirty = &priv->txbd_dirty;
> > 		struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
> > 		struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
> > @@ -685,13 +684,15 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
> > wmb();
> >  
> > skb_tx_timestamp(skb);
> > +   priv->tx_buff[*txbd_curr].skb = skb;
> > +
> > +   dma_wmb();
> >  
> > *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
> >  
> > /* Make sure info word is set */
> > wmb();
> >  
> > -   priv->tx_buff[*txbd_curr].skb = skb;
> >  
> > /* Increment index to point to the next BD */
> > *txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
> > 
> 
> These memory barriers do not look correct to me.
> 
> dma_wmb() is about visibility between CPU reads/writes and device
> accesses to a piece of memory.  But what you're concerned about wrt.
> the SKB pointer assignment is CPU to CPU accesses.  Therefore something
> like smp_wmb() would be appropriate.

Something like:

skb_tx_timestamp(skb);

/* CPU write vs device access. Must be done before releasing control
 * of the descriptor (*info).
 */
dma_wmb();

priv->tx_buff[*txbd_curr].skb = skb;

/* CPU arc_emac_tx_clean vs CPU arc_emac_tx. Must be done before
 * index (tx_curr) update. It does not necessarily deserve to be done
 * before releasing control of the descriptor (*info) due to
 * descriptor vs index ordering.
 *
 * FIXME: missing smp_rmb before the while loop in arc_emac_tx_clean.
 */
smp_wmb();

*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);

/* local descriptor (*info) update vs index (tx_curr) update. */
wmb();

*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;

smp_mb();   // The driver already contains this one.

The smp_wmb() and wmb() could be made side-by-side once *info is
updated but I don't see the adequate idiom to improve the smp_wmb + wmb
combo. :o/

> And the wmb() looks like it should be a dma_wmb().

I see two points against it:
- it could be too late for skb_tx_timestamp().
- arc_emac_tx_clean must not see an index update before the device
  got a chance to acquire the descriptor. arc_emac_tx_clean can't
  tell the difference between an about-to-be-released descriptor
  and a returned-from-device one.

-- 
Ueimor


Re: [PATCH v2] ethernet:arc: Fix racing of TX ring buffer

2016-05-17 Thread David Miller
From: Shuyu Wei 
Date: Tue, 17 May 2016 23:25:20 +0800

> diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
> index a3a9392..df3dfef 100644
> --- a/drivers/net/ethernet/arc/emac_main.c
> +++ b/drivers/net/ethernet/arc/emac_main.c
> @@ -153,9 +153,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>  {
>   struct arc_emac_priv *priv = netdev_priv(ndev);
> 	struct net_device_stats *stats = &ndev->stats;
> - unsigned int i;
>  
> - for (i = 0; i < TX_BD_NUM; i++) {
> + while (priv->txbd_dirty != priv->txbd_curr) {
> 		unsigned int *txbd_dirty = &priv->txbd_dirty;
> 		struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
> 		struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
> @@ -685,13 +684,15 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
>   wmb();
>  
>   skb_tx_timestamp(skb);
> + priv->tx_buff[*txbd_curr].skb = skb;
> +
> + dma_wmb();
>  
>   *info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
>  
>   /* Make sure info word is set */
>   wmb();
>  
> - priv->tx_buff[*txbd_curr].skb = skb;
>  
>   /* Increment index to point to the next BD */
>   *txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
> 

These memory barriers do not look correct to me.

dma_wmb() is about visibility between CPU reads/writes and device
accesses to a piece of memory.  But what you're concerned about wrt.
the SKB pointer assignment is CPU to CPU accesses.  Therefore something
like smp_wmb() would be appropriate.

And the wmb() looks like it should be a dma_wmb().
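

A rough summary of the barrier vocabulary used throughout this thread
(paraphrasing Documentation/memory-barriers.txt; see that file for the
authoritative definitions):

	/*
	 * smp_wmb()/smp_rmb() - order CPU stores/loads as seen by other
	 *                       CPUs; reduce to a compiler barrier on
	 *                       non-SMP builds.
	 * dma_wmb()/dma_rmb() - order CPU accesses to coherent (DMA)
	 *                       memory with respect to device accesses.
	 * wmb()/rmb()/mb()    - mandatory barriers; order accesses to
	 *                       memory and I/O, and are generally at
	 *                       least as strong as the variants above.
	 */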