[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-23 Thread Jerin Jacob
On Sat, Jul 23, 2016 at 12:32:01PM +, Ananyev, Konstantin wrote:
> 
> 
> > -Original Message-
> > From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> > Sent: Saturday, July 23, 2016 12:49 PM
> > To: Ananyev, Konstantin 
> > Cc: Thomas Monjalon ; Juhamatti Kuusisaari 
> > ; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee 
> > ordering before tail update
> > 
> > On Sat, Jul 23, 2016 at 11:15:27AM +, Ananyev, Konstantin wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> > > > Sent: Saturday, July 23, 2016 11:39 AM
> > > > To: Ananyev, Konstantin 
> > > > Cc: Thomas Monjalon ; Juhamatti
> > > > Kuusisaari ; dev at dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH] lib: change rte_ring dequeue to
> > > > guarantee ordering before tail update
> > > >
> > > > On Sat, Jul 23, 2016 at 10:14:51AM +, Ananyev, Konstantin wrote:
> > > > > Hi lads,
> > > > >
> > > > > > On Sat, Jul 23, 2016 at 11:02:33AM +0200, Thomas Monjalon wrote:
> > > > > > > 2016-07-23 8:05 GMT+02:00 Jerin Jacob :
> > > > > > > > On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
> > > > > > > >> > > Consumer queue dequeuing must be guaranteed to be done
> > > > > > > >> > > fully before the tail is updated. This is not
> > > > > > > >> > > guaranteed with a read barrier, changed to a write
> > > > > > > >> > > barrier just before tail update which in
> > > > > > practice guarantees correct order of reads and writes.
> > > > > > > >> > >
> > > > > > > >> > > Signed-off-by: Juhamatti Kuusisaari
> > > > > > > >> > > 
> > > > > > > >> >
> > > > > > > >> > Acked-by: Konstantin Ananyev
> > > > > > > >> > 
> > > > > > > >>
> > > > > > > >> Applied, thanks
> > > > > > > >
> > > > > > > > There was ongoing discussion on this
> > > > > > > > http://dpdk.org/ml/archives/dev/2016-July/044168.html
> > > > > > >
> > > > > > > Sorry Jerin, I forgot this email.
> > > > > > > The problem is that nobody replied to your email and you did
> > > > > > > not nack the v2 of this patch.
> > > > >
> > > > > It's probably my bad.
> > > > > I acked the patch before Jerin's response, and forgot to reply later.
> > > > >
> > > > > > >
> > > > > > > > This change may not be required, as it has a performance impact.
> > > > > > >
> > > > > > > We need to clearly understand what the performance impact is
> > > > > > > (numbers and use cases) on one hand, and whether a real bug is
> > > > > > > fixed by this patch on the other hand.
> > > > > >
> > > > > > IMHO, there is no real bug here. rte_smp_rmb() provides the
> > > > > > LOAD-STORE barrier to make sure the tail pointer WRITE happens only
> > > > > > after the prior LOADS.
> > > > >
> > > > > Yep, from what I read at the link Jerin provided, indeed it seems 
> > > > > rte_smp_rmb() is enough for the arm arch here...
> > > > > For ppc, as I can see, both rte_smp_rmb()/rte_smp_wmb() emit the same 
> > > > > instruction.
> > > > >
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > Wonder how big the performance impact is?
> > > >
> > > > With this change we need to wait for additional STORES to be completed
> > > > to the local buffer, in addition to the LOADS from the ring buffer memory.
> > >
> > > I understand that, just wonder did you see any real performance 
> > > difference?
> > 
> > Yeah...
> 
> Ok, then I don't see any good reason why we shouldn't revert it.
> I suppose the best way would be to submit a new patch for RC5 to revert the 
> changes.
> Do you prefer to submit it yourself and I'll ack it, or vice versa?

OK. I will submit it then.

> Thanks
> Konstantin 
> 
> > 
> > > Probably with ring_perf_autotest/mempool_perf_autotest or something?
> > 
> > W/O change
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 4
> > MP/MC single enq/dequeue: 16
> > SP/SC burst enq/dequeue (size: 8): 0
> > MP/MC burst enq/dequeue (size: 8): 2
> > SP/SC burst enq/dequeue (size: 32): 0
> > MP/MC burst enq/dequeue (size: 32): 0
> > 
> > ### Testing empty dequeue ###
> > SC empty dequeue: 0.35
> > MC empty dequeue: 0.60
> > 
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 0.93
> > MP/MC bulk enq/dequeue (size: 8): 2.45
> > SP/SC bulk enq/dequeue (size: 32): 0.58
> > MP/MC bulk enq/dequeue (size: 32): 0.97
> > 
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 1.89
> > MP/MC bulk enq/dequeue (size: 8): 4.28
> > SP/SC bulk enq/dequeue (size: 32): 0.90
> > MP/MC bulk enq/dequeue (size: 32): 1.19
> > Test OK
> > RTE>>
> > 
> > With change
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 6
> > MP/MC single enq/dequeue: 16
> > SP/SC burst enq/dequeue (size: 8): 1
> > MP/MC burst enq/dequeue (size: 8): 2
> > SP/SC burst enq/dequeue (size: 32): 0
> > MP/MC burst enq/dequeue (size: 32): 0
> > 
> > ### 

[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-23 Thread Jerin Jacob
On Sat, Jul 23, 2016 at 11:15:27AM +, Ananyev, Konstantin wrote:
> 
> 
> > -Original Message-
> > From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> > Sent: Saturday, July 23, 2016 11:39 AM
> > To: Ananyev, Konstantin 
> > Cc: Thomas Monjalon ; Juhamatti Kuusisaari 
> > ; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee 
> > ordering before tail update
> > 
> > On Sat, Jul 23, 2016 at 10:14:51AM +, Ananyev, Konstantin wrote:
> > > Hi lads,
> > >
> > > > On Sat, Jul 23, 2016 at 11:02:33AM +0200, Thomas Monjalon wrote:
> > > > > 2016-07-23 8:05 GMT+02:00 Jerin Jacob :
> > > > > > On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
> > > > > >> > > Consumer queue dequeuing must be guaranteed to be done
> > > > > >> > > fully before the tail is updated. This is not guaranteed
> > > > > >> > > with a read barrier, changed to a write barrier just before
> > > > > >> > > tail update which in
> > > > practice guarantees correct order of reads and writes.
> > > > > >> > >
> > > > > >> > > Signed-off-by: Juhamatti Kuusisaari
> > > > > >> > > 
> > > > > >> >
> > > > > >> > Acked-by: Konstantin Ananyev 
> > > > > >>
> > > > > >> Applied, thanks
> > > > > >
> > > > > > There was ongoing discussion on this
> > > > > > http://dpdk.org/ml/archives/dev/2016-July/044168.html
> > > > >
> > > > > Sorry Jerin, I forgot this email.
> > > > > The problem is that nobody replied to your email and you did not
> > > > > nack the v2 of this patch.
> > >
> > > It's probably my bad.
> > > I acked the patch before Jerin's response, and forgot to reply later.
> > >
> > > > >
> > > > > > This change may not be required, as it has a performance impact.
> > > > >
> > > > > We need to clearly understand what the performance impact is
> > > > > (numbers and use cases) on one hand, and whether a real bug is fixed
> > > > > by this patch on the other hand.
> > > >
> > > > IMHO, there is no real bug here. rte_smp_rmb() provides the
> > > > LOAD-STORE barrier to make sure the tail pointer WRITE happens only after
> > > > the prior LOADS.
> > >
> > > Yep, from what I read at the link Jerin provided, indeed it seems 
> > > rte_smp_rmb() is enough for the arm arch here...
> > > For ppc, as I can see, both rte_smp_rmb()/rte_smp_wmb() emit the same 
> > > instruction.
> > >
> > > >
> > > > Thoughts?
> > >
> > > Wonder how big the performance impact is?
> >
> > With this change we need to wait for additional STORES to be completed to
> > the local buffer, in addition to the LOADS from the ring buffer memory.
> 
> I understand that, just wonder did you see any real performance difference?

Yeah...

> Probably with ring_perf_autotest/mempool_perf_autotest or something?

W/O change 
RTE>>ring_perf_autotest 
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 4
MP/MC single enq/dequeue: 16
SP/SC burst enq/dequeue (size: 8): 0
MP/MC burst enq/dequeue (size: 8): 2
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.35
MC empty dequeue: 0.60

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 0.93
MP/MC bulk enq/dequeue (size: 8): 2.45
SP/SC bulk enq/dequeue (size: 32): 0.58
MP/MC bulk enq/dequeue (size: 32): 0.97

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 1.89
MP/MC bulk enq/dequeue (size: 8): 4.28
SP/SC bulk enq/dequeue (size: 32): 0.90
MP/MC bulk enq/dequeue (size: 32): 1.19
Test OK
RTE>>

With change
RTE>>ring_perf_autotest 
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 6
MP/MC single enq/dequeue: 16
SP/SC burst enq/dequeue (size: 8): 1
MP/MC burst enq/dequeue (size: 8): 2
SP/SC burst enq/dequeue (size: 32): 0
MP/MC burst enq/dequeue (size: 32): 0

### Testing empty dequeue ###
SC empty dequeue: 0.35
MC empty dequeue: 0.60

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 1.28
MP/MC bulk enq/dequeue (size: 8): 2.47
SP/SC bulk enq/dequeue (size: 32): 0.64
MP/MC bulk enq/dequeue (size: 32): 0.97

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 2.08
MP/MC bulk enq/dequeue (size: 8): 4.29
SP/SC bulk enq/dequeue (size: 32): 1.24
MP/MC bulk enq/dequeue (size: 32): 1.19
Test OK

> Konstantin 
> 
> > 
> > > If there is a real one, I suppose we can revert the patch?
> > 
> > Request to revert this one, as there are no benefits for other architectures, and
> > indeed it creates additional delay in waiting for the STORES to complete
> > on ARM.
> > Let's do the correct thing by reverting it.
> > 
> > Jerin
> > 
> > 
> > 
> > > Konstantin
> > >
> > > >
> > > > >
> > > > > Please guys make things clear and we'll revert if needed.


[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-23 Thread Jerin Jacob
On Sat, Jul 23, 2016 at 10:14:51AM +, Ananyev, Konstantin wrote:
> Hi lads,
> 
> > On Sat, Jul 23, 2016 at 11:02:33AM +0200, Thomas Monjalon wrote:
> > > 2016-07-23 8:05 GMT+02:00 Jerin Jacob :
> > > > On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
> > > >> > > Consumer queue dequeuing must be guaranteed to be done fully
> > > >> > > before the tail is updated. This is not guaranteed with a read 
> > > >> > > barrier, changed to a write barrier just before tail update which 
> > > >> > > in
> > practice guarantees correct order of reads and writes.
> > > >> > >
> > > >> > > Signed-off-by: Juhamatti Kuusisaari
> > > >> > > 
> > > >> >
> > > >> > Acked-by: Konstantin Ananyev 
> > > >>
> > > >> Applied, thanks
> > > >
> > > > There was ongoing discussion on this
> > > > http://dpdk.org/ml/archives/dev/2016-July/044168.html
> > >
> > > Sorry Jerin, I forgot this email.
> > > The problem is that nobody replied to your email and you did not nack
> > > the v2 of this patch.
> 
> It's probably my bad.
> I acked the patch before Jerin's response, and forgot to reply later.
> 
> > >
> > > > This change may not be required, as it has a performance impact.
> > >
> > > We need to clearly understand what the performance impact is (numbers
> > > and use cases) on one hand, and whether a real bug is fixed by this
> > > patch on the other hand.
> > 
> > IMHO, there is no real bug here. rte_smp_rmb() provides the LOAD-STORE
> > barrier to make sure the tail pointer WRITE happens only after the prior
> > LOADS.
> 
> Yep, from what I read at the link Jerin provided, indeed it seems 
> rte_smp_rmb() is enough for the arm arch here...
> For ppc, as I can see, both rte_smp_rmb()/rte_smp_wmb() emit the same 
> instruction.
> 
> > 
> > Thoughts?
> 
> Wonder how big the performance impact is?

With this change we need to wait for additional STORES to be completed to
the local buffer, in addition to the LOADS from the ring buffer memory.

> If there is a real one, I suppose we can revert the patch?

Request to revert this one, as there are no benefits for other architectures,
and indeed it creates additional delay in waiting for the STORES to complete on ARM.
Let's do the correct thing by reverting it.

Jerin



> Konstantin 
> 
> > 
> > >
> > > Please guys make things clear and we'll revert if needed.


[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-23 Thread Jerin Jacob
On Sat, Jul 23, 2016 at 11:02:33AM +0200, Thomas Monjalon wrote:
> 2016-07-23 8:05 GMT+02:00 Jerin Jacob :
> > On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
> >> > > Consumer queue dequeuing must be guaranteed to be done fully before 
> >> > > the tail is updated. This is not guaranteed with a read barrier,
> >> > > changed to a write barrier just before tail update which in practice 
> >> > > guarantees correct order of reads and writes.
> >> > >
> >> > > Signed-off-by: Juhamatti Kuusisaari 
> >> >
> >> > Acked-by: Konstantin Ananyev 
> >>
> >> Applied, thanks
> >
> > There was ongoing discussion on this
> > http://dpdk.org/ml/archives/dev/2016-July/044168.html
> 
> Sorry Jerin, I forgot this email.
> The problem is that nobody replied to your email and you did not nack
> the v2 of this patch.
> 
> > This change may not be required, as it has a performance impact.
> 
> We need to clearly understand what the performance impact is
> (numbers and use cases) on one hand, and whether a real bug is fixed
> by this patch on the other hand.

IMHO, there is no real bug here. rte_smp_rmb() provides the LOAD-STORE
barrier to make sure the tail pointer WRITE happens only after the prior LOADS.
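
For reference, a minimal sketch of the consumer path being discussed. This is
a simplification assuming the 16.07 rte_ring layout (field names approximate,
error handling omitted), not the verbatim library code:

    /* Simplified single-consumer dequeue: copy the object pointers out of the
     * ring, then publish the new consumer tail. The patch under discussion
     * replaces the rte_smp_rmb() below with rte_smp_wmb(). */
    static inline void
    ring_sc_dequeue_sketch(struct rte_ring *r, void **obj_table, unsigned int n)
    {
        uint32_t cons_head = r->cons.head;
        uint32_t cons_next = cons_head + n;
        unsigned int i;

        /* LOADS from the ring memory into the caller's local buffer. */
        for (i = 0; i < n; i++)
            obj_table[i] = r->ring[(cons_head + i) & r->cons.mask];

        /* The argument here: rte_smp_rmb() already orders the LOADS above
         * before the tail STORE below (load-store ordering on ARM), while
         * rte_smp_wmb() instead makes the tail STORE wait for the stores
         * into obj_table -- the extra cost measured on ARM. */
        rte_smp_rmb();

        r->cons.tail = cons_next;  /* publish: producer may now reuse the slots */
    }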

Thoughts?

> 
> Please guys make things clear and we'll revert if needed.


[dpdk-dev] [RFC] Generic flow director/filtering/classification API

2016-07-23 Thread John Fastabend
On 16-07-21 12:20 PM, Adrien Mazarguil wrote:
> Hi Jerin,
> 
> Sorry, looks like I missed your reply. Please see below.
> 

Hi Adrien,

Sorry for the slight delay, but here are a few comments that may be worth considering.

To start with, I completely agree with the general problem statement and the
nice summary of all the current models. This is also a good start.

> 
> Considering that allowed pattern/actions combinations cannot be known in
> advance and would result in an impractically large number of capabilities to
> expose, a method is provided to validate a given rule from the current
> device configuration state without actually adding it (akin to a "dry run"
> mode).

Rather than have a query/validate process, why did we skip over having an
intermediate representation of the capabilities? Here you state it is
impractical, but we know how to represent parse graphs, and the drivers
could report their supported parse graph via a single query to a middle
layer.

This will actually reduce the message chatter: imagine many applications at
init time, or boundary cases where a large set of applications comes
online and starts banging on the interface all at once; that seems less
than ideal.

Worse, in my opinion, it requires all drivers to write mostly duplicated
validation code, where a common layer could easily do this if every
driver instead reported a common data structure representing its parse
graph. The nice fallout of this upfront effort is that the driver
no longer needs to do error handling/checking/etc. and can assume all
rules are correct and valid. It makes driver code much simpler to
support. And IMO, at least, by doing this we get some other nice benefits
described below.
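
As a purely hypothetical illustration of what such a common data structure
could look like (none of these names exist in DPDK; this is only a sketch of
the idea), a driver might export its parse graph as a flat table of nodes:

    /* Hypothetical sketch, not an existing DPDK structure. Each node names a
     * header type the hardware can parse and lists which headers may legally
     * follow it, so a common middle layer can validate rules without any
     * driver-specific code. */
    struct flow_parse_node {
        uint32_t key;               /* well-known or driver-assigned header id */
        const char *name;           /* e.g. "eth", "ipv4", "udp" */
        uint32_t max_match_len;     /* bytes of this header the HW can match on */
        const uint32_t *next_keys;  /* keys of headers allowed after this one */
        uint32_t nb_next;           /* number of entries in next_keys */
    };

    struct flow_parse_graph {
        const struct flow_parse_node *nodes;
        uint32_t nb_nodes;
    };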

Another related question is about performance.

> Creation
> 
> 
> Creating a flow rule is similar to validating one, except the rule is
> actually created.
> 
> ::
> 
>  struct rte_flow *
>  rte_flow_create(uint8_t port_id,
>  const struct rte_flow_pattern *pattern,
>  const struct rte_flow_actions *actions);

I gather this implies that each driver must parse the pattern/action
block and map this onto the hardware. How many rules per second can this
support? I've run into systems that expect a level of service somewhere
around 50k cmds per second. So bulking will help at the message level
but it seems like a lot of overhead to unpack the pattern/action section.

One strategy I've used in other systems that worked relatively well
is this: if the query for the parse graph above returns a key for each node
in the graph, then a single lookup can map the key to a node. It's
unambiguous, and these operations simply become a table lookup.
So, to be a bit more concrete, this changes the pattern structure in
rte_flow_create() into a <key, value> tuple where the key is known
from the initial parse graph query. If you reserve a set of well-defined
key values for well-known protocols like ethernet, ip, etc., then the
query model also works, but the middle layer catches errors in this case,
and again the driver only gets known-good flows. So something like this:

  struct rte_flow_pattern {
        uint32_t priority;      /* rule priority */
        uint32_t key;           /* header/node key from the parse graph query */
        uint32_t value_length;  /* length of the match value, in bytes */
        uint8_t *value;         /* bytes to match at that header */
  };
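
For instance, a rule matching a fixed IPv4 destination could then be expressed
as a flat array of such tuples. A purely hypothetical sketch (the FLOW_KEY_*
values are assumed to be reserved by the middle layer and exist nowhere today;
rte_flow_pattern is the struct proposed just above):

    /* Hypothetical well-known keys reserved by the middle layer. */
    #define FLOW_KEY_ETH       1
    #define FLOW_KEY_IPV4_DST  2

    uint32_t ipv4_dst = 0xc0a80001;  /* 192.168.0.1 */

    struct rte_flow_pattern pattern[] = {
        { .priority = 0, .key = FLOW_KEY_ETH, .value_length = 0,
          .value = NULL },                    /* any ethernet header */
        { .priority = 0, .key = FLOW_KEY_IPV4_DST, .value_length = sizeof(ipv4_dst),
          .value = (uint8_t *)&ipv4_dst },    /* match this IPv4 destination */
    };

The middle layer would then walk this array against the driver's reported
parse graph, so the driver itself only ever sees known-good flows.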

Also, if we have multiple tables, what do you think about adding a
table_id to the signature? It is probably not needed in the first generation,
but it is likely useful for hardware with multiple tables, so that it
would be:

   rte_flow_create(uint8_t port_id, uint8_t table_id, ...);

Finally, one other problem we've had, which would be great to address
if we are doing a rewrite of the API, is adding new protocols to
already-deployed DPDK stacks. This is mostly a Linux distribution
problem, where you can't easily update DPDK.

In the prototype header linked in this document, it seems adding new
headers requires adding a new value to the rte_flow_item_type enum, but there
is at least an attempt at a catch-all here:

>   /**
>* Matches a string of a given length at a given offset (in bytes),
>* or anywhere in the payload of the current protocol layer
>* (including L2 header if used as the first item in the stack).
>*
>* See struct rte_flow_item_raw.
>*/
>   RTE_FLOW_ITEM_TYPE_RAW,

Actually this is a nice implementation, because it works after the
previous item in the stack, correct? So you can put it after "known"
variable-length headers like IP. The limitation is that it can't get past
undefined variable-length headers. However, if you use the parse-graph
reporting mechanism described above, and the driver always reports
its largest supported graph, then we don't have this issue where a new
hardware sku/ucode/etc. adds support for new headers but we have no
way to deploy it to existing software users without recompiling and
redeploying.

I looked at the git repo, but I only saw the header definition; I guess
the implementation is TBD until there is enough agreement on the
interface?

Thanks,
John


[dpdk-dev] Configuring NIC Tx arbiters in VMDK off, DCB off mode

2016-07-23 Thread Lavanya Jose
Hi everyone,

I found a snippet of code from a userspace driver that lets you configure
weights for the hardware NIC tx queues by configuring the RTTDT1C register
in the Intel 82599.
It looks like this is typically used for rate limiting VM traffic, with the
NIC in VMDQ mode. I was wondering whether I can do this without setting up
the NIC in VMDQ or DCB mode?

I am directly enqueuing CustomProto/UDP/IP/Ethernet packets from a single
process (two threads) onto specific hardware queues (without an
intermediate DPDK QoS/hierarchical scheduler layer) and updating the
hardware queue rates, so priority/hardware rate limiting are the only
features I need (not VMDQ, DCB). Can I go about just changing the weights
as in the code above with VMDQ and DCB turned off?

Some initial experiments I did suggest that setting just the RTTDT1C
register for the queues doesn't make any difference to the relative throughput
of the queues.
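
For reference, the usual shape of such a snippet is an indexed register write:
select the queue in RTTDQSEL, then write the per-queue credit/weight to
RTTDT1C. A minimal sketch, assuming the ixgbe base-driver macros
(IXGBE_RTTDQSEL, IXGBE_RTTDT1C, IXGBE_WRITE_REG, IXGBE_WRITE_FLUSH) and direct
access to the device's ixgbe_hw handle; illustrative only, and whether it takes
effect with the VMDq/DCB arbiters disabled is exactly the open question here:

    /* Illustrative only: program a per-queue Tx descriptor-plane credit on an
     * 82599 by selecting the queue index, then writing its T1 config. */
    static void
    set_txq_weight(struct ixgbe_hw *hw, uint32_t queue, uint32_t credits)
    {
        IXGBE_WRITE_REG(hw, IXGBE_RTTDQSEL, queue);  /* select the Tx queue */
        IXGBE_WRITE_REG(hw, IXGBE_RTTDT1C, credits); /* credit refill / weight */
        IXGBE_WRITE_FLUSH(hw);                       /* post the writes */
    }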

Thanks,
Lavanya


[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-23 Thread Ananyev, Konstantin


> -Original Message-
> From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> Sent: Saturday, July 23, 2016 12:49 PM
> To: Ananyev, Konstantin 
> Cc: Thomas Monjalon ; Juhamatti Kuusisaari 
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee 
> ordering before tail update
> 
> On Sat, Jul 23, 2016 at 11:15:27AM +, Ananyev, Konstantin wrote:
> >
> >
> > > -Original Message-
> > > From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> > > Sent: Saturday, July 23, 2016 11:39 AM
> > > To: Ananyev, Konstantin 
> > > Cc: Thomas Monjalon ; Juhamatti
> > > Kuusisaari ; dev at dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH] lib: change rte_ring dequeue to
> > > guarantee ordering before tail update
> > >
> > > On Sat, Jul 23, 2016 at 10:14:51AM +, Ananyev, Konstantin wrote:
> > > > Hi lads,
> > > >
> > > > > On Sat, Jul 23, 2016 at 11:02:33AM +0200, Thomas Monjalon wrote:
> > > > > > 2016-07-23 8:05 GMT+02:00 Jerin Jacob :
> > > > > > > On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
> > > > > > >> > > Consumer queue dequeuing must be guaranteed to be done
> > > > > > >> > > fully before the tail is updated. This is not
> > > > > > >> > > guaranteed with a read barrier, changed to a write
> > > > > > >> > > barrier just before tail update which in
> > > > > practice guarantees correct order of reads and writes.
> > > > > > >> > >
> > > > > > >> > > Signed-off-by: Juhamatti Kuusisaari
> > > > > > >> > > 
> > > > > > >> >
> > > > > > >> > Acked-by: Konstantin Ananyev
> > > > > > >> > 
> > > > > > >>
> > > > > > >> Applied, thanks
> > > > > > >
> > > > > > > There was ongoing discussion on this
> > > > > > > http://dpdk.org/ml/archives/dev/2016-July/044168.html
> > > > > >
> > > > > > Sorry Jerin, I forgot this email.
> > > > > > The problem is that nobody replied to your email and you did
> > > > > > not nack the v2 of this patch.
> > > >
> > > > It's probably my bad.
> > > > I acked the patch before Jerin's response, and forgot to reply later.
> > > >
> > > > > >
> > > > > > > This change may not be required, as it has a performance impact.
> > > > > >
> > > > > > We need to clearly understand what the performance impact is
> > > > > > (numbers and use cases) on one hand, and whether a real bug is
> > > > > > fixed by this patch on the other hand.
> > > > >
> > > > > IMHO, there is no real bug here. rte_smp_rmb() provides the
> > > > > LOAD-STORE barrier to make sure the tail pointer WRITE happens only after
> > > > > the prior LOADS.
> > > >
> > > > Yep, from what I read at the link Jerin provided, indeed it seems 
> > > > rte_smp_rmb() is enough for the arm arch here...
> > > > For ppc, as I can see, both rte_smp_rmb()/rte_smp_wmb() emit the same 
> > > > instruction.
> > > >
> > > > >
> > > > > Thoughts?
> > > >
> > > > Wonder how big the performance impact is?
> > >
> > > With this change we need to wait for additional STORES to be completed to
> > > the local buffer, in addition to the LOADS from the ring buffer memory.
> >
> > I understand that, just wonder did you see any real performance difference?
> 
> Yeah...

Ok, then I don't see any good reason why we shouldn't revert it.
I suppose the best way would be to submit a new patch for RC5 to revert the 
changes.
Do you prefer to submit it yourself and I'll ack it, or vice versa?
Thanks
Konstantin 

> 
> > Probably with ring_perf_autotest/mempool_perf_autotest or something?
> 
> W/O change
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 4
> MP/MC single enq/dequeue: 16
> SP/SC burst enq/dequeue (size: 8): 0
> MP/MC burst enq/dequeue (size: 8): 2
> SP/SC burst enq/dequeue (size: 32): 0
> MP/MC burst enq/dequeue (size: 32): 0
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 0.35
> MC empty dequeue: 0.60
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 0.93
> MP/MC bulk enq/dequeue (size: 8): 2.45
> SP/SC bulk enq/dequeue (size: 32): 0.58
> MP/MC bulk enq/dequeue (size: 32): 0.97
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 1.89
> MP/MC bulk enq/dequeue (size: 8): 4.28
> SP/SC bulk enq/dequeue (size: 32): 0.90
> MP/MC bulk enq/dequeue (size: 32): 1.19
> Test OK
> RTE>>
> 
> With change
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 6
> MP/MC single enq/dequeue: 16
> SP/SC burst enq/dequeue (size: 8): 1
> MP/MC burst enq/dequeue (size: 8): 2
> SP/SC burst enq/dequeue (size: 32): 0
> MP/MC burst enq/dequeue (size: 32): 0
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 0.35
> MC empty dequeue: 0.60
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 1.28
> MP/MC bulk enq/dequeue (size: 8): 2.47
> SP/SC bulk enq/dequeue (size: 32): 0.64
> MP/MC bulk enq/dequeue (size: 32): 0.97
> 
> ### Testing using two physical cores ### SP/SC bulk enq/dequeue (size: 8): 
> 

[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-23 Thread Jerin Jacob
On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
> > > Consumer queue dequeuing must be guaranteed to be done fully before the 
> > > tail is updated. This is not guaranteed with a read barrier,
> > > changed to a write barrier just before tail update which in practice 
> > > guarantees correct order of reads and writes.
> > > 
> > > Signed-off-by: Juhamatti Kuusisaari 
> > 
> > Acked-by: Konstantin Ananyev 
> 
> Applied, thanks

There was ongoing discussion on this
http://dpdk.org/ml/archives/dev/2016-July/044168.html
This change may not be required, as it has a performance impact.





[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-23 Thread Ananyev, Konstantin


> -Original Message-
> From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> Sent: Saturday, July 23, 2016 11:39 AM
> To: Ananyev, Konstantin 
> Cc: Thomas Monjalon ; Juhamatti Kuusisaari 
> ; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee 
> ordering before tail update
> 
> On Sat, Jul 23, 2016 at 10:14:51AM +, Ananyev, Konstantin wrote:
> > Hi lads,
> >
> > > On Sat, Jul 23, 2016 at 11:02:33AM +0200, Thomas Monjalon wrote:
> > > > 2016-07-23 8:05 GMT+02:00 Jerin Jacob :
> > > > > On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
> > > > >> > > Consumer queue dequeuing must be guaranteed to be done
> > > > >> > > fully before the tail is updated. This is not guaranteed
> > > > >> > > with a read barrier, changed to a write barrier just before
> > > > >> > > tail update which in
> > > practice guarantees correct order of reads and writes.
> > > > >> > >
> > > > >> > > Signed-off-by: Juhamatti Kuusisaari
> > > > >> > > 
> > > > >> >
> > > > >> > Acked-by: Konstantin Ananyev 
> > > > >>
> > > > >> Applied, thanks
> > > > >
> > > > > There was ongoing discussion on this
> > > > > http://dpdk.org/ml/archives/dev/2016-July/044168.html
> > > >
> > > > Sorry Jerin, I forgot this email.
> > > > The problem is that nobody replied to your email and you did not
> > > > nack the v2 of this patch.
> >
> > It's probably my bad.
> > I acked the patch before Jerin's response, and forgot to reply later.
> >
> > > >
> > > > > This change may not be required, as it has a performance impact.
> > > >
> > > > We need to clearly understand what the performance impact is
> > > > (numbers and use cases) on one hand, and whether a real bug is fixed
> > > > by this patch on the other hand.
> > >
> > > IMHO, there is no real bug here. rte_smp_rmb() provides the
> > > LOAD-STORE barrier to make sure the tail pointer WRITE happens only after
> > > the prior LOADS.
> >
> > Yep, from what I read at the link Jerin provided, indeed it seems 
> > rte_smp_rmb() is enough for the arm arch here...
> > For ppc, as I can see, both rte_smp_rmb()/rte_smp_wmb() emit the same 
> > instruction.
> >
> > >
> > > Thoughts?
> >
> > Wonder how big the performance impact is?
>
> With this change we need to wait for additional STORES to be completed to
> the local buffer, in addition to the LOADS from the ring buffer memory.

I understand that, just wonder did you see any real performance difference?
Probably with ring_perf_autotest/mempool_perf_autotest or something?
Konstantin 

> 
> > If there is a real one, I suppose we can revert the patch?
> 
> Request to revert this one, as there are no benefits for other architectures, and
> indeed it creates additional delay in waiting for the STORES to complete
> on ARM.
> Let's do the correct thing by reverting it.
> 
> Jerin
> 
> 
> 
> > Konstantin
> >
> > >
> > > >
> > > > Please guys make things clear and we'll revert if needed.


[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-23 Thread Thomas Monjalon
2016-07-23 8:05 GMT+02:00 Jerin Jacob :
> On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
>> > > Consumer queue dequeuing must be guaranteed to be done fully before the 
>> > > tail is updated. This is not guaranteed with a read barrier,
>> > > changed to a write barrier just before tail update which in practice 
>> > > guarantees correct order of reads and writes.
>> > >
>> > > Signed-off-by: Juhamatti Kuusisaari 
>> >
>> > Acked-by: Konstantin Ananyev 
>>
>> Applied, thanks
>
> There was ongoing discussion on this
> http://dpdk.org/ml/archives/dev/2016-July/044168.html

Sorry Jerin, I forgot this email.
The problem is that nobody replied to your email and you did not nack
the v2 of this patch.

> This change may not be required, as it has a performance impact.

We need to clearly understand what the performance impact is
(numbers and use cases) on one hand, and whether a real bug is fixed
by this patch on the other hand.

Please guys make things clear and we'll revert if needed.


[dpdk-dev] [PATCH] timer: fix break list when timer_cb reset running timer

2016-07-23 Thread Thomas Monjalon
2016-07-23 0:14 GMT+02:00 Sanford, Robert :
> Acked-by: Robert Sanford 
>
> I tested the three timer patches with app/test timer_autotest and
> timer_racecond_autotest, and additional private tests.

Thanks Robert.
Are you confident enough to integrate them in the last days of 16.07?
How critical are they?
Should we make an RC5 for them?


[dpdk-dev] [PATCH] lib: change rte_ring dequeue to guarantee ordering before tail update

2016-07-23 Thread Ananyev, Konstantin
Hi lads,

> On Sat, Jul 23, 2016 at 11:02:33AM +0200, Thomas Monjalon wrote:
> > 2016-07-23 8:05 GMT+02:00 Jerin Jacob :
> > > On Thu, Jul 21, 2016 at 11:26:50PM +0200, Thomas Monjalon wrote:
> > >> > > Consumer queue dequeuing must be guaranteed to be done fully
> > >> > > before the tail is updated. This is not guaranteed with a read 
> > >> > > barrier, changed to a write barrier just before tail update which in
> practice guarantees correct order of reads and writes.
> > >> > >
> > >> > > Signed-off-by: Juhamatti Kuusisaari
> > >> > > 
> > >> >
> > >> > Acked-by: Konstantin Ananyev 
> > >>
> > >> Applied, thanks
> > >
> > > There was ongoing discussion on this
> > > http://dpdk.org/ml/archives/dev/2016-July/044168.html
> >
> > Sorry Jerin, I forgot this email.
> > The problem is that nobody replied to your email and you did not nack
> > the v2 of this patch.

It's probably my bad.
I acked the patch before Jerin's response, and forgot to reply later.

> >
> > > This change may not be required, as it has a performance impact.
> >
> > We need to clearly understand what the performance impact is (numbers
> > and use cases) on one hand, and whether a real bug is fixed by this
> > patch on the other hand.
> 
> IMHO, there is no real bug here. rte_smp_rmb() provides the LOAD-STORE
> barrier to make sure the tail pointer WRITE happens only after the prior
> LOADS.

Yep, from what I read at the link Jerin provided, indeed it seems rte_smp_rmb() 
is enough for the arm arch here...
For ppc, as I can see, both rte_smp_rmb()/rte_smp_wmb() emit the same 
instruction.
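
For context, a sketch of what these barriers map to on arm64 in DPDK
(simplified from the arm64 EAL barrier macros; illustrative rather than
verbatim):

    /* arm64 (illustrative): rte_smp_rmb() emits "dmb ishld", which orders
     * earlier loads before later loads and stores, so it already keeps the
     * tail store after the ring reads. rte_smp_wmb() emits "dmb ishst",
     * which orders stores against later stores only, so it instead makes the
     * tail store wait for the copies into the caller's buffer. */
    #define rte_smp_mb()   asm volatile("dmb ish"   : : : "memory")
    #define rte_smp_rmb()  asm volatile("dmb ishld" : : : "memory")
    #define rte_smp_wmb()  asm volatile("dmb ishst" : : : "memory")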

> 
> Thoughts?

Wonder how big the performance impact is?
If there is a real one, I suppose we can revert the patch?
Konstantin 

> 
> >
> > Please guys make things clear and we'll revert if needed.


[dpdk-dev] [dpdk-announce] release candidate 16.07-rc4

2016-07-23 Thread Thomas Monjalon
A new DPDK release candidate is ready for testing:
http://dpdk.org/browse/dpdk/tag/?id=v16.07-rc4

This week was a bit more involved than expected.
Some important bugs have been fixed and require strong
validation. Please test as much as you can.

The next release candidate should hopefully be the last one.
Please consider reading, discussing and validating the deprecation
notices which are waiting to be integrated in RC5 on Wednesday:
http://dpdk.org/dev/patchwork

Thank you