From: ext Bill Fischofer [mailto:bill.fischo...@linaro.org]
Sent: Thursday, August 27, 2015 3:16 AM
To: Savolainen, Petri (Nokia - FI/Espoo)
Cc: LNG ODP Mailman List
Subject: Re: [lng-odp] [ARCH] Order Resolution APIs

I've posted v3 of the scheduler/ordered queue patch incorporating the changes 
we agreed to during today's ARCH call.  I've also extended 
test/validation/scheduler to add tests for the new Scheduler Group APIs.

There are a few points to note in using these APIs that I discovered in writing 
the validation tests:

1. Because odp_queue_enq() and odp_queue_enq_multi() now sustain order, the program 
is running in an ordered context until it issues another odp_schedule(), 
odp_schedule_multi(), or odp_schedule_release_order() call.  In CUnit this can 
cause problems for tests that get events via the scheduler and then end the 
test.  If a test doesn't add an explicit release call, the next test may find 
itself running in the same ordered context and give unexpected results.  I had 
to add explicit release calls in a couple of tests for this reason.  Existing 
code that uses "ordered queues", which have in fact been implemented as atomic 
queues since the beginning of ODP, may hit similar issues with these semantics, 
so beware.

Since release_atomic() and release_order() are hints, they may start and may 
even finish the context release, but for performance reasons the release is not 
guaranteed to be finished when the function call returns. Only a schedule() 
call which returns zero events guarantees that a thread no longer holds a 
context. So each test case should call schedule() until no more events come out, 
and only then exit or continue to the next test case.

This is all in the spec. Test cases may not have honored it in this detail.
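
A minimal sketch of that drain pattern for a test teardown (assuming the 
standard odp_schedule()/odp_event_free() calls and the proposed 
odp_schedule_release_ordered() hint; the helper name is made up):

static void drain_scheduler(void)
{
        odp_event_t ev;

        /* Hint that ordering is no longer needed; the release may only start here */
        odp_schedule_release_ordered();

        /* Only a schedule() call that returns no event guarantees that the
         * thread no longer holds a context */
        while ((ev = odp_schedule(NULL, ODP_SCHED_NO_WAIT)) != ODP_EVENT_INVALID)
                odp_event_free(ev);
}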


2. Given that odp_schedule_release_order() and odp_schedule_release_atomic() 
both take no arguments and implicitly refer to the current context, I think 
it's simpler to just combine these into a single odp_schedule_release_context() 
call.  This is especially useful for general cleanup code which may not know 
what flavor of context it's running in.  This was discussed before, but I didn't 
want to depart from today's agreements in this patch. However, it would be good 
to get input from other implementations as to whether this would complicate 
things for them. If agreeable, we can make this change as part of ODP v1.4.

Agreed, we could deprecate the specific calls and introduce release_ctx(). For 
the same reasons, I didn't remove release_atomic but added release_order in this 
first phase.

3. It was not at all clear to me how to use the new ordered lock semantics.  
The arguments are there in the linux-generic code, but they are marked 
ODP_UNUSED since the code refers to the current context implicitly rather than 
to any external object.  When I tried to incorporate them into the validation 
test I discovered I had no way of knowing how to issue a proper 
odp_schedule_order_lock_init() call.  The issue is that this call takes both a 
queue and a lock as arguments.  The obvious way to organize locks is to 
declare them at the top of a thread and init them before entering the 
scheduling loop.

Like a spinlock, it's a global resource and you must init it only once, e.g. 
after queue create. The pointer can be stored in the queue context data (just 
like a spinlock pointer). The lock is needed to protect some queue-specific 
shared data (e.g. a sequence counter), so it's not a big deal to save the lock 
pointer along with the data.

-Petri
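
A minimal sketch of the pattern described above (assuming the proposed 
odp_schedule_order_lock_init(lock, queue) signature from the patch; the 
queue-context set/get call names and the flow_ctx_t type with its seq counter 
are only illustrative):

typedef struct {
        odp_schedule_order_lock_t lock;  /* one ordered lock per queue */
        uint64_t seq;                    /* queue-specific shared data */
} flow_ctx_t;

/* Once, right after odp_queue_create(): */
static void init_flow_ctx(odp_queue_t queue, flow_ctx_t *ctx)
{
        odp_schedule_order_lock_init(&ctx->lock, queue);
        ctx->seq = 0;
        odp_queue_context_set(queue, ctx);  /* store the pointer like a spinlock ptr */
}

/* In the worker, after the scheduler has returned 'from': */
static void ordered_update(odp_queue_t from)
{
        flow_ctx_t *ctx = odp_queue_context(from);

        odp_schedule_order_lock(&ctx->lock);
        ctx->seq++;                         /* ordered critical section */
        odp_schedule_order_unlock(&ctx->lock);
}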


void my_worker_thread()
{
        odp_schedule_order_lock_t my_ordered_lock;
        odp_queue_t from;
        ...other declarations

        ...other init code
        odp_schedule_order_lock_init(&my_ordered_lock, ??need an odp_queue_t here--what to use??);

        while (1) {
                ev = odp_schedule(&from, ...);
                odp_schedule_order_lock_init(&my_ordered_lock, from);  /* Syntax good, but is this what we want? */
                ...processing
                odp_schedule_order_lock(&my_ordered_lock);
                ...ordered critical section
                odp_schedule_order_unlock(&my_ordered_lock);
                ...finish up parallel stuff
        }
}

The lock/unlock calls seem straightforward, but that init() call for the lock 
is puzzling.  The lock is going to be associated with the queue that the 
scheduler has selected, however the thread has no way of knowing in advance 
what that queue might be.  If locks were somehow pre-initialized during queue 
creation that imposes its own problems since again how is the worker supposed 
to find the specific lock that was associated with the queue the scheduler 
selected?  Of course this gets even more confusing if we support multiple locks 
per ordered context and expect this sort of tie-in.

Of course, having these "locks" be private to each thread is strange to begin 
with, since multiple threads processing events originating from the same ordered 
queue are looking to synchronize with each other.  So moving the init() call 
into the while() loop so that the ordered lock is re-initialized on every pass 
with the queue is strange.  And passing a globally-initialized lock as an 
argument to the thread doesn't work either, since the lock init() call references 
a specific queue, and again which queue the scheduler will next select is unknown.

This was the reason I defined ordered locks to simply implicitly refer to the 
current ordered context, the same as all of the other APIs that relate to 
ordering do.  Unless I've missed something terribly obvious I think the current 
definition of ordered locks is basically unusable.

Bill


On Wed, Aug 26, 2015 at 6:30 AM, Bill Fischofer 
<bill.fischo...@linaro.org> wrote:
Thanks.  Yes, tm_enq() needs to be upgraded to handle ordering as well.  That's 
an upgrade to the implementation of those APIs that will be done as soon as the 
implementation patches are posted.  Basically, any enq() operation needs to 
hook into the reordering logic if the event being enqueued originated from an 
ordered queue.  The patch as currently posted does this for "normal" target 
queues as well as for PKTOUT queues.  TM queues would need the same 
consideration.

Any of the options you mention can be done easily in SW.  The question is which 
one do we pick to make it easy for implementations to map the ODP APIs 
efficiently to their platforms, especially those that have ordering HW?  That's 
why I wanted feedback from implementers on this question.

While it's always possible to create arbitrarily complex API combinations for 
illustrative purposes, do we have use cases to support such constructs?  Right 
now, the only use cases for order insertion that have been mentioned are packet 
segmentation and multicast.  Are there others that anyone is aware of?  Unless 
such use cases are expected to be common, it would seem to make sense to 
optimize the API semantics for the common case of one event in, one event out, 
and require the use of additional APIs or variants for the exceptional cases.  
That aligns both with the run-to-completion model as well as the performance 
requirements of mainline data plane processing.

The one optimization you mention (unlock an ordered lock and simultaneously 
release the ordered context) is not one I had considered since we haven't heard 
any requirements for this sort of combined semantics.  Does anyone have any 
requirement for such a combined operation?  If not then I'd agree that we not 
create combinations for their own sake.



On Wed, Aug 26, 2015 at 5:57 AM, Savolainen, Petri (Nokia - FI/Espoo) 
<petri.savolai...@nokia.com> wrote:


From: lng-odp [mailto:lng-odp-boun...@lists.linaro.org] On Behalf Of ext Bill Fischofer
Sent: Wednesday, August 26, 2015 12:26 AM
To: LNG ODP Mailman List
Subject: [lng-odp] [ARCH] Order Resolution APIs

We've been discussing the question of when and how ordered events get resolved, 
and I'd like to summarize the pros and cons as well as offer an additional 
suggestion to consider:

When odp_schedule() dispatches an event in an ordered context the system will 
guarantee that downstream queues will see events in the same relative order as 
they appeared on the originating ordered queue.  While most ordered events are 
expected to be processed in a straightforward manner (one event in, one event 
out) by a worker thread, there are two special cases of interest that require 
special consideration.

The first special case is removing an event from an ordered flow (one event in, 
none out).  The most common use case for this scenario is IP fragment reassembly, 
where multiple fragments are received and stored into a reassembly buffer but 
none is emitted until the last fragment completing the packet arrives.

The second special case is inserting one or more events into an ordered flow 
(one event in, multiple events out).  In this case what is desired is that the 
multiple output events should appear in the input event's order on any output 
queue(s) to which they are sent.  The simplest use-case for this scenario is 
packet segmentation where a large packet needs to be segmented for MTU or other 
reasons, or perhaps it is being replicated for multicast purposes.

As currently defined, order is implicitly resolved upon the next call to 
odp_schedule().  Doing this, however,  may be very inefficient on some 
platforms and as a result it is RECOMMENDED that threads running in ordered 
contexts resolve order explicitly whenever possible.

It is not defined how/where event order is resolved (inside enqueue, 
release_order, next schedule call, dequeue from destination queue,  …).

The rules are:

- Order *must* be resolved before events are dequeued from the destination queue

- Ordering is based on the ordering context, and order is maintained as long as 
the thread holds the context

- The context *can* be released after odp_schedule_release_ordered() (the user 
hints that ordering is not needed any more)

- The context *must* be released in the next schedule call, if the thread is 
still holding it


Order can be explicitly resolved via the odp_schedule_release_ordered() API 
that tells the scheduler that the thread no longer requires order to be 
maintained.  Following this call, the thread behaves as if it were running in a 
(normal) parallel context and the thread MUST assume that further enqueues it 
performs until the next odp_schedule() call will be unordered.
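
A minimal sketch of how the explicit release fits the first special case above 
(the reassembly helpers last_fragment() and stash_fragment() are hypothetical):

odp_event_t ev;
odp_queue_t from;

ev = odp_schedule(&from, ODP_SCHED_WAIT);

if (!last_fragment(ev)) {                /* hypothetical predicate */
        stash_fragment(ev);              /* event consumed, nothing enqueued */
        odp_schedule_release_ordered();  /* hint: ordering no longer needed */
        /* ...any further work here runs as a plain parallel context... */
}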

For the first special case (removing events from an ordered flow), releasing 
order explicitly MAY improve performance, depending on what the caller does 
between the odp_schedule_release_ordered() call and its next call to 
odp_schedule().  The more interesting (and apparently controversial) case is 
processing that involves one or more enqueues in an ordered context.

odp_schedule_release_ordered() may improve performance in all cases. It tells 
the implementation that:

- All enqueues (and potentially other operations that need ordering, like 
tm_enqueue) are done

- The last enq operation was "the last"

- The remaining order locks will not be called

- In general, all serialization / synchronization for this context can now be 
freed



In some implementations, ordering is maintained as part of the scheduler while 
in others it is maintained as part of the queueing system.  Especially in 
systems that use HW assists for ordering, this can have meaningful performance 
implications.  In such systems, it is highly desirable that it be known at 
enqueue time whether or not this is the final enqueue that the caller will make 
in the current ordered context.

Example 1:
-------------

enq()
enq()
enq()   // <= final enqueue
release_ordered()

Example 2:
-------------
enq()

if(1)
  enq()  // <= final enqueue
else
  enq()

if(0)
  enq()

release_ordered()


Example 3:
-------------
enq()

order_lock()
order_unlock()

if(1)
  tm_enq()  // <= final operation which needs ordering
else
  tm_enq()

if(0) {
  order_lock()
  order_unlock()
  enq()
}

release_ordered()


If an implementation needs to identify the last enqueue, it needs to hold back 
the latest enqueue until the next enq or release_ordered() call. As soon as it 
sees release_ordered(), it knows which enqueue (or other operation requiring 
ordering) was the final one.
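
A rough sketch of that bookkeeping on the implementation side (everything here 
is made up for illustration; hw_enq_sustain(), hw_enq_final() and 
hw_release_order() stand in for whatever the platform actually provides):

typedef struct {
        odp_queue_t queue;
        odp_event_t ev;
        int pending;
} deferred_enq_t;

static __thread deferred_enq_t deferred;     /* per-thread stash */

static void impl_enq(odp_queue_t queue, odp_event_t ev)
{
        if (deferred.pending)                /* previous enq was not the last */
                hw_enq_sustain(deferred.queue, deferred.ev);

        deferred.queue = queue;              /* hold this one until we know */
        deferred.ev = ev;
        deferred.pending = 1;
}

static void impl_release_ordered(void)
{
        if (deferred.pending) {              /* now known to be the final enq */
                hw_enq_final(deferred.queue, deferred.ev);
                deferred.pending = 0;
        }
        hw_release_order();
}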


There are several approaches that can be taken to provide this information.  
One is to provide an additional parameter to odp_queue_enq() and 
odp_queue_enq_multi() that explicitly says whether this is the final enqueue 
for a given ordered context or not.  Another approach is to have two separate 
APIs that are used to indicate whether an enqueue should or should not resolve 
order as part of its operation.

In the v2 version of the Ordering patch I've posted, the latter approach is 
used. The semantics of odp_queue_enq() and odp_queue_enq_multi() are extended 
to include order resolution as part of their processing when used in an ordered 
context.  For the (relatively rare) cases in which multiple enqueues are 
needed, the corresponding odp_queue_enq_sustain() and 
odp_queue_enq_multi_sustain() APIs are provided.  These simply perform ordered 
enqueues without releasing the ordered context.
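
With those semantics, the one-in/many-out segmentation case would look roughly 
like this (sketch only; out_queue, seg[] and num_seg are hypothetical):

int i;

/* All but the last segment keep the ordered context alive */
for (i = 0; i < num_seg - 1; i++)
        odp_queue_enq_sustain(out_queue, seg[i]);

/* The plain enqueue of the last segment also resolves order */
odp_queue_enq(out_queue, seg[num_seg - 1]);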

The argument for having a new API to sustain order is that it is expected that 
most processing involves a single event in and single event out, so a different 
API call is needed only for the exceptional case.  So to do multiple enqueues 
in a single ordered context the application would issue odp_queue_enq_sustain() 
one or more times followed by a final odp_queue_enq() call.  In the (common) 
case where the application is only issuing one enqueue, then odp_queue_enq() 
would be used as is the case today.

Yes, commonly an application would enqueue one or a few events, but it may not 
be simple, or known in advance, which operations and how many of them the 
application is going to perform. See example 3 above.

Another approach would be to have odp_queue_enq() and odp_queue_enq_multi() 
sustain order by default and introduce new APIs odp_queue_enq_final() and 
odp_queue_enq_multi_final() as alternates that would be used to resolve order 
as part of the enqueue operation explicitly.  In that case to enqueue multiple 
events in an ordered context the application would issue one or more 
odp_queue_enq() calls followed by an odp_queue_enq_final() call.  In the 
(common) case where the application is only issuing one enqueue, then 
odp_queue_enq_final() would be used.

I'm suggesting that these optimizations would be added later on (if needed):

odp_queue_enq_release_ctx() {
  odp_queue_enq()
  release_context()
}

odp_queue_enq_multi_release_ctx() {
  odp_queue_enq_multi()
  release_context()
}

odp_tm_enq_release_ctx() {
  odp_tm_enq()
  release_context()
}

odp_schedule_order_unlock_release_ctx() {
  odp_schedule_order_unlock()
  release_context()
}

The point is to make it very explicit when the scheduling context (atomic or 
ordered) is released, and possibly to extend the optimization to other calls 
which need ordering.


It's really a question of which approach is more convenient for the application 
writer, so I'd like to get some feedback on that question.

See above ☺

A third possibility would be to introduce both sustain and final variants of 
odp_queue_enq() and allow the default behavior of the "unqualified" versions of 
these APIs to be specified via another API, e.g., 
odp_queue_enq_order_set(ODP_SUSTAIN | ODP_RESOLVE) and (probably) a 
corresponding "getter" function odp_queue_enq_order() that would return the 
default behavior.  Applications could always explicitly call 
odp_queue_enq_sustain() or odp_queue_enq_resolve() to be independent of this 
setting if they wish.

Another optimization could be the introduction of explicitly non-ordered calls, 
for use when the application is not ready to drop the ordered context yet but 
could allow some enqueues to be delivered out of order.


odp_queue_enq_no_sync() (or something similar) would not maintain ordering even 
when called from an ordered context and thus could be faster than a normal 
enqueue (from an ordered context).

-Petri

Any of these options are trivially doable by extending the current patch, so 
I'd like some discussion of what would be the preferred approach from both 
application writers' and platform implementers' perspectives.  Let's get some 
views posted here and we can also discuss during tomorrow's ARCH call.

Thanks.

Bill




