I will reiterate that I am using OFP and ODP in ways that were not originally
intended, and the problems I have exposed are to some extent unique to that
usage.  But the problem with batching is still relevant.  I took a look at
default_event_dispatcher() in the OFP v2.0 code and noticed that it calls
odp_schedule_multi() and blocks until it receives the full burst size.  This
will likely cause the starvation issues Bill mentioned.
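
To make the concern concrete, here is a rough sketch of the pattern I mean
(this is not the actual OFP dispatcher code; EVENT_BURST and process_events()
are just placeholders):

#include <odp_api.h>

#define EVENT_BURST 32                          /* placeholder burst size */

extern void process_events(odp_event_t ev[], int num); /* app handler, placeholder */

static void dispatch_block_for_burst(void)
{
        odp_event_t events[EVENT_BURST];
        int held = 0;

        while (1) {
                /* Blocks until at least one event is available. */
                int n = odp_schedule_multi(NULL, ODP_SCHED_WAIT,
                                           &events[held], EVENT_BURST - held);
                held += n;

                /* Nothing is handed to the stack until a full burst has been
                 * collected, so at low arrival rates an event can sit here
                 * for a long time. */
                if (held < EVENT_BURST)
                        continue;

                process_events(events, held);
                held = 0;
        }
}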

ODP already allows a timeout in ns to be specified for odp_schedule_multi() to
wait for the requested burst, and that could be used here instead.  But that
is probably not the right semantics either: you don't want to wait for a full
burst to accumulate if an event has been sitting on a queue since the last
pass through the schedule call; that event should be handled more quickly.
Instead, the semantics in OFP (or ODP) should be "I want to call
odp_schedule_multi() at a specific frequency".  If you are calling
odp_schedule_multi() slower than your maximum rate, it immediately grabs
whatever is available to avoid starvation; if you are calling it too fast, it
blocks for the time remaining in the period.
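
A rough sketch of that fixed-frequency idea, layered on top of the existing
API (SCHED_PERIOD_NS, EVENT_BURST and process_events() are placeholders; the
ODP scheduler and time calls themselves exist today):

#include <odp_api.h>

#define SCHED_PERIOD_NS (100 * 1000ULL)         /* placeholder target period */
#define EVENT_BURST     32

extern void process_events(odp_event_t ev[], int num); /* app handler, placeholder */

static void dispatch_at_fixed_rate(void)
{
        odp_event_t events[EVENT_BURST];
        odp_time_t last = odp_time_local();

        while (1) {
                uint64_t elapsed =
                        odp_time_to_ns(odp_time_diff(odp_time_local(), last));

                /* Calling too fast: block for the time left in the period so
                 * that events can accumulate into a batch. */
                if (elapsed < SCHED_PERIOD_NS)
                        odp_time_wait_ns(SCHED_PERIOD_NS - elapsed);

                /* At or past the period boundary: grab whatever is available
                 * right now, so no event waits longer than one period. */
                int n = odp_schedule_multi(NULL, ODP_SCHED_NO_WAIT,
                                           events, EVENT_BURST);
                last = odp_time_local();

                if (n > 0)
                        process_events(events, n);
        }
}

Something along these lines could live in default_event_dispatcher(), or be
pushed down into the scheduler itself.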

Since Bill is on this chain: an event digest from the scheduler may also be
useful to provide to upper layers, so they can query how much work is
available and decide whether to call the full schedule routine.  I am not
sure that is really possible with polling pktio, but it would be nice to
have.

Thanks,
Geoffrey Blake

On 1/25/17, 1:48 PM, "Bill Fischofer" <bill.fischo...@linaro.org> wrote:

    On Wed, Jan 25, 2017 at 10:16 AM, Mike Holmes <mike.hol...@linaro.org> wrote:
    > Relevant to ODP community CC'ed
    >
    > On 25 January 2017 at 11:14, Sorin Vultureanu <sorin.vulture...@enea.com> wrote:
    >> Hi,
    >>
    >>
    >>
    >> It looks like your issue is with ODP API
    >>
    >>
    >>
    >> IMO odp_schedule_multi() should look like this:
    >>
    >> odp_schedule_multi() should always return num if wait = ODP_SCHED_WAIT
    >>
    >> And 0 ... num if wait = ODP_SCHED_NO_WAIT.
    >>
    >>
    >>
    >> This would ensure application can control some batching. Right now I need to
    >> implement something above the ODP API to be able to receive a batch and
    >> minimize transaction overhead between ODP and APP.
    
    Good suggestion, but perhaps that can be improved further by saying
    that odp_schedule_multi() waits up to the specified wait time to try
    to return num events. That way a small wait time could be specified to
    attempt to eliminate "burstiness" in arrival rate even if we don't
    want to wait indefinitely to avoid excessive latency issues. Allowing
    an indefinite wait time to wait forever for num events when num > 1
    doesn't seem like a good idea as that invites starvation and severe
    latency issues when dealing with slower event arrival rates.
    
    Note also that not every ODP implementation is capable of returning
    more than 1 event per schedule call, which is why we've always allowed
    odp_schedule_multi() to return a single event even if more might be
    theoretically available. Applications should always be prepared to
    deal with the case of number of returned events < num even with this
    change, however on some platforms I can see how this might be
    beneficial.
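    
    As a rough sketch, that semantic could even be prototyped today as a thin
    wrapper above the existing API (schedule_multi_wait() is an illustrative
    name, not a proposed ODP symbol):
    
    #include <odp_api.h>
    
    static int schedule_multi_wait(odp_queue_t *from, uint64_t wait_ns,
                                   odp_event_t events[], int num)
    {
            /* Collect up to 'num' events, waiting at most 'wait_ns' overall.
             * Works even if the implementation returns one event per call.
             * Note: *from only reflects the queue of the last inner call. */
            odp_time_t deadline = odp_time_sum(odp_time_local(),
                                               odp_time_local_from_ns(wait_ns));
            int total = 0;
    
            while (total < num) {
                    odp_time_t now = odp_time_local();
    
                    if (odp_time_cmp(deadline, now) <= 0) {
                            /* Deadline reached: one last non-blocking sweep. */
                            total += odp_schedule_multi(from, ODP_SCHED_NO_WAIT,
                                                        &events[total],
                                                        num - total);
                            break;
                    }
    
                    /* Wait at most the remaining time for more events. */
                    uint64_t left = odp_time_to_ns(odp_time_diff(deadline, now));
                    int n = odp_schedule_multi(from,
                                               odp_schedule_wait_time(left),
                                               &events[total], num - total);
                    if (n == 0)
                            break;  /* wait expired, nothing more to take */
                    total += n;
            }
    
            return total;
    }
    
    One caveat: with atomic queues, each subsequent schedule call presumably
    releases the previous atomic context, so doing this natively in the
    scheduler would be preferable to a wrapper above it.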
    
    I'll add this to Monday's ODP ARCH call agenda.
    
    >>
    >>
    >> BR,
    >>
    >> Sorin
    >>
    >>
    >>
    >> From: openfastpath [mailto:openfastpath-boun...@list.openfastpath.org] On
    >> Behalf Of Geoffrey Blake
    >> Sent: Wednesday, January 25, 2017 3:54 PM
    >> To: Peltonen, Janne (Nokia - FI/Espoo) <janne.pelto...@nokia.com>; Bogdan
    >> Pricope <bogdan.pric...@enea.com>; openfastp...@list.openfastpath.org
    >>
    >>
    >> Cc: nd <n...@arm.com>
    >> Subject: Re: [openfastpath] Performance issues found in OFP 2.0 upon
    >> integration into Memcached
    >>
    >>
    >>
    >> Hi Janne,
    >>
    >>
    >>
    >> My workload is not stressing the full bandwidth of the NIC.  I am more
    >> interested in testing the behavior of the system when not under full load to
    >> understand what type of latency outliers exist.  These outliers are critical
    >> to understand for scale-out workloads in the data-center since they rarely
    >> utilize full bandwidth (there is some more compute per message), and these
    >> outliers affect the performance of the whole scale-out service.  From what
    >> I've been experimenting with (memcached), the OFP stack does not behave very
    >> well under less than full load when I try to tune the load to make the
    >> application meet a service level objective (I see large outlier latencies
    >> compared to Linux and they get worse as load decreases which is unexpected).
    >> This behavior does appear linked to the eagerness of the polling loop to
    >> process a packet as soon as it arrives instead of trying to batch some like
    >> the current kernel stack.
    >>
    >>
    >>
    >> I do think that some form of sw GRO could definitely help in the presence of
    >> many flows for general purpose server applications.
    >>
    >>
    >>
    >> Thanks,
    >>
    >> Geoffrey Blake
    >>
    >>
    >>
    >> ________________________________
    >>
    >> From: Peltonen, Janne (Nokia - FI/Espoo) <janne.pelto...@nokia.com>
    >> Sent: Wednesday, January 25, 2017 3:22 AM
    >> To: Geoffrey Blake; Bogdan Pricope; openfastp...@list.openfastpath.org
    >> Cc: nd
    >> Subject: RE: [openfastpath] Performance issues found in OFP 2.0 upon
    >> integration into Memcached
    >>
    >>
    >>
    >> Hi Geoffrey,
    >>
    >>> During high load, the polling loop that is getting packets from
    >>> DPDK was routinely only getting up to 4 packets (and many times
    >>> only 1) before invoking OFP and polling again,
    >>
    >> After quickly looking at the code, it seems to me that odp-dpdk
    >> reads packets from DPDK PMDs in batches of up to 16 packets and
    >> I would expect that batch size to be reached when the stack as
    >> a whole reaches its maximum processing capacity.
    >>
    >> The number of packets scheduled at a time by odp_schedule_multi()
    >> is another thing and can be smaller than the batch size in the
    >> PMD polling.
    >>
    >> Which one are you referring to? If the PMD polling does not often
    >> have batch size of 16 packets or near that, then I would guess
    >> the stack (or at least the I/O part) is not truly near its
    >> maximum processing capacity. Maybe the TCP layer does something
    >> stupid or maybe there is enough packet loss or delays so that
    >> you reach max TCP throughput in your test before really reaching
    >> the max processing capacity?
    >>
    >>> and if the stack would benefit from a technique like GRO in
    >>> the Linux kernel.
    >>
    >> HW assisted GRO or such would need support in the ODP. Without
    >> that OFP cannot do much (some sort of SW GRO before entering
    >> TCP layer could be possible but I doubt it would make sense).
    >>
    >>         Janne
    >>
    >>> -----Original Message-----
    >>> From: Geoffrey Blake [mailto:geoffrey.bl...@arm.com]
    >>> Sent: Wednesday, January 25, 2017 12:53 AM
    >>> To: Bogdan Pricope <bogdan.pric...@enea.com>; Peltonen, Janne (Nokia -
    >>> FI/Espoo)
    >>> <janne.pelto...@nokia.com>; openfastp...@list.openfastpath.org
    >>> Cc: nd <n...@arm.com>
    >>> Subject: Re: [openfastpath] Performance issues found in OFP 2.0 upon
    >>> integration into Memcached
    >>>
    >>> Hi Bogdan, Janne
    >>>
    >>> I did a little more digging on my own to understand some of the
    >>> performance issues I was seeing
    >>> compared to the linux stack and thought the mailing list would find this
    >>> interesting as well.  I
    >>> took a look at the differences between the amount of work each stack is
    >>> routinely given and
    >>> noticed that OFP processes fewer packets per invocation than the Linux
    >>> stack under any load. I
    >>> ran an experiment with only 1 core and flow to simplify the environment.
    >>> During high load, the
    >>> polling loop that is getting packets from DPDK was routinely only getting
    >>> up to 4 packets (and
    >>> many times only 1) before invoking OFP and polling again, whereas the
    >>> Linux stack appears to
    >>> batch from 8-16 packets before sending them on to the stack, amortizing
    >>> the cost of the network
    >>> stack processing.   This is something for my use case I would likely need
    >>> to fix, but I would
    >>> offer as a suggestion to the community to consider slowing down the
    >>> polling loop to make sure to
    >>> batch packets before processing them.
    >>>
    >>> Another experiment I ran was varying the number of flows and I noticed
    >>> that OFP seems to suffer
    >>> slow downs at a higher rate than the Linux stack when it has to multiplex
    >>> an increasing number
    >>> of flows through the stack.  I was not able to get any good performance
    >>> data on this
    >>> unfortunately, but I wonder if the TCP connection meta-data is not
    >>> organized optimally in OFP,
    >>> and if the stack would benefit from a technique like GRO in the Linux
    >>> kernel.
    >>>
    >>> Thanks,
    >>> Geoffrey Blake
    >>>
    >>> On 1/23/17, 3:57 AM, "Bogdan Pricope" <bogdan.pric...@enea.com> wrote:
    >>>
    >>>     Hi Janne,
    >>>
    >>>     I sent to OFP mailing list a patch with the third mode
    >>> (scheduler_rss).
    >>>     As I mentioned in the patch, I could not see a clear performance
    >>> increase with the current OFP implementation but can be useful as testing
    >>> tool when we will improve performance.
    >>>     The fourth mode (single queue, direct mode, and thread safe) can be
    >>> added later if we think can show performance.
    >>>
    >>>     BR,
    >>>     Bogdan
    >>>
    >>>     > -----Original Message-----
    >>>     > From: Peltonen, Janne (Nokia - FI/Espoo)
    >>>     > [mailto:janne.pelto...@nokia.com]
    >>>     > Sent: Tuesday, January 10, 2017 3:59 PM
    >>>     > To: Bogdan Pricope <bogdan.pric...@enea.com>; Geoffrey Blake
    >>>     > <geoffrey.bl...@arm.com>; openfastp...@list.openfastpath.org
    >>>     > Cc: nd <n...@arm.com>
    >>>     > Subject: RE: [openfastpath] Performance issues found in OFP 2.0 upon
    >>>     > integration into Memcached
    >>>     >
    >>>     >
    >>>     > Hi,
    >>>     >
    >>>     > > Bogdan:
    >>>     > > There is no confusion here: webserver2 works in two modes:
    >>>     > > 1) Default mode: scheduler mode, single atomic queue, no RSS - this is
    >>>     > > because RSS and/or multiple input queues are not supported by all
    >>>     > > platforms
    >>>     > > 2) Direct mode, multiple pktins, RSS with hashing by TCP.
    >>>     >
    >>>     > I was just thinking it may give the wrong impression that with direct
    >>>     > packet input mode you must have multiple input queues and vice versa,
    >>>     > which can confuse the reader.
    >>>     >
    >>>     > > You are saying we should have a third mode: scheduler mode, multiple
    >>>     > > atomic queues, RSS by TCP?
    >>>     >
    >>>     > That is one possibility. Then there could be fourth mode for single
    >>>     > queue in direct mode. And if queued I/O gets added, one would need
    >>>     > even more modes.
    >>>     >
    >>>     > I think it would be more clear to keep separate things separate and
    >>>     > have one parameter for the packet input mode (scheduled, direct,
    >>>     > (queued)) and another one for the number of queues (e.g. 1, #worker
    >>>     > threads).
    >>>     >
    >>>     >  Janne
    >>>     >
    >>>
    >>>
    >
    >
    >
    > --
    > Mike Holmes
    > Program Manager - Linaro Networking Group
    > Linaro.org │ Open source software for ARM SoCs
    > "Work should be fun and collaborative, the rest follows"
    
