I will reiterate that I am using OFP and ODP in ways that were not originally intended, and the problems I have exposed are to some extent unique to that usage. But the problem with batching is still relevant. I took a look at default_event_dispatcher() in the OFP v2.0 code and noticed that it calls odp_schedule_multi() and blocks until it receives a full burst of events. This is likely to cause the starvation issues Bill mentioned.
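To make the alternative concrete, here is a rough, untested sketch of a dispatch loop that paces itself to a fixed period instead of blocking for a full burst, using only the existing ODP scheduler and time APIs (the period value and the event handling are placeholders, not OFP code); the next paragraph spells out the semantics I have in mind:

    #include <odp_api.h>

    #define MAX_BURST 16

    /* Sketch: poll the scheduler at a fixed rate. If we are behind
     * schedule, grab whatever is available immediately (no starvation);
     * if we are ahead, wait only for the time left in the period instead
     * of blocking until a full burst has accumulated. */
    static void dispatch_at_fixed_rate(uint64_t period_ns)
    {
        odp_event_t ev[MAX_BURST];
        odp_time_t next = odp_time_sum(odp_time_local(),
                                       odp_time_local_from_ns(period_ns));

        while (1) {
            odp_time_t now = odp_time_local();
            uint64_t wait;

            if (odp_time_cmp(now, next) >= 0)
                wait = ODP_SCHED_NO_WAIT;   /* behind schedule: don't block */
            else
                wait = odp_schedule_wait_time(odp_time_diff_ns(next, now));

            int num = odp_schedule_multi(NULL, wait, ev, MAX_BURST);

            for (int i = 0; i < num; i++)
                odp_event_free(ev[i]);      /* placeholder for real event handling */

            next = odp_time_sum(next, odp_time_local_from_ns(period_ns));
        }
    }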
ODP does already have a timeout in ns that can be passed to odp_schedule_multi() to wait for the requested burst, and that could be used here instead. But that may not be the right semantics: you don't want to wait for a full burst to accumulate if an event has been sitting on a queue since the last pass through the schedule call; you want to handle that event more quickly. Instead, the semantics in OFP (or ODP) should be "I want to call odp_schedule_multi() at a specific frequency." If you are calling odp_schedule_multi() slower than your maximum rate, it immediately grabs whatever is available to avoid starvation; if you are calling it too fast, it blocks for the time remaining in the period.

Since Bill is on this chain: an event digest from the scheduler may also be useful to provide to upper layers, so they can query how much work is available and decide whether to call the full schedule routine. I am not sure that is really possible with polling pktio, but it would be nice to have.

Thanks,
Geoffrey Blake

On 1/25/17, 1:48 PM, "Bill Fischofer" <bill.fischo...@linaro.org> wrote:

On Wed, Jan 25, 2017 at 10:16 AM, Mike Holmes <mike.hol...@linaro.org> wrote:
> Relevant to ODP community CC'ed
>
> On 25 January 2017 at 11:14, Sorin Vultureanu <sorin.vulture...@enea.com> wrote:
>> Hi,
>>
>> It looks like your issue is with the ODP API.
>>
>> IMO odp_schedule_multi() should look like this:
>> odp_schedule_multi() should always return num if wait = ODP_SCHED_WAIT
>> and 0 ... num if wait = ODP_SCHED_NO_WAIT.
>>
>> This would ensure the application can control some batching. Right now I need to implement something above the ODP API to be able to receive a batch and minimize transaction overhead between ODP and the application.

Good suggestion, but perhaps that can be improved further by saying that odp_schedule_multi() waits up to the specified wait time to try to return num events. That way a small wait time could be specified to attempt to eliminate "burstiness" in the arrival rate even if we don't want to wait indefinitely (to avoid excessive latency issues).

Allowing an indefinite wait time to wait forever for num events when num > 1 doesn't seem like a good idea, as that invites starvation and severe latency issues when dealing with slower event arrival rates.

Note also that not every ODP implementation is capable of returning more than 1 event per schedule call, which is why we've always allowed odp_schedule_multi() to return a single event even if more might be theoretically available. Applications should always be prepared to deal with the case of returned events < num even with this change; however, on some platforms I can see how this might be beneficial.

I'll add this to Monday's ODP ARCH call agenda.

>> BR,
>> Sorin
>>
>> From: openfastpath [mailto:openfastpath-boun...@list.openfastpath.org] On Behalf Of Geoffrey Blake
>> Sent: Wednesday, January 25, 2017 3:54 PM
>> To: Peltonen, Janne (Nokia - FI/Espoo) <janne.pelto...@nokia.com>; Bogdan Pricope <bogdan.pric...@enea.com>; openfastp...@list.openfastpath.org
>> Cc: nd <n...@arm.com>
>> Subject: Re: [openfastpath] Performance issues found in OFP 2.0 upon integration into Memcached
>>
>> Hi Janne,
>>
>> My workload is not stressing the full bandwidth of the NIC. I am more interested in testing the behavior of the system when it is not under full load, to understand what kind of latency outliers exist. These outliers are critical to understand for scale-out workloads in the data center, since such workloads rarely utilize full bandwidth (there is more compute per message), and the outliers affect the performance of the whole scale-out service. From what I've been experimenting with (memcached), the OFP stack does not behave very well under less than full load when I try to tune the load so that the application meets a service level objective (I see large outlier latencies compared to Linux, and they get worse as load decreases, which is unexpected). This behavior does appear to be linked to the eagerness of the polling loop to process a packet as soon as it arrives instead of trying to batch packets the way the current kernel stack does.
>>
>> I do think that some form of SW GRO could definitely help in the presence of many flows for general-purpose server applications.
>>
>> Thanks,
>> Geoffrey Blake
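Regarding the batching point above: one way to approximate it without changing the ODP API is a small wrapper that keeps calling odp_schedule_multi() with a bounded wait until it has gathered a full burst or hit a deadline, so light load still produces batches without unbounded latency. A rough, untested sketch (the burst size and deadline are illustrative parameters, nothing OFP or ODP defines today):

    #include <odp_api.h>

    /* Sketch of a batching wrapper above the ODP API: gather up to 'num'
     * events, waiting at most 'deadline_ns' in total, so that a trickle of
     * events is still handed to the stack in batches without risking
     * unbounded latency. */
    static int schedule_batch(odp_event_t ev[], int num, uint64_t deadline_ns)
    {
        odp_time_t deadline = odp_time_sum(odp_time_local(),
                                           odp_time_local_from_ns(deadline_ns));
        int got = 0;

        while (got < num) {
            odp_time_t now = odp_time_local();

            if (odp_time_cmp(now, deadline) >= 0)
                break;                       /* out of time, return what we have */

            int n = odp_schedule_multi(NULL,
                                       odp_schedule_wait_time(
                                           odp_time_diff_ns(deadline, now)),
                                       &ev[got], num - got);
            if (n <= 0)
                break;                       /* scheduler timed out */
            got += n;
        }

        return got;
    }

A latency-sensitive caller would keep the deadline in the tens of microseconds, while something close to Sorin's "always return num" semantics falls out of passing a very long deadline.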
>> ________________________________
>>
>> From: Peltonen, Janne (Nokia - FI/Espoo) <janne.pelto...@nokia.com>
>> Sent: Wednesday, January 25, 2017 3:22 AM
>> To: Geoffrey Blake; Bogdan Pricope; openfastp...@list.openfastpath.org
>> Cc: nd
>> Subject: RE: [openfastpath] Performance issues found in OFP 2.0 upon integration into Memcached
>>
>> Hi Geoffrey,
>>
>>> During high load, the polling loop that is getting packets from DPDK was routinely only getting up to 4 packets (and many times only 1) before invoking OFP and polling again,
>>
>> After quickly looking at the code, it seems to me that odp-dpdk reads packets from the DPDK PMDs in batches of up to 16 packets, and I would expect that batch size to be reached when the stack as a whole reaches its maximum processing capacity.
>>
>> The number of packets scheduled at a time by odp_schedule_multi() is another thing and can be smaller than the batch size in the PMD polling.
>>
>> Which one are you referring to? If the PMD polling does not often have a batch size of 16 packets or close to that, then I would guess the stack (or at least the I/O part) is not truly near its maximum processing capacity. Maybe the TCP layer does something stupid, or maybe there is enough packet loss or delay that you reach max TCP throughput in your test before really reaching the max processing capacity?
>>
>>> and if the stack would benefit from a technique like GRO in the Linux kernel.
>>
>> HW-assisted GRO or such would need support in ODP. Without that, OFP cannot do much (some sort of SW GRO before entering the TCP layer could be possible, but I doubt it would make sense).
>>
>> Janne
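On the PMD batching point: for the direct (non-scheduled) pktin mode discussed further down the thread, ODP already has a timed receive that could be used to let a fuller burst build up at the I/O level before handing it to the stack. A rough, untested sketch; the burst size, timeout, and the process_batch() hand-off are placeholders, not OFP code:

    #include <odp_api.h>

    #define RX_BURST 16

    /* Placeholder for handing a batch of packets to the stack. */
    static void process_batch(odp_packet_t pkts[], int num)
    {
        odp_packet_free_multi(pkts, num);
    }

    /* Sketch of a direct-mode receive loop: instead of spinning and passing
     * 1-4 packets at a time to the stack, wait up to a small timeout so the
     * PMD can hand over a fuller burst, then process the whole batch. */
    static void rx_loop(odp_pktin_queue_t pktin, uint64_t max_wait_ns)
    {
        odp_packet_t pkts[RX_BURST];

        while (1) {
            int num = odp_pktin_recv_tmo(pktin, pkts, RX_BURST,
                                         odp_pktin_wait_time(max_wait_ns));
            if (num > 0)
                process_batch(pkts, num);
        }
    }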
>>> -----Original Message-----
>>> From: Geoffrey Blake [mailto:geoffrey.bl...@arm.com]
>>> Sent: Wednesday, January 25, 2017 12:53 AM
>>> To: Bogdan Pricope <bogdan.pric...@enea.com>; Peltonen, Janne (Nokia - FI/Espoo) <janne.pelto...@nokia.com>; openfastp...@list.openfastpath.org
>>> Cc: nd <n...@arm.com>
>>> Subject: Re: [openfastpath] Performance issues found in OFP 2.0 upon integration into Memcached
>>>
>>> Hi Bogdan, Janne,
>>>
>>> I did a little more digging on my own to understand some of the performance issues I was seeing compared to the Linux stack and thought the mailing list would find this interesting as well. I took a look at the differences in the amount of work each stack is routinely given and noticed that OFP processes fewer packets per invocation than the Linux stack under any load. I ran an experiment with only 1 core and 1 flow to simplify the environment. During high load, the polling loop that is getting packets from DPDK was routinely only getting up to 4 packets (and many times only 1) before invoking OFP and polling again, whereas the Linux stack appears to batch 8-16 packets before sending them on to the stack, amortizing the cost of the network stack processing. This is something I would likely need to fix for my use case, but I would offer a suggestion to the community: consider slowing down the polling loop so that packets are batched before they are processed.
>>>
>>> Another experiment I ran was varying the number of flows, and I noticed that OFP seems to suffer slowdowns at a higher rate than the Linux stack when it has to multiplex an increasing number of flows through the stack. Unfortunately I was not able to get good performance data on this, but I wonder if the TCP connection metadata is not organized optimally in OFP, and if the stack would benefit from a technique like GRO in the Linux kernel.
>>>
>>> Thanks,
>>> Geoffrey Blake
>>>
>>> On 1/23/17, 3:57 AM, "Bogdan Pricope" <bogdan.pric...@enea.com> wrote:
>>>
>>> Hi Janne,
>>>
>>> I sent a patch with the third mode (scheduler_rss) to the OFP mailing list. As I mentioned in the patch, I could not see a clear performance increase with the current OFP implementation, but it can be useful as a testing tool when we improve performance. The fourth mode (single queue, direct mode, and thread safe) can be added later if we think it can show a performance benefit.
>>>
>>> BR,
>>> Bogdan
>>>
>>> > -----Original Message-----
>>> > From: Peltonen, Janne (Nokia - FI/Espoo) [mailto:janne.pelto...@nokia.com]
>>> > Sent: Tuesday, January 10, 2017 3:59 PM
>>> > To: Bogdan Pricope <bogdan.pric...@enea.com>; Geoffrey Blake <geoffrey.bl...@arm.com>; openfastp...@list.openfastpath.org
>>> > Cc: nd <n...@arm.com>
>>> > Subject: RE: [openfastpath] Performance issues found in OFP 2.0 upon integration into Memcached
>>> >
>>> > Hi,
>>> >
>>> > > Bogdan:
>>> > > There is no confusion here: webserver2 works in two modes:
>>> > > 1) Default mode: scheduler mode, single atomic queue, no RSS - this is because RSS and/or multiple input queues are not supported by all platforms
>>> > > 2) Direct mode, multiple pktins, RSS with hashing by TCP.
>>> >
>>> > I was just thinking that it may give the wrong impression that with direct packet input mode you must have multiple input queues, and vice versa, which can confuse the reader.
>>> >
>>> > > You are saying we should have a third mode: scheduler mode, multiple atomic queues, RSS by TCP?
>>> >
>>> > That is one possibility. Then there could be a fourth mode for a single queue in direct mode. And if queued I/O gets added, one would need even more modes.
>>> >
>>> > I think it would be clearer to keep separate things separate and have one parameter for the packet input mode (scheduled, direct, (queued)) and another one for the number of queues (e.g. 1, or #worker threads).
>>> >
>>> > Janne
>
> --
> Mike Holmes
> Program Manager - Linaro Networking Group
> Linaro.org │ Open source software for ARM SoCs
> "Work should be fun and collaborative, the rest follows"
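Coming back to Janne's point at the bottom about keeping the input mode and the number of queues as separate parameters: that would boil down to two independent knobs, roughly along these lines (the names are illustrative only, not existing webserver2 or OFP options):

    /* Sketch of the two independent configuration knobs suggested above. */
    enum pktin_mode {
        PKTIN_MODE_SCHED,    /* scheduled input */
        PKTIN_MODE_DIRECT    /* direct input; a queued mode could be added later */
    };

    struct rx_config {
        enum pktin_mode mode;  /* how packets are received */
        int num_queues;        /* 1, or one per worker thread */
    };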