On 06/26/2018 12:40 PM, Radu Nicolau wrote:
> From: Liang Ma <liang.j...@intel.com>
>
> 1. Abstract
>
> For packet processing workloads such as DPDK polling is continuous.
> This means CPU cores always show 100% busy independent of how much work
> those cores are doing. It is critical to accurately determine how busy
> a core is hugely important for the following reasons:
>
> * No indication of overload conditions
>
> * User do not know how much real load is on a system meaning resulted in
> wasted energy as no power management is utilized
>
> Tried and failed schemes include calculating the cycles required from
> the load on the core, in other words the busyness. For example,
> how many cycles it costs to handle each packet and determining the
> frequency cost per core. Due to the varying nature of traffic, types of
> frames and cost in cycles to process, this mechanism becomes complex
> quickly where a simple scheme is required to solve the problems.
>
> 2. Proposed solution
>
> For all polling mechanism, the proposed solution focus on how many times
> empty poll executed instead of calculating how many cycles it cost to
> handle each packet. The less empty poll number means current core is busy
> with processing workload, therefore, the higher frequency is needed. The
> high empty poll number indicate current core has lots spare time,
> therefore, we can lower the frequency.
>
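
To check my reading of section 2: the idea seems to reduce to counting
empty polls per sampling window and turning the empty-poll share into a
frequency hint. A minimal sketch in C follows. The struct, enum and
function names, and the two percentage thresholds, are hypothetical and
chosen for illustration only; none of them are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-core counters for one sampling window (not from the patch). */
struct poll_stats {
	uint64_t empty; /* polls that returned no packets */
	uint64_t total; /* all polls seen in the window */
};

/* Hypothetical decision values; a real library would map these to P-states. */
enum freq_hint { FREQ_LOWER, FREQ_KEEP, FREQ_RAISE };

/* Record the result of one poll, e.g. the packet count of an rx burst. */
static inline void
poll_stats_record(struct poll_stats *s, uint16_t nb_rx)
{
	s->total++;
	if (nb_rx == 0)
		s->empty++;
}

/*
 * Turn the empty-poll share into a hint:
 *   share >= hi_pct -> core is mostly idle, frequency can be lowered
 *   share <= lo_pct -> core is busy, frequency should be raised
 * Both thresholds are assumptions supplied by the caller.
 */
static enum freq_hint
poll_stats_hint(const struct poll_stats *s, unsigned int lo_pct,
		unsigned int hi_pct)
{
	uint64_t pct;

	assert(lo_pct <= hi_pct && hi_pct <= 100);
	if (s->total == 0)
		return FREQ_KEEP; /* nothing measured yet */

	pct = s->empty * 100 / s->total;
	if (pct >= hi_pct)
		return FREQ_LOWER;
	if (pct <= lo_pct)
		return FREQ_RAISE;
	return FREQ_KEEP;
}
```

An application would call poll_stats_record() after every rx burst and
evaluate poll_stats_hint() once per sampling window before resetting the
counters.
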

Hi Liang/Radu,

I can see the benefit of providing an API for the application to provide
the num rx from each poll, and then have the library step down/up the
freq based on that. However, I'm not sure I follow why you are adding the
complexity of defining power states and training modes.

> 2.1 Power state definition:
>
> LOW: the frequency is used for purge mode.
>
> MED: the frequency is used to process modest traffic workload.
>
> HIGH: the frequency is used to process busy traffic workload.
>

Why do there need to be user-defined freq levels? Why not just keep
stepping down the freq until some user-defined threshold of zero polls
is reached? e.g. keep stepping down until 10% of polls are zero polls,
with a tail of some time (perhaps user defined) for the step down.

> 2.2 There are two phases to establish the power management system:
>
> a.Initialization/Training phase. There is no traffic pass-through,
> the system will test average empty poll numbers with
> LOW/MED/HIGH power state. Those average empty poll numbers
> will be the baseline
> for the normal phase. The system will collect all core's counter
> every 100ms. The Training phase will take 5 seconds.
>

This requires an application to sit for 5 secs in order to train and
align poll numbers with states? That doesn't seem realistic to me.

> b.Normal phase. When the real traffic pass-though, the system will
> compare run-time empty poll moving average value with base line
> then make decision to move to HIGH power state of MED power
> state. The system will collect all core's counter every 10ms.
>

I only reviewed this commit msg and the API usage, so maybe I didn't
fully get the use case or details, but it seems quite awkward from an
application perspective IMHO.

> 3. Proposed API
>
> 1. rte_power_empty_poll_stat_init(void);
> which is used to initialize the power management system.
>
> 2. rte_power_empty_poll_stat_free(void);
> which is used to free the resource hold by power management system.
>
> 3. rte_power_empty_poll_stat_update(unsigned int lcore_id);
> which is used to update specific core empty poll counter, not thread safe
>
> 4. rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> which is used to update specific core valid poll counter, not thread safe
>

I think 4 could be dropped and 3 used instead. It could be a simple API
that takes in the core and nb_pkts from a poll. That seems clearer than
making a separate API for a special value of nb_pkts (i.e. 0) and the
application having to check to know which API should be called.

> 5. rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core empty poll counter.
>
> 6. rte_power_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core valid poll counter.
>
> 7. rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> which allow user customize the frequency of power state.
>
> 8. rte_power_empty_poll_setup_timer(void);
> which is used to setup the timer/callback to process all above counter.
>

The new API should be experimental.

> ChangeLog:
> v2: fix some coding style issues
> v3: rename the filename, API name.
> v4: updated makefile and symbol list
>
> Signed-off-by: Liang Ma <liang.j...@intel.com>
> Signed-off-by: Radu Nicolau <radu.nico...@intel.com>
> ---
> lib/librte_power/Makefile | 5 +-
> lib/librte_power/meson.build | 5 +-
> lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
> lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
> lib/librte_power/rte_power_version.map | 14 +-
> 5 files changed, 742 insertions(+), 5 deletions(-)
> create mode 100644 lib/librte_power/rte_power_empty_poll.c
> create mode 100644 lib/librte_power/rte_power_empty_poll.h
>

Is there any in-tree documentation planned?

Kevin.
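
To illustrate the two suggestions above (one update call that takes
nb_pkts, and stepping the freq against a zero-poll threshold instead of
trained LOW/MED/HIGH states), here is a rough sketch in C. Everything in
it is hypothetical: the struct, the function names, the index-based
frequency model and the step-up rule are my assumptions, not part of the
patch or of librte_power.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-lcore state; index 0 is assumed to be the highest freq. */
struct lcore_poll_state {
	uint64_t polls;            /* polls in the current window */
	uint64_t empty_polls;      /* polls with nb_pkts == 0 */
	unsigned int freq_idx;     /* current frequency index */
	unsigned int max_idx;      /* lowest available frequency */
	unsigned int idle_windows; /* consecutive windows above the threshold */
};

/* Single update entry point: the application just passes nb_pkts. */
static inline void
lcore_poll_update(struct lcore_poll_state *st, uint16_t nb_pkts)
{
	st->polls++;
	if (nb_pkts == 0)
		st->empty_polls++;
}

/*
 * Called once per sampling window. While the share of empty polls stays
 * above target_pct for tail_windows consecutive windows, step one
 * frequency down. Assumption: step back up as soon as the share drops
 * below target_pct. Returns the frequency index to apply.
 */
static unsigned int
lcore_poll_window_end(struct lcore_poll_state *st, unsigned int target_pct,
		      unsigned int tail_windows)
{
	assert(target_pct <= 100);

	if (st->polls > 0) {
		uint64_t pct = st->empty_polls * 100 / st->polls;

		if (pct > target_pct) {
			/* Spare capacity: wait out the tail, then step down. */
			if (st->idle_windows < tail_windows)
				st->idle_windows++;
			if (st->idle_windows >= tail_windows &&
			    st->freq_idx < st->max_idx) {
				st->freq_idx++;
				st->idle_windows = 0;
			}
		} else {
			st->idle_windows = 0;
			if (pct < target_pct && st->freq_idx > 0)
				st->freq_idx--; /* too busy: step up */
		}
	}

	/* Start a fresh window. */
	st->polls = 0;
	st->empty_polls = 0;
	return st->freq_idx;
}
```

With this shape the application never has to branch on nb_pkts == 0, and
there is no training phase: the only tunables are the threshold and the
tail length.
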