> >>>>>>> > >>>>>>>> Previously, the semantics of power monitor were such that we were > >>>>>>>> checking current value against the expected value, and if they > >>>>>>>> matched, > >>>>>>>> then the sleep was aborted. This is somewhat inflexible, because it > >>>>>>>> only > >>>>>>>> allowed us to check for a specific value. > >>>>>>>> > >>>>>>>> This commit adds an option to reverse the check, so that we can have > >>>>>>>> monitor sleep aborted if the expected value *doesn't* match what's in > >>>>>>>> memory. This allows us to both implement all currently implemented > >>>>>>>> driver code, as well as support more use cases which don't easily > >>>>>>>> map to > >>>>>>>> previous semantics (such as waiting on writes to AF_XDP counter > >>>>>>>> value). > >>>>>>>> > >>>>>>>> Since the old behavior is the default, no need to adjust existing > >>>>>>>> implementations. > >>>>>>>> > >>>>>>>> Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com> > >>>>>>>> --- > >>>>>>>> lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++ > >>>>>>>> lib/eal/x86/rte_power_intrinsics.c | 5 ++++- > >>>>>>>> 2 files changed, 8 insertions(+), 1 deletion(-) > >>>>>>>> > >>>>>>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h > >>>>>>>> b/lib/eal/include/generic/rte_power_intrinsics.h > >>>>>>>> index dddca3d41c..1006c2edfc 100644 > >>>>>>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h > >>>>>>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h > >>>>>>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond { > >>>>>>>> * 4, or 8. Supplying any other value > >>>>>>>> will result in > >>>>>>>> * an error. > >>>>>>>> */ > >>>>>>>> + uint8_t invert; /**< Invert check for expected value (e.g. > >>>>>>>> instead of > >>>>>>>> + * checking if `val` matches something, > >>>>>>>> check if > >>>>>>>> + * `val` *doesn't* match a particular value) > >>>>>>>> + */ > >>>>>>>> }; > >>>>>>>> > >>>>>>>> /** > >>>>>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c > >>>>>>>> b/lib/eal/x86/rte_power_intrinsics.c > >>>>>>>> index 39ea9fdecd..5d944e9aa4 100644 > >>>>>>>> --- a/lib/eal/x86/rte_power_intrinsics.c > >>>>>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c > >>>>>>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct > >>>>>>>> rte_power_monitor_cond *pmc, > >>>>>>>> const uint64_t masked = cur_value & pmc->mask; > >>>>>>>> > >>>>>>>> /* if the masked value is already matching, abort > >>>>>>>> */ > >>>>>>>> - if (masked == pmc->val) > >>>>>>>> + if (!pmc->invert && masked == pmc->val) > >>>>>>>> + goto end; > >>>>>>>> + /* same, but for inverse check */ > >>>>>>>> + if (pmc->invert && masked != pmc->val) > >>>>>>>> goto end; > >>>>>>>> } > >>>>>>>> > >>>>>>> > >>>>>>> Hmm..., such approach looks too 'patchy'... > >>>>>>> Can we at least replace 'inver' with something like: > >>>>>>> enum rte_power_monitor_cond_op { > >>>>>>> _EQ, NEQ,... > >>>>>>> }; > >>>>>>> Then at least new comparions ops can be added in future. > >>>>>>> Even better I think would be to just leave to PMD to provide a > >>>>>>> comparison callback. > >>>>>>> Will make things really simple and generic: > >>>>>>> struct rte_power_monitor_cond { > >>>>>>> volatile void *addr; > >>>>>>> int (*cmp)(uint64_t val); > >>>>>>> uint8_t size; > >>>>>>> }; > >>>>>>> And then in rte_power_monitor(...): > >>>>>>> .... > >>>>>>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size); > >>>>>>> if (pmc->cmp(cur_value) != 0) > >>>>>>> goto end; > >>>>>>> .... > >>>>>>> > >>>>>> > >>>>>> I like the idea of a callback, but these are supposed to be > >>>>>> intrinsic-like functions, so putting too much into them is contrary to > >>>>>> their goal, and it's going to make the API hard to use in simpler cases > >>>>>> (e.g. when we're explicitly calling rte_power_monitor as opposed to > >>>>>> letting the RX callback do it for us). For example, event/dlb code > >>>>>> calls > >>>>>> rte_power_monitor explicitly. > >>>>> > >>>>> Good point, I didn't know that. > >>>>> Would be interesting to see how do they use it. > >>>> > >>>> To be fair, it should be possible to rewrite their code using a > >>>> callback. Perhaps adding a (void *) parameter for any custom data > >>>> related to the callback (because C doesn't have closures...), but > >>>> otherwise it should be doable, so the question isn't that it's > >>>> impossible to rewrite event/dlb code to use callbacks, it's more of an > >>>> issue with complicating usage of already-not-quite-straightforward API > >>>> even more. > >>>> > >>>>> > >>>>>> > >>>>>> It's going to be especially "fun" to do these indirect function calls > >>>>>> from inside transactional region on call to multi-monitor. > >>>>> > >>>>> But the callback is not supposed to do any memory reads/writes. > >>>>> Just mask/compare of the provided value with some constant. > >>>> > >>>> Yeah, but with callbacks we can't really control that, can we? I mean i > >>>> guess a *sane* implementation wouldn't do that, but still, it's > >>>> theoretically possible to perform more complex checks and even touch > >>>> some unrelated data in the process. > >>> > >>> Yep, PMD developer can ignore recommendations and do whatever > >>> he wants in the call-back. We can't control it. > >>> If he touches some memory in it - probably there will be more spurious > >>> wakeups and less power saves. > >>> In principle it is the same with all other PMD dev-ops - we have to trust > >>> that they are > >>> doing what they have to. > >> > >> I did a quick prototype for this, and i don't think it is going to work. > >> > >> Callbacks with just "current value" as argument will be pretty limited > >> and will only really work for cases where we know what we are expecting. > >> However, for cases like event/dlb or net/mlx5, the expected value is (or > >> appears to be) dependent upon some internal device data, and is not > >> constant like in case of net/ixgbe for example. > >> > >> This can be fixed by passing an opaque pointer, either by storing it in > >> the monitor condition, or by passing it directly to rte_power_monitor at > >> invocation time. > >> > >> The latter doesn't work well because when we call rte_power_monitor from > >> inside the rte_power library, we lack the context necessary to get said > >> opaque pointer. > >> > >> The former doesn't work either, because the only place where we can get > >> this argument is inside get_monitor_addr, but the opaque pointer must > >> persist after we exit that function in order to avoid use-after-free - > >> which means that it either has to be statically allocated (which means > >> it's not thread-safe for a non-trivial case), or dynamically allocated > >> (which a big no-no on a hotpath). > > > > If I get you right, expected_value (and probably mask) can be variable ones. > > So for callback approach to work we need to pass all this as parameters > > to PMD comparison callback: > > int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask) > > Correct? > > If we have both expected value, mask, and current value, then what's the > point of the callback? The point of the callback would be to pass just > the current value, and let the callback decide what's the expected value > and how to compare it.
For me the main point of callback is to hide PMD specific comparison semantics. Basically they provide us with some values in struct rte_power_monitor_cond, and then it is up to them how to interpret them in their comparison function. All we'll do for them: will read the value at address provided. I understand that it looks like an overkill, as majority of these comparison functions will be like: int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask) { return ((real_val & mask) == expected_val); } Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it. > > So, we can either let callback handle expected values itself by having > an opaque callback-specific argument (which means it has to persist > between .get_monitor_addr() and rte_power_monitor() calls), But that's what we doing already - PMD fills rte_power_monitor_cond values for us, we store them somewhere and then use them to decide should we go to sleep or not. All callback does - moves actual values interpretation back to PMD: Right now: PMD: provide PMC values POWER: store PMC values somewhere read the value at address provided in PMC interpret PMC values and newly read value and make the decision With callback: PMD: provide PMC values POWER: store PMC values somewhere read the value at address provided in PMC PMD: interpret PMC values and newly read value and make the decision Or did you mean something different here? >or we do the > comparisons inside rte_power_monitor(), and store the expected/mask > values in the monitor condition, and *don't* have any callbacks at all. > Are you suggesting an alternative to the above two options? As I said in my first mail - we can just replace 'inverse' with 'op'. That at least will make this API extendable, if someone will need something different in future. Another option is > > > > >> > >> Any other suggestions? :) > >> > >>> > >>>> > >>>>> > >>>>>> I'm not > >>>>>> opposed to having a callback here, but maybe others have more thoughts > >>>>>> on this? > >>>>>> > >>>>>> -- > >>>>>> Thanks, > >>>>>> Anatoly > >>>> > >>>> > >>>> -- > >>>> Thanks, > >>>> Anatoly > >> > >> > >> -- > >> Thanks, > >> Anatoly > > > -- > Thanks, > Anatoly