Eelco Chaudron <echau...@redhat.com> writes: > On 1 Feb 2024, at 18:28, Aaron Conole wrote: > >> Eelco Chaudron <echau...@redhat.com> writes: >> >>> On 31 Jan 2024, at 18:03, Aaron Conole wrote: >>> >>>> Eelco Chaudron <echau...@redhat.com> writes: >>>> >>>>> On 25 Jan 2024, at 21:55, Aaron Conole wrote: >>>>> >>>>>> From: Kevin Sprague <ksprague0...@gmail.com> >>>>>> >>>>>> During normal operations, it is useful to understand when a particular >>>>>> flow >>>>>> gets removed from the system. This can be useful when debugging >>>>>> performance >>>>>> issues tied to ofproto flow changes, trying to determine deployed traffic >>>>>> patterns, or while debugging dynamic systems where ports come and go. >>>>>> >>>>>> Prior to this change, there was a lack of visibility around flow >>>>>> expiration. >>>>>> The existing debugging infrastructure could tell us when a flow was >>>>>> added to >>>>>> the datapath, but not when it was removed or why. >>>>>> >>>>>> This change introduces a USDT probe at the point where the revalidator >>>>>> determines that the flow should be removed. Additionally, we track the >>>>>> reason for the flow eviction and provide that information as well. With >>>>>> this change, we can track the complete flow lifecycle for the netlink >>>>>> datapath by hooking the upcall tracepoint in kernel, the flow put USDT, >>>>>> and >>>>>> the revaldiator USDT, letting us watch as flows are added and removed >>>>>> from >>>>>> the kernel datapath. >>>>>> >>>>>> This change only enables this information via USDT probe, so it won't be >>>>>> possible to access this information any other way (see: >>>>>> Documentation/topics/usdt-probes.rst). >>>>>> >>>>>> Also included is a script (utilities/usdt-scripts/flow_reval_monitor.py) >>>>>> which serves as a demonstration of how the new USDT probe might be used >>>>>> going forward. >>>>>> >>>>>> Signed-off-by: Kevin Sprague <ksprague0...@gmail.com> >>>>>> Co-authored-by: Aaron Conole <acon...@redhat.com> >>>>>> Signed-off-by: Aaron Conole <acon...@redhat.com> >>>>> >>>>> Thanks for following this up Aaron! See comments on this patch below. I >>>>> have no additional comments on patch 2. >>>>> >>>>> Cheers, >>>>> >>>>> Eelco >>>>> >>>>> >>>>>> --- >>>>>> Documentation/topics/usdt-probes.rst | 1 + >>>>>> ofproto/ofproto-dpif-upcall.c | 42 +- >>>>>> utilities/automake.mk | 3 + >>>>>> utilities/usdt-scripts/flow_reval_monitor.py | 653 +++++++++++++++++++ >>>>>> 4 files changed, 693 insertions(+), 6 deletions(-) >>>>>> create mode 100755 utilities/usdt-scripts/flow_reval_monitor.py >>>>>> >>>>>> diff --git a/Documentation/topics/usdt-probes.rst >>>>>> b/Documentation/topics/usdt-probes.rst >>>>>> index e527f43bab..a8da9bb1f7 100644 >>>>>> --- a/Documentation/topics/usdt-probes.rst >>>>>> +++ b/Documentation/topics/usdt-probes.rst >>>>>> @@ -214,6 +214,7 @@ Available probes in ``ovs_vswitchd``: >>>>>> - dpif_recv:recv_upcall >>>>>> - main:poll_block >>>>>> - main:run_start >>>>>> +- revalidate:flow_result >>>>>> - revalidate_ukey\_\_:entry >>>>>> - revalidate_ukey\_\_:exit >>>>>> - udpif_revalidator:start_dump >>>>> >>>>> You are missing the specific flow_result result section. This is from the >>>>> previous patch: >>>> >>>> D'oh! Thanks for catching it. I'll re-add it. >>>> >>>>> @@ -358,6 +360,27 @@ See also the ``main:run_start`` probe above. >>>>> - ``utilities/usdt-scripts/bridge_loop.bt`` >>>>> >>>>> >>>>> +probe revalidate:flow_result >>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>>>> + >>>>> +**Description**: >>>>> +This probe is triggered when the revalidator decides whether or not to >>>>> +revalidate a flow. ``reason`` is an enum that denotes that either the >>>>> flow >>>>> +is being kept, or the reason why the flow is being deleted. The >>>>> +``flow_reval_monitor.py`` script uses this probe to notify users when >>>>> flows >>>>> +matching user-provided criteria are deleted. >>>>> + >>>>> +**Arguments**: >>>>> + >>>>> +- *arg0*: ``(enum flow_del_reason) reason`` >>>>> +- *arg1*: ``(struct udpif *) udpif`` >>>>> +- *arg2*: ``(struct udpif_key *) ukey`` >>>>> + >>>>> +**Script references**: >>>>> + >>>>> +- ``utilities/usdt-scripts/flow_reval_monitor.py`` >>>>> + >>>>> + >>>>> Adding your own probes >>>>> ---------------------- >>>>> >>>>>> diff --git a/ofproto/ofproto-dpif-upcall.c >>>>>> b/ofproto/ofproto-dpif-upcall.c >>>>>> index b5cbeed878..97d75833f7 100644 >>>>>> --- a/ofproto/ofproto-dpif-upcall.c >>>>>> +++ b/ofproto/ofproto-dpif-upcall.c >>>>>> @@ -269,6 +269,18 @@ enum ukey_state { >>>>>> }; >>>>>> #define N_UKEY_STATES (UKEY_DELETED + 1) >>>>>> >>>>>> +enum flow_del_reason { >>>>>> + FDR_REVALIDATE = 0, /* The flow was revalidated. */ >>>>> >>>>> It was called FDR_FLOW_LIVE before, which might make more sense. As the >>>>> flow is just NOT deleted. It might or might not have been revalidated. >>>>> Thoughts? >>>> >>>> I think it had to have been revalidated if we emit the reason, because >>>> we only emit the reason code after revalidation. IE: there are many >>>> places where we skip revalidation but the flow stays live - and we don't >>>> emit reasons in those cases. >>>> >>>> So at least for this patch, it MUST have been revalidated. But maybe in >>>> the future, we would want to catch cases where the flow hasn't been. In >>>> that case, it makes sense to add the FDR_FLOW_LIVE at that time - I >>>> think. >>>> >>>> Maybe you disagree? >>> >>> Well, it depends on how you define revalidation, it might only have updated >>> the counters. i.e. it all depends on ‘bool need_revalidate = >>> ukey->reval_seq != reval_seq;’ in revalidate_ukey(). That was why I opted >>> for a more general name. >> >> Yes, but FDR_FLOW_LIVE seems too generic. At least, FDR_REVALIDATE >> means it was flagged for revalidation, and in this case we did that, but >> we just didn't need to do any key comparisons. >> >> I guess it doesn't matter too much, but the comment is that the flow >> went through revalidation. If there becomes some other case where it >> never calls revalidate_ukey() maybe that makes sense? > > Looking at the probe again, we should have the result as well, as FDR_ stands > for flow delete reason, so if we do not delete the reason makes no sense. I > gues FDR_REVALIDATE, shoud he FDR_NONE. > > What about the following change making it clearer, and the probes to be more > precise (also change probe arg order to be more aligned with other probes): > > > diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c > index 97d75833f..61fba1766 100644 > --- a/ofproto/ofproto-dpif-upcall.c > +++ b/ofproto/ofproto-dpif-upcall.c > @@ -270,15 +270,17 @@ enum ukey_state { > #define N_UKEY_STATES (UKEY_DELETED + 1) > > enum flow_del_reason { > - FDR_REVALIDATE = 0, /* The flow was revalidated. */ > + FDR_NONE = 0, /* The flow was deleted. */ > + FDR_AVOID_CACHING, /* Flow deleted to avoid caching. */ > + FDR_BAD_ODP_FIT, /* The flow had a bad ODP flow fit. */ > FDR_FLOW_IDLE, /* The flow went unused and was deleted. */ > - FDR_TOO_EXPENSIVE, /* The flow was too expensive to revalidate. */ > + FDR_FLOW_LIMIT, /* Flow limit reached, removing all flows. */ > FDR_FLOW_WILDCARDED, /* The flow needed a narrower wildcard mask. */ > - FDR_BAD_ODP_FIT, /* The flow had a bad ODP flow fit. */ > + FDR_PURGE, /* User requested flow purge. */ > FDR_NO_OFPROTO, /* The flow didn't have an associated ofproto. */ > + FDR_TOO_EXPENSIVE, /* The flow was too expensive to revalidate. */ > + FDR_UPDATE_FAIL, /* Updating the flow in the dp failed. */ > FDR_XLATION_ERROR, /* There was an error translating the flow. */ > - FDR_AVOID_CACHING, /* Flow deleted to avoid caching. */ > - FDR_FLOW_LIMIT, /* All flows being killed. */ > }; > > /* 'udpif_key's are responsible for tracking the little bit of state udpif > @@ -2285,7 +2287,7 @@ static enum reval_result > revalidate_ukey__(struct udpif *udpif, const struct udpif_key *ukey, > uint16_t tcp_flags, struct ofpbuf *odp_actions, > struct recirc_refs *recircs, struct xlate_cache *xcache, > - enum flow_del_reason *reason) > + enum flow_del_reason *del_reason) > { > struct xlate_out *xoutp; > struct netflow *netflow; > @@ -2306,13 +2308,13 @@ revalidate_ukey__(struct udpif *udpif, const struct > udpif_key *ukey, > netflow = NULL; > > if (xlate_ukey(udpif, ukey, tcp_flags, &ctx)) { > - *reason = FDR_XLATION_ERROR; > + *del_reason = FDR_XLATION_ERROR; > goto exit; > } > xoutp = &ctx.xout; > > if (xoutp->avoid_caching) { > - *reason = FDR_AVOID_CACHING; > + *del_reason = FDR_AVOID_CACHING; > goto exit; > } > > @@ -2326,7 +2328,7 @@ revalidate_ukey__(struct udpif *udpif, const struct > udpif_key *ukey, > ofpbuf_clear(odp_actions); > > if (!ofproto) { > - *reason = FDR_NO_OFPROTO; > + *del_reason = FDR_NO_OFPROTO; > goto exit; > } > > @@ -2338,7 +2340,7 @@ revalidate_ukey__(struct udpif *udpif, const struct > udpif_key *ukey, > if (odp_flow_key_to_mask(ukey->mask, ukey->mask_len, &dp_mask, &ctx.flow, > NULL) > == ODP_FIT_ERROR) { > - *reason = FDR_BAD_ODP_FIT; > + *del_reason = FDR_BAD_ODP_FIT; > goto exit; > } > > @@ -2348,7 +2350,7 @@ revalidate_ukey__(struct udpif *udpif, const struct > udpif_key *ukey, > * down. Note that we do not know if the datapath has ignored any of the > * wildcarded bits, so we may be overly conservative here. */ > if (flow_wildcards_has_extra(&dp_mask, ctx.wc)) { > - *reason = FDR_FLOW_WILDCARDED; > + *del_reason = FDR_FLOW_WILDCARDED; > goto exit; > } > > @@ -2418,13 +2420,14 @@ static enum reval_result > revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey, > const struct dpif_flow_stats *stats, > struct ofpbuf *odp_actions, uint64_t reval_seq, > - struct recirc_refs *recircs, enum flow_del_reason *reason) > + struct recirc_refs *recircs, enum flow_del_reason > *del_reason) > OVS_REQUIRES(ukey->mutex) > { > bool need_revalidate = ukey->reval_seq != reval_seq; > enum reval_result result = UKEY_DELETE; > struct dpif_flow_stats push; > > + *del_reason = FDR_NONE; > ofpbuf_clear(odp_actions); > > push.used = stats->used; > @@ -2449,10 +2452,10 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key > *ukey, > } > result = revalidate_ukey__(udpif, ukey, push.tcp_flags, > odp_actions, recircs, ukey->xcache, > - reason); > + del_reason); > } else { > /* delete; too expensive to revalidate */ > - *reason = FDR_TOO_EXPENSIVE; > + *del_reason = FDR_TOO_EXPENSIVE; > } > } else if (!push.n_packets || ukey->xcache > || !populate_xcache(udpif, ukey, push.tcp_flags)) { > @@ -2853,8 +2856,8 @@ revalidate(struct revalidator *revalidator) > for (f = flows; f < &flows[n_dumped]; f++) { > long long int used = f->stats.used; > struct recirc_refs recircs = RECIRC_REFS_EMPTY_INITIALIZER; > - enum flow_del_reason reason = FDR_REVALIDATE; > struct dpif_flow_stats stats = f->stats; > + enum flow_del_reason del_reason; > enum reval_result result; > struct udpif_key *ukey; > bool already_dumped; > @@ -2929,13 +2932,13 @@ revalidate(struct revalidator *revalidator) > if (kill_them_all || (used && used < now - max_idle)) { > result = UKEY_DELETE; > if (kill_them_all) { > - reason = FDR_FLOW_LIMIT; > + del_reason = FDR_FLOW_LIMIT; > } else { > - reason = FDR_FLOW_IDLE; > + del_reason = FDR_FLOW_IDLE; > } > } else { > result = revalidate_ukey(udpif, ukey, &stats, &odp_actions, > - reval_seq, &recircs, &reason); > + reval_seq, &recircs, &del_reason); > } > ukey->dump_seq = dump_seq; > > @@ -2944,7 +2947,9 @@ revalidate(struct revalidator *revalidator) > udpif_update_flow_pps(udpif, ukey, f); > } > > - OVS_USDT_PROBE(revalidate, flow_result, reason, udpif, ukey); > + OVS_USDT_PROBE(revalidate, flow_result, udpif, ukey, result, > + del_reason); > + > if (result != UKEY_KEEP) { > /* Takes ownership of 'recircs'. */ > reval_op_init(&ops[n_ops++], result, udpif, ukey, &recircs, > @@ -2991,7 +2996,6 @@ revalidator_sweep__(struct revalidator *revalidator, > bool purge) > uint64_t odp_actions_stub[1024 / 8]; > struct ofpbuf odp_actions = > OFPBUF_STUB_INITIALIZER(odp_actions_stub); > > - enum flow_del_reason reason = FDR_REVALIDATE; > struct ukey_op ops[REVALIDATE_MAX_BATCH]; > struct udpif_key *ukey; > struct umap *umap = &udpif->ukeys[i]; > @@ -3012,10 +3016,12 @@ revalidator_sweep__(struct revalidator *revalidator, > bool purge) > struct recirc_refs recircs = RECIRC_REFS_EMPTY_INITIALIZER; > bool seq_mismatch = (ukey->dump_seq != dump_seq > && ukey->reval_seq != reval_seq); > + enum flow_del_reason del_reason = FDR_NONE; > enum reval_result result; > > if (purge || ukey_state == UKEY_INCONSISTENT) { > result = UKEY_DELETE; > + del_reason = purge ? FDR_PURGE : FDR_UPDATE_FAIL; > } else if (!seq_mismatch) { > result = UKEY_KEEP; > } else { > @@ -3023,8 +3029,12 @@ revalidator_sweep__(struct revalidator *revalidator, > bool purge) > COVERAGE_INC(revalidate_missed_dp_flow); > memcpy(&stats, &ukey->stats, sizeof stats); > result = revalidate_ukey(udpif, ukey, &stats, > &odp_actions, > - reval_seq, &recircs, &reason); > + reval_seq, &recircs, > &del_reason); > } > + > + OVS_USDT_PROBE(revalidator_sweep__, flow_result, udpif, ukey, > + result, del_reason); > + > if (result != UKEY_KEEP) { > /* Clears 'recircs' if filled by revalidate_ukey(). */ > reval_op_init(&ops[n_ops++], result, udpif, ukey, > &recircs, >
Okay - the above makes sense to me, with reasons included, so I'll fold it in. >>>>>> + FDR_FLOW_IDLE, /* The flow went unused and was deleted. */ >>>>>> + FDR_TOO_EXPENSIVE, /* The flow was too expensive to >>>>>> revalidate. */ >>>>>> + FDR_FLOW_WILDCARDED, /* The flow needed a narrower wildcard >>>>>> mask. */ >>>>>> + FDR_BAD_ODP_FIT, /* The flow had a bad ODP flow fit. */ >>>>>> + FDR_NO_OFPROTO, /* The flow didn't have an associated >>>>>> ofproto. */ >>>>>> + FDR_XLATION_ERROR, /* There was an error translating the flow. >>>>>> */ >>>>>> + FDR_AVOID_CACHING, /* Flow deleted to avoid caching. */ >>>>>> + FDR_FLOW_LIMIT, /* All flows being killed. */ >>>>> >>>>> Looking at the comment from Han on FDR_PURGE, and this patch needing >>>>> another spin, we should probably add it. >>>> >>>> I can do that, sure. In that case, we will need to have a new flow op >>>> added to revalidator_sweep__ so that we can catch it. But in that case, >>>> it will be a different usdt probe, so I still don't know if we need >>>> FDR_PURGE right? WDYT? >>> >>> In revalidator_sweep__() you have sort of the following: >>> >>> if (purge || ukey_state == UKEY_INCONSISTENT) { >>> result = UKEY_DELETE; >>> } else if (!seq_mismatch) { >>> >>> And I’m afraid that if we use this tool to debug we miss the ukey_state == >>> UKEY_INCONSISTENT when debugging and spent a long time figuring this out. >>> Maybe add something general like this (did not give it a lot of thought), >>> and only take the FDR_PURGE : FDR_UPDATE_FAIL results in the script? >> >> I see - I guess that only happens when the error case errors out, but I >> guess we can have it. >> >>> /* 'udpif_key's are responsible for tracking the little bit of state udpif >>> @@ -2991,13 +2993,13 @@ revalidator_sweep__(struct revalidator >>> *revalidator, bool purge) >>> uint64_t odp_actions_stub[1024 / 8]; >>> struct ofpbuf odp_actions = >>> OFPBUF_STUB_INITIALIZER(odp_actions_stub); >>> >>> - enum flow_del_reason reason = FDR_REVALIDATE; >>> struct ukey_op ops[REVALIDATE_MAX_BATCH]; >>> struct udpif_key *ukey; >>> struct umap *umap = &udpif->ukeys[i]; >>> size_t n_ops = 0; >>> >>> CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) { >>> + enum flow_del_reason reason = FDR_REVALIDATE; >>> enum ukey_state ukey_state; >>> >>> /* Handler threads could be holding a ukey lock while it >>> installs a >>> @@ -3016,8 +3018,10 @@ revalidator_sweep__(struct revalidator *revalidator, >>> bool purge) >>> >>> if (purge || ukey_state == UKEY_INCONSISTENT) { >>> result = UKEY_DELETE; >>> + reason = purge ? FDR_PURGE : FDR_UPDATE_FAIL; >>> } else if (!seq_mismatch) { >>> result = UKEY_KEEP; >>> + reason = FDR_REVALIDATE; //_KEEP >>> } else { >>> struct dpif_flow_stats stats; >>> COVERAGE_INC(revalidate_missed_dp_flow); >>> @@ -3030,6 +3034,8 @@ revalidator_sweep__(struct revalidator *revalidator, >>> bool purge) >>> reval_op_init(&ops[n_ops++], result, udpif, ukey, >>> &recircs, >>> &odp_actions); >>> } >>> + OVS_USDT_PROBE(revalidator_sweep__, flow_result, result, >>> + reason, udpif, ukey); >>> } >>> ovs_mutex_unlock(&ukey->mutex); >>> >>> >>> >>> >>> In addition in revalidator_sweep__() should the “enum flow_del_reason >>> reason = FDR_REVALIDATE;” not be moved to the CMAP_FOR_EACH() loop? >> >> Okay - I can add this probe as well (and document it, etc). I think it >> probably will be pretty noisy in the cases it happens. We can change >> the scope as well. > > It will only be noisy on the flush, but this is fine. Now we will also catch > users doing stuff (you never know what kind of script/debugging they do in > production). > > See code suggestion above. ACK - thanks. >>> >>>>>> +}; >>>>>> + >>>>>> /* 'udpif_key's are responsible for tracking the little bit of state >>>>>> udpif >>>>>> * needs to do flow expiration which can't be pulled directly from the >>>>>> * datapath. They may be created by any handler or revalidator thread >>>>>> at any >>>>>> @@ -2272,7 +2284,8 @@ populate_xcache(struct udpif *udpif, struct >>>>>> udpif_key *ukey, >>>>>> static enum reval_result >>>>>> revalidate_ukey__(struct udpif *udpif, const struct udpif_key *ukey, >>>>>> uint16_t tcp_flags, struct ofpbuf *odp_actions, >>>>>> - struct recirc_refs *recircs, struct xlate_cache >>>>>> *xcache) >>>>>> + struct recirc_refs *recircs, struct xlate_cache >>>>>> *xcache, >>>>>> + enum flow_del_reason *reason) >>>>>> { >>>>>> struct xlate_out *xoutp; >>>>>> struct netflow *netflow; >>>>>> @@ -2293,11 +2306,13 @@ revalidate_ukey__(struct udpif *udpif, const >>>>>> struct udpif_key *ukey, >>>>>> netflow = NULL; >>>>>> >>>>>> if (xlate_ukey(udpif, ukey, tcp_flags, &ctx)) { >>>>>> + *reason = FDR_XLATION_ERROR; >>>>>> goto exit; >>>>>> } >>>>>> xoutp = &ctx.xout; >>>>>> >>>>>> if (xoutp->avoid_caching) { >>>>>> + *reason = FDR_AVOID_CACHING; >>>>>> goto exit; >>>>>> } >>>>>> >>>>>> @@ -2311,6 +2326,7 @@ revalidate_ukey__(struct udpif *udpif, const >>>>>> struct udpif_key *ukey, >>>>>> ofpbuf_clear(odp_actions); >>>>>> >>>>>> if (!ofproto) { >>>>>> + *reason = FDR_NO_OFPROTO; >>>>>> goto exit; >>>>>> } >>>>>> >>>>>> @@ -2322,6 +2338,7 @@ revalidate_ukey__(struct udpif *udpif, const >>>>>> struct udpif_key *ukey, >>>>>> if (odp_flow_key_to_mask(ukey->mask, ukey->mask_len, &dp_mask, >>>>>> &ctx.flow, >>>>>> NULL) >>>>>> == ODP_FIT_ERROR) { >>>>>> + *reason = FDR_BAD_ODP_FIT; >>>>>> goto exit; >>>>>> } >>>>>> >>>>>> @@ -2331,6 +2348,7 @@ revalidate_ukey__(struct udpif *udpif, const >>>>>> struct udpif_key *ukey, >>>>>> * down. Note that we do not know if the datapath has ignored any >>>>>> of the >>>>>> * wildcarded bits, so we may be overly conservative here. */ >>>>>> if (flow_wildcards_has_extra(&dp_mask, ctx.wc)) { >>>>>> + *reason = FDR_FLOW_WILDCARDED; >>>>>> goto exit; >>>>>> } >>>>>> >>>>>> @@ -2400,7 +2418,7 @@ static enum reval_result >>>>>> revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey, >>>>>> const struct dpif_flow_stats *stats, >>>>>> struct ofpbuf *odp_actions, uint64_t reval_seq, >>>>>> - struct recirc_refs *recircs) >>>>>> + struct recirc_refs *recircs, enum flow_del_reason >>>>>> *reason) >>>>>> OVS_REQUIRES(ukey->mutex) >>>>>> { >>>>>> bool need_revalidate = ukey->reval_seq != reval_seq; >>>>>> @@ -2430,8 +2448,12 @@ revalidate_ukey(struct udpif *udpif, struct >>>>>> udpif_key *ukey, >>>>>> xlate_cache_clear(ukey->xcache); >>>>>> } >>>>>> result = revalidate_ukey__(udpif, ukey, push.tcp_flags, >>>>>> - odp_actions, recircs, >>>>>> ukey->xcache); >>>>>> - } /* else delete; too expensive to revalidate */ >>>>>> + odp_actions, recircs, >>>>>> ukey->xcache, >>>>>> + reason); >>>>>> + } else { >>>>>> + /* delete; too expensive to revalidate */ >>>>>> + *reason = FDR_TOO_EXPENSIVE; >>>>>> + } >>>>>> } else if (!push.n_packets || ukey->xcache >>>>>> || !populate_xcache(udpif, ukey, push.tcp_flags)) { >>>>>> result = UKEY_KEEP; >>>>>> @@ -2831,6 +2853,7 @@ revalidate(struct revalidator *revalidator) >>>>>> for (f = flows; f < &flows[n_dumped]; f++) { >>>>>> long long int used = f->stats.used; >>>>>> struct recirc_refs recircs = RECIRC_REFS_EMPTY_INITIALIZER; >>>>>> + enum flow_del_reason reason = FDR_REVALIDATE; >>>>>> struct dpif_flow_stats stats = f->stats; >>>>>> enum reval_result result; >>>>>> struct udpif_key *ukey; >>>>>> @@ -2905,9 +2928,14 @@ revalidate(struct revalidator *revalidator) >>>>>> } >>>>>> if (kill_them_all || (used && used < now - max_idle)) { >>>>>> result = UKEY_DELETE; >>>>>> + if (kill_them_all) { >>>>>> + reason = FDR_FLOW_LIMIT; >>>>>> + } else { >>>>>> + reason = FDR_FLOW_IDLE; >>>>>> + } >>>>>> } else { >>>>>> result = revalidate_ukey(udpif, ukey, &stats, >>>>>> &odp_actions, >>>>>> - reval_seq, &recircs); >>>>>> + reval_seq, &recircs, &reason); >>>>>> } >>>>>> ukey->dump_seq = dump_seq; >>>>>> >>>>>> @@ -2916,6 +2944,7 @@ revalidate(struct revalidator *revalidator) >>>>>> udpif_update_flow_pps(udpif, ukey, f); >>>>>> } >>>>>> >>>>>> + OVS_USDT_PROBE(revalidate, flow_result, reason, udpif, >>>>>> ukey); >>>>>> if (result != UKEY_KEEP) { >>>>>> /* Takes ownership of 'recircs'. */ >>>>>> reval_op_init(&ops[n_ops++], result, udpif, ukey, >>>>>> &recircs, >>>>>> @@ -2962,6 +2991,7 @@ revalidator_sweep__(struct revalidator >>>>>> *revalidator, bool purge) >>>>>> uint64_t odp_actions_stub[1024 / 8]; >>>>>> struct ofpbuf odp_actions = >>>>>> OFPBUF_STUB_INITIALIZER(odp_actions_stub); >>>>>> >>>>>> + enum flow_del_reason reason = FDR_REVALIDATE; >>>>>> struct ukey_op ops[REVALIDATE_MAX_BATCH]; >>>>>> struct udpif_key *ukey; >>>>>> struct umap *umap = &udpif->ukeys[i]; >>>>>> @@ -2993,7 +3023,7 @@ revalidator_sweep__(struct revalidator >>>>>> *revalidator, bool purge) >>>>>> COVERAGE_INC(revalidate_missed_dp_flow); >>>>>> memcpy(&stats, &ukey->stats, sizeof stats); >>>>>> result = revalidate_ukey(udpif, ukey, &stats, >>>>>> &odp_actions, >>>>>> - reval_seq, &recircs); >>>>>> + reval_seq, &recircs, >>>>>> &reason); >>>>>> } >>>>>> if (result != UKEY_KEEP) { >>>>>> /* Clears 'recircs' if filled by revalidate_ukey(). >>>>>> */ >>>>>> diff --git a/utilities/automake.mk b/utilities/automake.mk >>>>>> index 9a2114df40..146b8c37fb 100644 >>>>>> --- a/utilities/automake.mk >>>>>> +++ b/utilities/automake.mk >>>>>> @@ -23,6 +23,7 @@ scripts_DATA += utilities/ovs-lib >>>>>> usdt_SCRIPTS += \ >>>>>> utilities/usdt-scripts/bridge_loop.bt \ >>>>>> utilities/usdt-scripts/dpif_nl_exec_monitor.py \ >>>>>> + utilities/usdt-scripts/flow_reval_monitor.py \ >>>>>> utilities/usdt-scripts/kernel_delay.py \ >>>>>> utilities/usdt-scripts/kernel_delay.rst \ >>>>>> utilities/usdt-scripts/reval_monitor.py \ >>>>>> @@ -72,6 +73,7 @@ EXTRA_DIST += \ >>>>>> utilities/docker/debian/build-kernel-modules.sh \ >>>>>> utilities/usdt-scripts/bridge_loop.bt \ >>>>>> utilities/usdt-scripts/dpif_nl_exec_monitor.py \ >>>>>> + utilities/usdt-scripts/flow_reval_monitor.py \ >>>>>> utilities/usdt-scripts/kernel_delay.py \ >>>>>> utilities/usdt-scripts/kernel_delay.rst \ >>>>>> utilities/usdt-scripts/reval_monitor.py \ >>>>>> @@ -146,6 +148,7 @@ FLAKE8_PYFILES += utilities/ovs-pcap.in \ >>>>>> utilities/ovs-tcpdump.in \ >>>>>> utilities/ovs-pipegen.py \ >>>>>> utilities/usdt-scripts/dpif_nl_exec_monitor.py \ >>>>>> + utilities/usdt-scripts/flow_reval_monitor.py \ >>>>>> utilities/usdt-scripts/upcall_monitor.py \ >>>>>> utilities/usdt-scripts/upcall_cost.py >>>>>> >>>>>> diff --git a/utilities/usdt-scripts/flow_reval_monitor.py >>>>>> b/utilities/usdt-scripts/flow_reval_monitor.py >>>>>> new file mode 100755 >>>>>> index 0000000000..e808020bb5 >>>>>> --- /dev/null >>>>>> +++ b/utilities/usdt-scripts/flow_reval_monitor.py >>>>>> @@ -0,0 +1,653 @@ >>>>>> +#!/usr/bin/env python3 >>>>>> +# >>>>>> +# Copyright (c) 2022 Redhat, Inc. >>>>>> +# >>>>>> +# Licensed under the Apache License, Version 2.0 (the "License"); >>>>>> +# you may not use this file except in compliance with the License. >>>>>> +# You may obtain a copy of the License at: >>>>>> +# >>>>>> +# http://www.apache.org/licenses/LICENSE-2.0 >>>>>> +# >>>>>> +# Unless required by applicable law or agreed to in writing, software >>>>>> +# distributed under the License is distributed on an "AS IS" BASIS, >>>>>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >>>>>> implied. >>>>>> +# See the License for the specific language governing permissions and >>>>>> +# limitations under the License. >>>>>> +# >>>>>> +# Script information: >>>>>> +# ------------------- >>>>>> +# flow_reval_monitor.py uses the dpif_netlink_operate:flow_put and >>>>>> +# revalidator:flow_result USDT probes to monitor flow lifetimes and >>>>>> +# expiration events. By default, this will show all flow_put and flow >>>>>> +# expiration events, along with their reasons. This will look like so: >>>>>> +# >>>>>> +# TIME UFID >>>>>> EVENT/REASON >>>>>> +# 101536.226986736 ufid:f76fc899-376d-466b-bc74-0000b933eb97 >>>>>> flow_put >>>>>> +# 101536.227196214 ufid:d08472b6-110e-46cb-a9e4-00008f46838e >>>>>> flow_put >>>>>> +# 101541.516610178 ufid:fc5cc4a2-39e7-4a2d-bbce-000019665b32 >>>>>> flow_put >>>>>> +# 101541.516967303 ufid:fddd6510-26dc-4c87-8f7a-0000fc0c2c3a >>>>>> flow_put >>>>>> +# 101551.688050747 ufid:fddd6510-26dc-4c87-8f7a-0000fc0c2c3a flow >>>>>> timed out >>>>>> +# 101551.688077175 ufid:fc5cc4a2-39e7-4a2d-bbce-000019665b32 flow >>>>>> timed out >>>>>> +# 101557.695391371 ufid:f76fc899-376d-466b-bc74-0000b933eb97 flow >>>>>> timed out >>>>>> +# 101557.695408909 ufid:d08472b6-110e-46cb-a9e4-00008f46838e flow >>>>>> timed out >>>>>> +# >>>>>> +# flow key data can be printed using the --flow-keys option. This will >>>>>> +# print the equivalent datapath flow string. >>>>>> +# >>>>>> +# When filtering flows, the syntax is the same as used by >>>>>> +# `ovs-appctl dpctl/add-flow`. >>>>>> +# >>>>>> +# The following options are available: >>>>>> +# >>>>>> +# usage: flow_reval_monitor.py [-h] [--buffer-page-count NUMBER] >>>>>> +# [-k [FLOW_KEYS]] [-p VSWITCHD_PID] >>>>>> +# [-D [DEBUG]] [-f [FLOW STRING ...]] >>>>> >>>>> # usage: flow_reval_monitor.py [-h] [--buffer-page-count NUMBER] >>>>> [-f [64-2048]] [-k] [-l [FLOW_STRING >>>>> ...]] >>>>> [-p VSWITCHD_PID] [-D [DEBUG]] >>>> >>>> Oops, I'll fix it up. >>>> >>>>>> +# >>>>>> +# optional arguments: >>>>>> +# -h, --help show this help message and exit >>>>>> +# --buffer-page-count NUMBER >>>>>> +# Number of BPF ring buffer pages, default 1024 >>>>>> +# -f <64..2048>, --flow-key-size=<64..2048> >>>>>> +# Set the size of the flow key, default 64 >>>>>> +# -k, --flow-keys Print flow keys as flow strings >>>>>> +# -l [FLOW_STRING ...], --filter-flows [FLOW_STRING ...] >>>>>> +# Filter flows that match the specified >>>>>> ODP-like flow >>>>> >>>>> We do not filter on the flow itself but on the packet content/keys >>>>> creating the flow. >>>>> We might want to clarify this as the actual DP flow might not include., >>>>> for example, the IP fields. >>>> >>>> I guess it's ambiguous. I'll try and clean up the language. Because we >>>> are filtering on the ODP flow key, and not an openflow string. >>>> >>>>>> +# -p VSWITCHD_PID, --pid VSWITCHD_PID >>>>>> +# ovs-vswitchd's PID >>>>>> +# -D [DEBUG], --debug [DEBUG] >>>>>> +# Enable eBPF debugging >>>>>> +# >>>>>> +# Examples: >>>>>> +# >>>>>> +# To use the script on a running ovs-vswitchd to see flow keys and >>>>>> expiration >>>>>> +# events for flows with an ipv4 source of 192.168.10.10: >>>>>> +# $ ./flow_reval_monitor.py --flow-keys --filter-flows \ >>>>>> +# "ipv4(src=192.168.10.10)" >>>>> >>>>> Can we add some details on what kind of filters/format is supported? For >>>>> example no mask support. >>>> >>>> Sure, I can add it. >>>> >>>>>> +# TIME UFID >>>>>> EVENT/REASON >>>>>> +# 105082.457322742 ufid:f76fc899-376d-466b-bc74-0000b933eb97 >>>>>> flow_put >>>>>> +# ufid:f76fc899-376d-466b-bc74-0000b933eb97 has the following flow >>>>>> information: >>>>>> +# in_port(2), >>>>>> +# eth(src=0e:04:47:fc:74:51, dst=da:dc:c5:69:05:d7), \ >>>>>> +# eth_type(0x800), \ >>>>>> +# ipv4(src=192.168.10.10, dst=192.168.10.30, proto=1, tos=0, >>>>>> ttl=64,[...]), >>>>>> +# icmp(type=8, code=0) >>>>>> +# 105092.635450202 ufid:f76fc899-376d-466b-bc74-0000b933eb97 Flow >>>>>> timed out >>>>>> +# >>>>>> +# Notes: >>>>>> +# 1) No options are needed to attach when there is a single running >>>>>> instance >>>>>> +# of ovs-vswitchd. >>>>>> +# 2) If you're using the flow filtering option, it will only track >>>>>> flows that >>>>>> +# have been upcalled since the script began running. >>>>>> +# 3) When using the flow filtering option, the key size will likely >>>>>> need to >>>>>> +# be expanded to match on all the fields in the message. The >>>>>> default is >>>>>> +# kept small to keep the buffer copy sizes down when displaying >>>>>> +# flows (-k), but is hardcoded to 2048 when an actual filter (-l) >>>>>> is >>>>>> +# applied >>>>> >>>>> We should add a note that the flow_put part is not included when HW >>>>> offload (TC) is used for the kernel datapath, or if DPDK is used. >>>> >>>> That makes sense. But we will still have a revalidator output in >>>> f.e. the DPDK case, IIRC. >>> >>> True, I just want to make sure we are clear that we will not see the >>> flow_put messages hence the filtering will not work. >> >> Well, filtering could still work. Just needs a tweak of the bpf code to >> include the key_ptr and key_len data. Maybe we can still include it. >> For now, we don't pass on the udpif's key data, even though we have it >> available. > > The only problem here is performance, now we will send each deleted down, > whereas, in his original design, we only sent it down if it was in the ukey > table. > > Maybe we should keep it as is for now, and think about the TC and DPDK use > case in a follow up patch? That makes sense to me. >>>>>> >>>>>> +try: >>>>>> + from bcc import BPF >>>>>> + from bcc import USDT >>>>>> + from bcc import USDTException >>>>>> +except ModuleNotFoundError: >>>>>> + print("ERROR: Can't find the BPF Compiler Collection Tools.") >>>>>> + print("Please install them before running this script.") >>>>>> + exit(1) >>>>>> + >>>>>> +import argparse >>>>>> +from ipaddress import IPv4Address, IPv6Address >>>>>> +import psutil >>>>>> +import struct >>>>>> +import sys >>>>>> +import time >>>>>> + >>>>>> +# >>>>>> +# eBPF source code >>>>>> +# >>>>>> +bpf_src = """ >>>>>> +#include <linux/sched.h> >>>>>> +#include <uapi/linux/ptrace.h> >>>>>> + >>>>>> +#define MAX_KEY <MAX_KEY_VAL> >>>>>> +#define FLOW_FILTER <FILTER_BOOL> >>>>>> + >>>>>> +enum probe { OP_FLOW_PUT, FLOW_RESULT }; >>>>>> + >>>>>> +typedef union ovs_u128 { >>>>>> + unsigned int ufid32[4]; >>>>>> + unsigned long long ufid64[2]; >>>>>> +} ovs_u128; >>>>>> + >>>>>> +struct dpif_flow_put { >>>>>> + int flags; >>>>>> + void *key_ptr; >>>>>> + size_t key_len; >>>>>> + void *mask_ptr; >>>>>> + size_t mask_len; >>>>>> + u64 action_ptr; >>>>>> + size_t action_len; >>>>>> + void *ufid_ptr; >>>>>> +}; >>>>>> + >>>>>> +struct udpif_key { >>>>>> + void *cmap_node; >>>>>> + void *key_ptr; >>>>>> + size_t key_len; >>>>>> + void *mask_ptr; >>>>>> + size_t mask_len; >>>>>> + ovs_u128 ufid; >>>>>> +}; >>>>>> + >>>>>> +struct event_t { >>>>>> + u64 ts; >>>>>> + u32 reason; >>>>>> + u32 ufid[4]; /* Can't seem to make the ovs_u128 pass to python >>>>>> side. */ >>>>> >>>>> Is this still true? >>>> >>>> I didn't try it. Actually, I think these data structures can all be >>>> extracted with pahole or something which converts the ovs_u128. >>>> >>>> Actually I think there's some converter under the hood and it doesn't >>>> have a mapping of what 'ovs_u128' means. So we need to basically teach >>>> it to make it work if we want that. >>> >>> Ok, not a blocking thing, just wondered if there was a quick fix or not. I >>> thought it might be related to the BCC issue. >> >> No, not directly. There is a side effect of bcc relying on CTypes to do >> the type mapping - and in that case, it won't know what ovs_u128 looks >> like. It doesn't matter too much, because something like pahole will >> spit out a struct that will be basic data types (so bcc will understand >> it just fine). >> >> I don't think it makes sense to try and fix it up here. > > ACK > >>>>>> + u64 key_size; >>>>>> + u8 key[MAX_KEY]; >>>>>> + enum probe probe; >>>>>> +}; >>>>>> + >>>>>> +BPF_HASH(watchlist, ovs_u128); >>>>>> +BPF_RINGBUF_OUTPUT(events, <BUFFER_PAGE_COUNT>); >>>>>> + >>>>>> +int usdt__flow_result(struct pt_regs *ctx) { >>>>>> + u64 *ufid_present = NULL; >>>>>> + struct udpif_key ukey; >>>>>> + >>>>>> + bpf_usdt_readarg_p(3, ctx, &ukey, sizeof ukey); >>>>>> + ovs_u128 ufid = ukey.ufid; >>>>>> + ufid_present = watchlist.lookup(&ufid); >>>>>> + if(FLOW_FILTER && !ufid_present) { >>>>>> + return 0; >>>>>> + } >>>>>> + >>>>>> + struct event_t *event = events.ringbuf_reserve(sizeof(struct >>>>>> event_t)); >>>>>> + if(!event) { >>>>>> + /* If we can't reserve the space in the ring buffer, return 1. >>>>>> */ >>>>> >>>>> See comments at the end regarding __sync_fetch_and_add(). >>>>> >>>>>> + return 1; >>>>>> + } >>>>>> + >>>>>> + event->probe = FLOW_RESULT; >>>>>> + event->ts = bpf_ktime_get_ns(); >>>>>> + bpf_probe_read(&event->ufid, sizeof ufid, &ufid); >>>>>> + bpf_usdt_readarg(1, ctx, &event->reason); >>>>>> + events.ringbuf_submit(event, 0); >>>>>> + >>>>>> + return 0; >>>>>> +}; >>>>>> + >>>>>> + >>>>>> +int usdt__op_flow_put(struct pt_regs *ctx) { >>>>>> + struct dpif_flow_put put; >>>>>> + ovs_u128 ufid; >>>>>> + >>>>>> + struct event_t *event = events.ringbuf_reserve(sizeof(struct >>>>>> event_t)); >>>>>> + if(!event) { >>>>>> + /* If we can't reserve the space in the ring buffer, return 1. >>>>>> */ >>>>> >>>>> See comments at the end regarding __sync_fetch_and_add(). >>>>> >>>>>> + return 1; >>>>>> + } >>>>>> + >>>>>> + event->probe = OP_FLOW_PUT; >>>>>> + event->ts = bpf_ktime_get_ns(); >>>>>> + bpf_usdt_readarg_p(2, ctx, &put, sizeof put); >>>>>> + bpf_probe_read(&event->ufid, sizeof event->ufid, put.ufid_ptr); >>>>>> + bpf_probe_read(&ufid, sizeof ufid, &event->ufid); >>>>>> + if (put.key_len > MAX_KEY) { >>>>>> + put.key_len = MAX_KEY; >>>>>> + } >>>>>> + event->key_size = put.key_len; >>>>>> + bpf_probe_read(&event->key, put.key_len, put.key_ptr); >>>>>> + event->reason = 0; >>>>>> + events.ringbuf_submit(event, 0); >>>>>> + >>>>>> + watchlist.increment(ufid); >>>>>> + return 0; >>>>>> +}; >>>>>> +""" >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# buffer_size_type() >>>>>> +# >>>>>> +def buffer_size_type(astr, min=64, max=2048): >>>>>> + value = int(astr) >>>>>> + if min <= value <= max: >>>>>> + return value >>>>>> + else: >>>>>> + raise argparse.ArgumentTypeError( >>>>>> + 'value not in range {}-{}'.format(min, max)) >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# format_ufid() >>>>>> +# >>>>>> +def format_ufid(ufid): >>>>>> + if ufid is None: >>>>>> + return "ufid:none" >>>>>> + >>>>>> + return "ufid:{:08x}-{:04x}-{:04x}-{:04x}-{:04x}{:08x}".format( >>>>>> + ufid[0], ufid[1] >> 16, ufid[1] & 0xffff, >>>>>> + ufid[2] >> 16, ufid[2] & 0, ufid[3]) >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# find_and_delete_from_watchlist() >>>>>> +# >>>>>> +def find_and_delete_from_watchlist(event): >>>>>> + for k, _ in b["watchlist"].items(): >>>>>> + key_ufid = struct.unpack("=IIII", k) >>>>>> + if key_ufid == tuple(event.ufid): >>>>>> + key = (b["watchlist"].Key * 1)(k) >>>>>> + b["watchlist"].items_delete_batch(key) >>>>>> + break >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# handle_flow_put() >>>>>> +# >>>>>> +def handle_flow_put(event): >>>>>> + if args.flow_keys or args.filter_flows is not None: >>>>>> + key = decode_key(bytes(event.key)[:event.key_size]) >>>>>> + flow_dict, flow_str = parse_flow_dict(key) >>>>>> + # For each attribute that we're watching. >>>>>> + if args.filter_flows is not None: >>>>>> + if not compare_flow_to_target(args.filter_flows, flow_dict): >>>>>> + find_and_delete_from_watchlist(event) >>>>>> + return >>>>>> + >>>>>> + print("{:<18.9f} {:<45} {:<13}".format(event.ts / 1000000000, >>>>>> + format_ufid(event.ufid), "Insert (put) flow to kernel.")) >>>>> >>>>> Maybe change this to “Insert (put) flow to kernel module.” to valid >>>>> missing tc flow put? >>>> >>>> Ack. >>>> >>>>>> + >>>>>> + if args.flow_keys: >>>>>> + if len(flow_str) > 80:< >>>>>> + flow_str = " " + "),\n ".join(flow_str.split("), “))<< >>>>>> + else: >>>>>> + flow_str = " " + flow_str >>>>>> + print(" - It holds the following flow information:") >>>>> >>>>> This is confusing as, it’s not the flow information, i.e. flow installed, >>>>> but the keys from the packet. >>>> >>>> Agreed. >>>> >>>>>> + print(flow_str) >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# compare_flow_to_target() >>>>>> +# >>>>>> +def compare_flow_to_target(target, flow): >>>>>> + for key in target: >>>>>> + if key not in flow: >>>>>> + return False >>>>>> + elif target[key] is True: >>>>>> + continue >>>>>> + elif target[key] == flow[key]: >>>>>> + continue >>>>>> + elif isinstance(target[key], dict) and isinstance(flow[key], >>>>>> dict): >>>>>> + return compare_flow_to_target(target[key], flow[key]) >>>>>> + else: >>>>>> + return False >>>>>> + return True >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# parse_flow_str() >>>>>> +# >>>>>> +def parse_flow_str(flow_str): >>>>>> + f_list = [i.strip(", ") for i in flow_str.split(")")] >>>>>> + if f_list[-1] == "": >>>>>> + f_list = f_list[:-1] >>>>>> + flow_dict = {} >>>>>> + for e in f_list: >>>>>> + split_list = e.split("(") >>>>>> + k = split_list[0] >>>>>> + if len(split_list) == 1: >>>>>> + flow_dict[k] = True >>>>>> + elif split_list[1].count("=") == 0: >>>>>> + flow_dict[k] = split_list[1] >>>>>> + else: >>>>>> + sub_dict = {} >>>>>> + sublist = [i.strip() for i in split_list[1].split(",")] >>>>>> + for subkey in sublist: >>>>>> + brk = subkey.find("=") >>>>>> + sub_dict[subkey[:brk]] = subkey[brk + 1:] >>>>>> + flow_dict[k] = sub_dict >>>>>> + return flow_dict >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# print_expiration() >>>>>> +# >>>>>> +def print_expiration(event): >>>>>> + reasons = ["Unknown flow expiration reason!", "Flow timed out", >>>>>> + "Flow revalidation too expensive", >>>>>> + "Flow needs narrower wildcard mask", >>>>>> + "Bad ODP flow fit", "Flow with associated ofproto", >>>>>> + "Flow translation error", "Flow cache avoidance", >>>>>> + "Kill them all signal"] >>>>> >>>>> Should we maybe define this with something like this: >>>>> >>>>> Event = IntEnum("flow_del_reason", ["FDR_FLOW_LIVE", >>>>> "FDR_FLOW_TIME_OUT", >>>>> ...], start=0) >>>>> >>>>> If we do this, we can also use flow_del_reason.FDR_FLOW_LIVE below. >>>> >>>> I wrote a bit below, but I was wondering if there's really a better way >>>> to do this like extracting the details from the code itself. But for >>>> now, I can hard code something in there like is done in the other >>>> revalidator script. >>> >>> Dont think we had scripts relying on OVS enums before. Not sure if pahole >>> can extract this also. >> >> I guess I didn't read the other scripts well enough. Yes, seems there's >> no good way to get this details. >> >>>>>> + ufid_str = format_ufid(event.ufid) >>>>>> + reason = event.reason >>>>>> + >>>>>> + if reason not in range(0, len(reasons) - 1): >>>>>> + reason = 0 >>>>>> + print("{:<18.9f} {:<45} {:<17}". >>>>>> + format(event.ts / 1000000000, ufid_str, reasons[reason])) >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# decode_key() >>>>>> +# >>>>>> +def decode_key(msg): >>>>>> + bytes_left = len(msg) >>>>>> + result = {} >>>>>> + while bytes_left: >>>>>> + if bytes_left < 4: >>>>>> + break >>>>>> + nla_len, nla_type = struct.unpack("=HH", msg[:4]) >>>>>> + if nla_len < 4: >>>>>> + break >>>>>> + nla_data = msg[4:nla_len] >>>>>> + trunc = False >>>>>> + if nla_len > bytes_left: >>>>>> + trunc = True >>>>>> + nla_data = nla_data[:(bytes_left - 4)] >>>>> >>>>> Can we just not break out of this right away without doing >>>>> the two above lines? >>>> >>>> I'll double check - I think I can rewrite this section a bit. >>>> >>>>>> + else: >>>>>> + result[get_ovs_key_attr_str(nla_type)] = nla_data >>>>>> + if trunc: >>>>>> + break >>>>>> + next_offset = (nla_len + 3) & (~3) >>>>>> + msg = msg[next_offset:] >>>>>> + bytes_left -= next_offset >>>>> >>>>> if bytes_left: >>>>> “Can we report that our buffer was truncated?!” >>>>> >>>>> Not sure how to do this, but with 64 bytes being the default the -k >>>>> option only showed in_port() which took me a while to figure out. Maybe >>>>> 128 would be better when -k is configured? >>>> >>>> Good idea. Actually, I don't know if 64 bytes would ever really make >>>> sense anyway because it doesn't allow much to include. >>> >>> Agreed, I think 128 sounds like a good middle ground, however, it will not >>> decode ARP messages (have not tried ipv6). Maybe 64 is good enough if -k/-f >>> is not supplied (I guess we can even set it to 0 without -k or -f). >> >> Exactly. It's an option to trim on how large the event size is, and >> that means trimming on buffer space for the key data. >> >>>>>> + return result >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# get_ovs_key_attr_str() >>>>>> +# >>>>>> +def get_ovs_key_attr_str(attr): >>>>>> + ovs_key_attr = ["OVS_KEY_ATTR_UNSPEC", >>>>>> + "encap", >>>>>> + "skb_priority", >>>>>> + "in_port", >>>>>> + "eth", >>>>>> + "vlan", >>>>>> + "eth_type", >>>>>> + "ipv4", >>>>>> + "ipv6", >>>>>> + "tcp", >>>>>> + "udp", >>>>>> + "icmp", >>>>>> + "icmpv6", >>>>>> + "arp", >>>>>> + "nd", >>>>>> + "skb_mark", >>>>>> + "tunnel", >>>>>> + "sctp", >>>>>> + "tcp_flags", >>>>>> + "dp_hash", >>>>>> + "recirc_id", >>>>>> + "mpls", >>>>>> + "ct_state", >>>>>> + "ct_zone", >>>>>> + "ct_mark", >>>>>> + "ct_label", >>>>>> + "ct_tuple4", >>>>>> + "ct_tuple6", >>>>>> + "nsh"] >>>>>> + >>>>>> + if attr < 0 or attr > len(ovs_key_attr): >>>>>> + return "<UNKNOWN>: {}".format(attr) >>>>>> + return ovs_key_attr[attr] >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# is_nonzero() >>>>>> +# >>>>>> +def is_nonzero(val): >>>>>> + if isinstance(val, int): >>>>>> + return (val != 0) >>>>>> + >>>>>> + if isinstance(val, str): >>>>>> + val = bytes(val, "utf-8") >>>>>> + >>>>>> + # If it's not a string or an int, it's bytes. >>>>>> + return (val.count(0) < len(val)) >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# parse_flow_dict() >>>>>> +# >>>>>> +def parse_flow_dict(key_dict, decode=True): >>>>>> + ret_str = "" >>>>>> + parseable = {} >>>>>> + skip = ["nsh", "tunnel", "mpls", "vlan"] >>>>>> + need_byte_swap = ["ct_label"] >>>>>> + ipv4addrs = ["ct_tuple4", "tunnel", "ipv4", "arp"] >>>>>> + ipv6addrs = ["ipv6", "nd", "ct_tuple6"] >>>>>> + macs = {"eth": [0, 1], "arp": [3, 4], "nd": [1, 2]} >>>>>> + fields = [("OVS_KEY_ATTR_UNSPEC"), >>>>>> + ("encap", ), >>>>>> + ("skb_priority", "<I"), >>>>>> + ("in_port", "<I"), >>>>>> + ("eth", "!6s6s", "src", "dst"), >>>>>> + ("vlan", ), >>>>>> + ("eth_type", "!H"), >>>>>> + ("ipv4", "!4s4s4B", "src", "dst", "proto", "tos", "ttl", >>>>>> "frag"), >>>>>> + ("ipv6", "!16s16s4s4B", "src", "dst", >>>>>> + "label", "proto", "tclass", "hlimit", "frag"), >>>>>> + ("tcp", "!2H", "src", "dst"), >>>>>> + ("udp", "!2H", "src", "dst"), >>>>>> + ("icmp", "!2B", "type", "code"), >>>>>> + ("icmpv6", "!2B", "type", "code"), >>>>>> + ("arp", "!4s4sH6s6s", "sip", "tip", "op", "sha", "tha"), >>>>>> + ("nd", "!16s6s6s", "target", "sll", "tll"), >>>>>> + ("skb_mark", "<I"), >>>>>> + ("tunnel", ), >>>>>> + ("sctp", "!2H", "src", "dst"), >>>>>> + ("tcp_flags", "!H"), >>>>>> + ("dp_hash", "<I"), >>>>>> + ("recirc_id", "<I"), >>>>>> + ("mpls", ), >>>>>> + ("ct_state", "<I"), >>>>>> + ("ct_zone", "<H"), >>>>>> + ("ct_mark", "<I"), >>>>>> + ("ct_label", "!16s"), >>>>>> + ("ct_tuple4", >>>>>> + "!4s4s2HB", "src", "dst", "tp_src", "tp_dst", "proto"), >>>>>> + ("ct_tuple6", >>>>>> + "!16s16sB2H", "src", "dst", "proto", "tp_src", "tp_dst"), >>>>>> + ("nsh", )] >>>>>> + for k, v in key_dict.items(): >>>>>> + s = "" >>>>>> + if k in skip: >>>>>> + continue >>>>>> + if decode and int.from_bytes(v, "big") == 0: >>>>>> + parseable[k] = "0" >>>>>> + continue >>>>>> + if decode and k in need_byte_swap: >>>>>> + v = int.from_bytes(v, "little").to_bytes(len(v), "big") >>>>>> + attr = -1 >>>>>> + found = False >>>>>> + for f in fields: >>>>>> + if k == f[0]: >>>>>> + attr = fields.index(f) >>>>>> + found = True >>>>>> + break >>>>>> + if not found: >>>>>> + raise KeyError("Invalid flow field '%s'" % k) >>>>>> + if decode and len(fields[attr]) > 1: >>>>>> + data = list(struct.unpack(fields[attr][1], >>>>>> + v[:struct.calcsize(fields[attr][1])])) >>>>>> + if k in ipv4addrs: >>>>>> + if data[0].count(0) < 4: >>>>>> + data[0] = str(IPv4Address(data[0])) >>>>>> + else: >>>>>> + data[0] = b"\x00" >>>>>> + if data[1].count(0) < 4: >>>>>> + data[1] = str(IPv4Address(data[1])) >>>>>> + else: >>>>>> + data[1] = b"\x00" >>>>>> + if k in ipv6addrs: >>>>>> + if data[0].count(0) < 16: >>>>>> + data[0] = str(IPv6Address(data[0])) >>>>>> + else: >>>>>> + data[0] = b"\x00" >>>>>> + if data[1].count(0) < len(data[1]): >>>>>> + data[1] = str(IPv6Address(data[1])) >>>>>> + else: >>>>>> + data[1] = b"\x00" >>>>>> + if k in macs.keys(): >>>>>> + for e in macs[k]: >>>>>> + if data[e].count(0) == 6: >>>>>> + mac_str = b"\x00" >>>>>> + else: >>>>>> + mac_str = ":".join(["%02x" % i for i in >>>>>> data[e]]) >>>>>> + data[e] = mac_str >>>>>> + if decode and len(fields[attr]) > 2: >>>>>> + field_dict = {field: d for field, d in >>>>>> zip(fields[attr][2:], data)} >>>>>> + s = ", ".join(k + "=" + str(v) for k, v in >>>>>> field_dict.items()) >>>>>> + elif decode and k != "eth_type": >>>>>> + s = str(data[0]) >>>>>> + field_dict = s >>>>>> + else: >>>>>> + if decode: >>>>>> + s = hex(data[0]) >>>>>> + field_dict = s >>>>>> + ret_str += k + "(" + s + "), " >>>>>> + parseable[k] = field_dict >>>>>> + ret_str = ret_str[:-2] >>>>>> + return (parseable, ret_str) >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# handle_event() >>>>>> +# >>>>>> +def handle_event(ctx, data, size): >>>>>> + # Once we grab the event, we have three cases. >>>>>> + # 1. It's a revalidator probe and the reason is nonzero: A flow is >>>>>> expiring >>>>>> + # 2. It's a revalidator probe and the reason is zero: flow >>>>>> revalidated >>>>>> + # 3. It's a flow_put probe. >>>>>> + # >>>>>> + # We will ignore case 2, and report all others. >>>>>> + # >>>>>> + event = b["events"].event(data) >>>>>> + if event.probe == 0: # OP_FLOW_PUT >>>>> >>>>> Here we should also define an enum for the probe events, see ‘Event = >>>>> IntEnum("Event”...’ and ‘<EVENT_ENUM>’ in reval_monitor.py >>>>> >>>>>> + handle_flow_put(event)< >>>>>> + elif event.probe == 1 and event.reason > 0: # FLOW_RESULT >>>>> >>>>> Here we could do “event.reason > flow_del_reason.FDR_FLOW_LIVE”, see >>>>> comment above. >>>> >>>> I can do the above, but I also am wondering if it's possible to have >>>> something we can use to fill up the enum dynamically without needing to >>>> duplicate things on the python side. >>> >>> That would be nice, maybe pahole already supports this. >>> >>>>>> + print_expiration(event) >>>>>> + >>>>>> + >>>>>> +def main(): >>>>>> + # >>>>>> + # Don't like these globals, but ctx passing does not work with the >>>>>> existing >>>>>> + # open_ring_buffer() API :( >>>>>> + # >>>>>> + global b >>>>>> + global args >>>>>> + >>>>>> + # >>>>>> + # Argument parsing >>>>>> + # >>>>>> + parser = argparse.ArgumentParser() >>>>>> + parser.add_argument("--buffer-page-count", >>>>>> + help="Number of BPF ring buffer pages, default >>>>>> 1024", >>>>>> + type=int, default=1024, metavar="NUMBER") >>>>>> + parser.add_argument("-f", "--flow-key-size", >>>>>> + help="Set maximum flow key size to capture, " >>>>>> + "default 64 - see notes", type=buffer_size_type, >>>>>> + default=64, metavar="[64-2048]") >>>>>> + parser.add_argument("-k", "--flow-keys", >>>>>> + help="Print flow keys as flow strings", >>>>>> + action="store_true") >>>>>> + parser.add_argument("-l", "--filter-flows", metavar="FLOW_STRING", >>>>>> + help="Filter flows that match the specified " >>>>>> + "ODP-like flow", >>>>>> + type=str, default=None, nargs="*") >>>>>> + parser.add_argument("-p", "--pid", metavar="VSWITCHD_PID", >>>>>> + help="ovs-vswitchd's PID", type=int, >>>>>> default=None) >>>>>> + parser.add_argument("-D", "--debug", help="Enable eBPF debugging", >>>>>> + type=int, const=0x3f, default=0, nargs="?") >>>>>> + args = parser.parse_args() >>>>>> + >>>>>> + # >>>>>> + # Find the PID of the ovs-vswitchd daemon if not specified. >>>>>> + # >>>>>> + if args.pid is None: >>>>>> + for proc in psutil.process_iter(): >>>>>> + if "ovs-vswitchd" in proc.name(): >>>>>> + if args.pid is not None: >>>>>> + print("Error: Multiple ovs-vswitchd daemons >>>>>> running, " >>>>>> + "use the -p option!") >>>>>> + sys.exit(-1) >>>>>> + >>>>>> + args.pid = proc.pid >>>>>> + # >>>>>> + # Error checking on input parameters >>>>>> + # >>>>>> + if args.pid is None: >>>>>> + print("ERROR: Failed to find ovs-vswitchd's PID!") >>>>>> + sys.exit(-1) >>>>>> + >>>>>> + # >>>>>> + # Attach the USDT probes >>>>>> + # >>>>>> + u = USDT(pid=int(args.pid)) >>>>>> + try: >>>>>> + u.enable_probe(probe="op_flow_put", fn_name="usdt__op_flow_put") >>>>>> + except USDTException as e: >>>>>> + print("Error attaching the dpif_netlink_operate__:op_flow_put >>>>>> probe.") >>>>>> + print(str(e)) >>>>>> + sys.exit(-1) >>>>>> + >>>>>> + try: >>>>>> + u.enable_probe(probe="flow_result", fn_name="usdt__flow_result") >>>>>> + except USDTException as e: >>>>>> + print("Error attaching the revalidate:flow_result probe.") >>>>>> + print(str(e)) >>>>>> + sys.exit(-1) >>>>>> + >>>>>> + # >>>>>> + # Attach the probes to the running process >>>>>> + # >>>>>> + source = bpf_src.replace("<BUFFER_PAGE_COUNT>", >>>>>> + str(args.buffer_page_count)) >>>>>> + >>>>>> + if args.filter_flows is None: >>>>>> + filter_bool = 0 >>>>>> + >>>>>> + # Set the key size based on what the user wanted >>>>>> + source = source.replace("<MAX_KEY_VAL>", >>>>>> str(args.flow_key_size)) >>>>>> + else: >>>>>> + filter_bool = 1 >>>>>> + args.filter_flows = parse_flow_str(args.filter_flows[0]) >>>>>> + >>>>>> + # Run through the parser to make sure we only filter on fields >>>>>> we >>>>>> + # understand >>>>>> + parse_flow_dict(args.filter_flows, False) >>>>>> + >>>>>> + # This is hardcoded here because it doesn't make sense to >>>>>> shrink the >>>>>> + # size, since the flow key might be missing fields that are >>>>>> matched in >>>>>> + # the flow filter. >>>>>> + source = source.replace("<MAX_KEY_VAL>", "2048") >>>>>> + >>>>>> + source = source.replace("<FILTER_BOOL>", str(filter_bool)) >>>>>> + >>>>>> + b = BPF(text=source, usdt_contexts=[u], debug=args.debug) >>>>>> + >>>>>> + # >>>>>> + # Print header >>>>>> + # >>>>>> + print("{:<18} {:<45} {:<17}".format("TIME", "UFID", "EVENT/REASON")) >>>>>> + >>>>>> + # >>>>>> + # Dump out all events. >>>>>> + # >>>>>> + b["events"].open_ring_buffer(handle_event) >>>>>> + while 1: >>>>>> + try: >>>>>> + b.ring_buffer_poll() >>>>>> + time.sleep(0.5) >>>>> >>>>> I think we can remove this sleep. >>>> >>>> I'll try without it. IIRC, the ring buffer polling was very aggressive >>>> on the CPU, but that is just a memory from mid-2022. >>> >>> I got the ‘remove’ comment from Adrian also a while back and did some tests >>> and I did not see any load increase on the Python application. But it might >>> be worth it for you to do the same, you never know where I screwed up ;) >> >> Okay, I'll retry it and drop if it causes an issue. >> >>>>>> + except KeyboardInterrupt: >>>>>> + break >>>>>> + >>>>>> + >>>>>> +# >>>>>> +# Start main() as the default entry point >>>>>> +#< >>>>>> +if __name__ == "__main__": >>>>>> + main() >>>>>> -- >>>>>> 2.41.0 >>>>> >>>>> Missing my previous comment on adding a check to make sure we do not lose >>>>> events: >>>>> >>>>> “ >>>>> Forgot to mention that you probably also want to add some checking to >>>>> make sure you do not lose events. >>>>> >>>>> See __sync_fetch_and_add() below: >>>>> >>>>> +BPF_TABLE("percpu_array", uint32_t, uint64_t, dropcnt, 1); >>>>> + >>>>> +static struct event_t *get_event(uint32_t id) { >>>>> + struct event_t *event = events.ringbuf_reserve(sizeof(struct >>>>> event_t)); >>>>> + >>>>> + if (!event) { >>>>> + uint32_t type = 0; >>>>> + uint64_t *value = dropcnt.lookup(&type); >>>>> + if (value) >>>>> + __sync_fetch_and_add(value, 1); >>>>> + >>>>> + return NULL; >>>>> + } >>>>> + >>>>> + event->id = id; >>>>> + event->ts = bpf_ktime_get_ns(); >>>>> + event->pid = bpf_get_current_pid_tgid(); >>>>> + >>>>> + return event; >>>>> +} >>>>> “ >>>>> >>>>> The other missing part is to include the PID/TID in the output so we can >>>>> relate to which revalidator thread did this (or add the comm with the >>>>> name). >>>> >>>> Okay. >>>> >>>>> And finally, the part that got this patch delayed, not adding static OVS >>>>> structure definitions. Which is still the case in this version. For now, >>>>> you should probably copy the get_ovs_definitions() implementation from >>>>> reval_monitor.py. >>>> >>>> Will do. _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev