Thanks for review. Yes, you're right that currently 'netdev_push_header' always returns 0. I think it happened while introducing 'netdev_has_tunnel_push_pop()'. But even before that, the error code was only for the ENOTSUPP and the only place, where 'netdev_push_header' actually could fail is 'OVS_NOT_REACHED()' inside 'dp_packet_resize__()'. And it will abort there.
So, the only way to make a proper error handling is to allow at least 'dp_packet_resize__()' to fail and return proper error code. But this will require, I guess, a big refactoring of the whole dp_packet module with it's dependencies. But, yes, I still think that things like 'OVS_NOT_REACHED()' for the reachable code paths is a really bad thing. And we need to fix this. About my patch: The main idea was just to 'break' instead of 'return' in case where 'may_steal == true'. This triggers the 'dp_packet_batch_free()' placed at the end of the function and prevents the packet leak. There are no changes in error handling code, only refactoring to prevent double free. Maybe I should make it more clear in a first non-RFC version. Maybe something like this: -------------------------------------------------------------------------- diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index f86ed2a..9476a03 100644 --- a/lib/dpif-netdev.c +++ b/lib/dpif-netdev.c @@ -5672,12 +5672,23 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_, break; case OVS_ACTION_ATTR_TUNNEL_PUSH: - if (*depth < MAX_RECIRC_DEPTH) { - dp_packet_batch_apply_cutlen(packets_); - push_tnl_action(pmd, a, packets_); - return; + /* + * XXX: 'may_steal' concept is broken here, because we're + * unconditionally changing the packets just like for other PUSH_* + * actions in 'odp_execute()'. 'false' value could be ignored, + * because we could reach here only after clone, but we still need + * to free the packets in case 'may_steal == true'. + */ + if (may_steal) { + /* We're requested to push tunnel header, but also we need to take + * the ownership of these packets. So, we may not execute the + * action because the caller will not use the result anyway. + * Just break to free the batch. */ + break; } - break; + dp_packet_batch_apply_cutlen(packets_); + push_tnl_action(pmd, a, packets_); + return; case OVS_ACTION_ATTR_TUNNEL_POP: if (*depth < MAX_RECIRC_DEPTH) { -------------------------------------------------------------------------- Best regards, Ilya Maximets. On 15.05.2018 09:43, Anju Thomas wrote: > Hi Ilya, > I had a look at the patch that you have submitted. > In push_tnl_action, I can see that with the patch you are trying to return > error in case we don’t find the tunnel port or return the error that > netdev_push header returns. And we are deciding to delete the batch in the > calling function dp_execute_cb in case of error. But netdev_push_header > always returns 0. Actually the following code in push_tnl_action is dead > code. > > > if (!err) { > return 0; > } > > The actual issue is that the push_header function always return void . So any > error happening during push_header is never captured or acted upon. > > I think the fix also be should be to add/capture error scenarios during tnl > header push as well. > > Regards & Thanks > Anju > > -----Original Message----- > From: Ilya Maximets [mailto:i.maxim...@samsung.com] > Sent: Monday, May 14, 2018 6:28 PM > To: ovs-dev@openvswitch.org; Anju Thomas <anju.tho...@ericsson.com>; Ben > Pfaff <b...@ovn.org> > Cc: Tiago Lam <tiago....@intel.com> > Subject: Re: [ovs-dev] [PATCH v1] Fix crash due to multiple tnl push action > > Hello Anju, Ben, > > Looks like I fixed a leak reported here in my recent patch: > https://mail.openvswitch.org/pipermail/ovs-dev/2018-May/346892.html > > Could you, please, take a look at it? > > I actually managed to reproduce the packet leak on tunnel-push-pop unit tests > with valgrind: > > ==6445== 2,912 (2,208 direct, 704 indirect) bytes in 4 blocks are definitely > lost in loss record 400 of 410 > ==6445== at 0x4C2DB8F: malloc (in > /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) > ==6445== by 0x516314: xmalloc (util.c:120) > ==6445== by 0x466154: dp_packet_new (dp-packet.c:138) > ==6445== by 0x466154: dp_packet_new_with_headroom (dp-packet.c:148) > ==6445== by 0x4F6CFE: eth_from_hex (packets.c:498) > ==6445== by 0x48EB43: eth_from_packet (netdev-dummy.c:1450) > ==6445== by 0x48EB43: netdev_dummy_receive (netdev-dummy.c:1562) > ==6445== by 0x515980: process_command (unixctl.c:313) > ==6445== by 0x515980: run_connection (unixctl.c:347) > ==6445== by 0x515980: unixctl_server_run (unixctl.c:398) > ==6445== by 0x406F1E: main (ovs-vswitchd.c:120) > > Above patch fixes it. > > Best regards, Ilya Maximets. > >> Hi Ben, >> >> I will work on these changes as well. >> >> Regards >> Anju >> >> -----Original Message----- >> From: Ben Pfaff [mailto:blp at ovn.org] >> Sent: Friday, May 11, 2018 2:00 AM >> To: Anju Thomas <anju.thomas at ericsson.com> >> Cc: dev at openvswitch.org >> Subject: Re: [ovs-dev] [PATCH v1] Fix crash due to multiple tnl push >> action >> >> On Thu, May 10, 2018 at 08:51:03AM +0000, Anju Thomas wrote: >>> It looks like your commit message describes at least two other bugs >>> in OVS, though. First, if OVS crashes when it pushes tunnel headers, >>> even if there's not enough headroom, that's really bad. At worst, it >>> should drop the packet. Do you know where the crash occurs? We >>> should fix the problem, since it might recur in some other context. >>> >>> [Anju] OVS was actually crashing in the dpdk datapath. The crash is a >>> manual assert in our case. >>> The rootcause is that dp receives the actions after the upcall (say with >>> >=3 tunnel pushes ) . Now as part of action processing , since it is a >>> tunnel push action , we try to find space in the dpdk mbuf packet headroom >>> (which Is 128 bytes). By the time we try to process the third tunnel push , >>> there is no headroom left since each tunnel header is of 50 bytes (50 *3 > >>> 128 bytes headroom). Hence it manually asserts . This assert is to catch >>> any unexpected code flow . Do you think that in this case we should still >>> go ahead and prevent the crash ? >> >> I don't understand why it's OK to crash in this case. Why do you think so? >> >>> Second, it's a little shocking to hear that an encapsulation action without >>> a following output action causes a memory leak. We also need to fix that. >>> Do you have any more details? >>> [Anju] Now as explained above, the crash happens because we run out of >>> headroom. But in case we have say 2 or less than 2 tunnel pushes we will >>> have a mem leak as packet is never freed because the tnl push is the dp >>> last action and there is no other output action or any other action like >>> recirc that may translate to output action in the end leading to packet >>> buffer not being freed. >>> Are you proposing that we have some sort of preventive fix in the dp to >>> handle an incorrect action list from the upcall handling? >> >> Yes. It's unacceptable to leak memory because there's a packet modification >> without an output action. The kernel datapath never does this, for example. > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev