Thanks for review.
Yes, you're right that currently 'netdev_push_header' always returns 0.
I think it happened while introducing 'netdev_has_tunnel_push_pop()'. But
even before that, the error code was only for the ENOTSUPP and the only
place, where 'netdev_push_header' actually could fail is 'OVS_NOT_REACHED()'
inside 'dp_packet_resize__()'. And it will abort there.
So, the only way to make a proper error handling is to allow at least
'dp_packet_resize__()' to fail and return proper error code. But this will
require, I guess, a big refactoring of the whole dp_packet module with
it's dependencies.
But, yes, I still think that things like 'OVS_NOT_REACHED()' for the
reachable code paths is a really bad thing. And we need to fix this.
About my patch:
The main idea was just to 'break' instead of 'return' in case where
'may_steal == true'. This triggers the 'dp_packet_batch_free()' placed
at the end of the function and prevents the packet leak.
There are no changes in error handling code, only refactoring to prevent
double free. Maybe I should make it more clear in a first non-RFC version.
Maybe something like this:
--------------------------------------------------------------------------
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index f86ed2a..9476a03 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -5672,12 +5672,23 @@ dp_execute_cb(void *aux_, struct dp_packet_batch
*packets_,
break;
case OVS_ACTION_ATTR_TUNNEL_PUSH:
- if (*depth < MAX_RECIRC_DEPTH) {
- dp_packet_batch_apply_cutlen(packets_);
- push_tnl_action(pmd, a, packets_);
- return;
+ /*
+ * XXX: 'may_steal' concept is broken here, because we're
+ * unconditionally changing the packets just like for other PUSH_*
+ * actions in 'odp_execute()'. 'false' value could be ignored,
+ * because we could reach here only after clone, but we still need
+ * to free the packets in case 'may_steal == true'.
+ */
+ if (may_steal) {
+ /* We're requested to push tunnel header, but also we need to take
+ * the ownership of these packets. So, we may not execute the
+ * action because the caller will not use the result anyway.
+ * Just break to free the batch. */
+ break;
}
- break;
+ dp_packet_batch_apply_cutlen(packets_);
+ push_tnl_action(pmd, a, packets_);
+ return;
case OVS_ACTION_ATTR_TUNNEL_POP:
if (*depth < MAX_RECIRC_DEPTH) {
--------------------------------------------------------------------------
Best regards, Ilya Maximets.
On 15.05.2018 09:43, Anju Thomas wrote:
> Hi Ilya,
> I had a look at the patch that you have submitted.
> In push_tnl_action, I can see that with the patch you are trying to return
> error in case we don’t find the tunnel port or return the error that
> netdev_push header returns. And we are deciding to delete the batch in the
> calling function dp_execute_cb in case of error. But netdev_push_header
> always returns 0. Actually the following code in push_tnl_action is dead
> code.
>
>
> if (!err) {
> return 0;
> }
>
> The actual issue is that the push_header function always return void . So any
> error happening during push_header is never captured or acted upon.
>
> I think the fix also be should be to add/capture error scenarios during tnl
> header push as well.
>
> Regards & Thanks
> Anju
>
> -----Original Message-----
> From: Ilya Maximets [mailto:[email protected]]
> Sent: Monday, May 14, 2018 6:28 PM
> To: [email protected]; Anju Thomas <[email protected]>; Ben
> Pfaff <[email protected]>
> Cc: Tiago Lam <[email protected]>
> Subject: Re: [ovs-dev] [PATCH v1] Fix crash due to multiple tnl push action
>
> Hello Anju, Ben,
>
> Looks like I fixed a leak reported here in my recent patch:
> https://mail.openvswitch.org/pipermail/ovs-dev/2018-May/346892.html
>
> Could you, please, take a look at it?
>
> I actually managed to reproduce the packet leak on tunnel-push-pop unit tests
> with valgrind:
>
> ==6445== 2,912 (2,208 direct, 704 indirect) bytes in 4 blocks are definitely
> lost in loss record 400 of 410
> ==6445== at 0x4C2DB8F: malloc (in
> /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==6445== by 0x516314: xmalloc (util.c:120)
> ==6445== by 0x466154: dp_packet_new (dp-packet.c:138)
> ==6445== by 0x466154: dp_packet_new_with_headroom (dp-packet.c:148)
> ==6445== by 0x4F6CFE: eth_from_hex (packets.c:498)
> ==6445== by 0x48EB43: eth_from_packet (netdev-dummy.c:1450)
> ==6445== by 0x48EB43: netdev_dummy_receive (netdev-dummy.c:1562)
> ==6445== by 0x515980: process_command (unixctl.c:313)
> ==6445== by 0x515980: run_connection (unixctl.c:347)
> ==6445== by 0x515980: unixctl_server_run (unixctl.c:398)
> ==6445== by 0x406F1E: main (ovs-vswitchd.c:120)
>
> Above patch fixes it.
>
> Best regards, Ilya Maximets.
>
>> Hi Ben,
>>
>> I will work on these changes as well.
>>
>> Regards
>> Anju
>>
>> -----Original Message-----
>> From: Ben Pfaff [mailto:blp at ovn.org]
>> Sent: Friday, May 11, 2018 2:00 AM
>> To: Anju Thomas <anju.thomas at ericsson.com>
>> Cc: dev at openvswitch.org
>> Subject: Re: [ovs-dev] [PATCH v1] Fix crash due to multiple tnl push
>> action
>>
>> On Thu, May 10, 2018 at 08:51:03AM +0000, Anju Thomas wrote:
>>> It looks like your commit message describes at least two other bugs
>>> in OVS, though. First, if OVS crashes when it pushes tunnel headers,
>>> even if there's not enough headroom, that's really bad. At worst, it
>>> should drop the packet. Do you know where the crash occurs? We
>>> should fix the problem, since it might recur in some other context.
>>>
>>> [Anju] OVS was actually crashing in the dpdk datapath. The crash is a
>>> manual assert in our case.
>>> The rootcause is that dp receives the actions after the upcall (say with
>>> >=3 tunnel pushes ) . Now as part of action processing , since it is a
>>> tunnel push action , we try to find space in the dpdk mbuf packet headroom
>>> (which Is 128 bytes). By the time we try to process the third tunnel push ,
>>> there is no headroom left since each tunnel header is of 50 bytes (50 *3 >
>>> 128 bytes headroom). Hence it manually asserts . This assert is to catch
>>> any unexpected code flow . Do you think that in this case we should still
>>> go ahead and prevent the crash ?
>>
>> I don't understand why it's OK to crash in this case. Why do you think so?
>>
>>> Second, it's a little shocking to hear that an encapsulation action without
>>> a following output action causes a memory leak. We also need to fix that.
>>> Do you have any more details?
>>> [Anju] Now as explained above, the crash happens because we run out of
>>> headroom. But in case we have say 2 or less than 2 tunnel pushes we will
>>> have a mem leak as packet is never freed because the tnl push is the dp
>>> last action and there is no other output action or any other action like
>>> recirc that may translate to output action in the end leading to packet
>>> buffer not being freed.
>>> Are you proposing that we have some sort of preventive fix in the dp to
>>> handle an incorrect action list from the upcall handling?
>>
>> Yes. It's unacceptable to leak memory because there's a packet modification
>> without an output action. The kernel datapath never does this, for example.
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev