Hi, folks




Per GSO requirement, this is a must-have change, Jiayu, can you help review 
this series?


Olivier, if you used the old interface, maybe you need to change your code to 
adapt this, I don't think we can have a better way to handle GSO case.








At 2020-08-04 09:31:19, "yang_y_yi" <yang_y...@163.com> wrote:

At 2020-08-03 20:34:25, "Olivier Matz" <olivier.m...@6wind.com> wrote:
>On Mon, Aug 03, 2020 at 05:42:13PM +0800, yang_y_yi wrote:
>> At 2020-08-03 16:11:39, "Olivier Matz" <olivier.m...@6wind.com> wrote:
>> >On Mon, Aug 03, 2020 at 09:26:40AM +0800, yang_y_yi wrote:
>> >> At 2020-08-03 04:29:07, "Olivier Matz" <olivier.m...@6wind.com> wrote:
>> >> >Hi,
>> >> >
>> >> >On Sun, Aug 02, 2020 at 07:12:36AM +0800, yang_y_yi wrote:
>> >> >> 
>> >> >> 
>> >> >> At 2020-07-31 23:15:43, "Olivier Matz" <olivier.m...@6wind.com> wrote:
>> >> >> >Hi,
>> >> >> >
>> >> >> >On Thu, Jul 30, 2020 at 08:08:59PM +0800, yang_y...@163.com wrote:
>> >> >> >> From: Yi Yang <yangy...@inspur.com>
>> >> >> >> 
>> >> >> >> In GSO case, segmented mbufs are attached to original
>> >> >> >> mbuf which can't be freed when it is external. The issue
>> >> >> >> is free_cb doesn't know original mbuf and doesn't free
>> >> >> >> it when refcnt of shinfo is 0.
>> >> >> >> 
>> >> >> >> Original mbuf can be freed by rte_pktmbuf_free segmented
>> >> >> >> mbufs or by rte_pktmbuf_free original mbuf. Two kind of
>> >> >> >> cases should have different behaviors. free_cb won't
>> >> >> >> explicitly call rte_pktmbuf_free to free original mbuf
>> >> >> >> if it is freed by rte_pktmbuf_free original mbuf, but it
>> >> >> >> has to call rte_pktmbuf_free to free original mbuf if it
>> >> >> >> is freed by rte_pktmbuf_free segmented mbufs.
>> >> >> >> 
>> >> >> >> In order to fix this issue, free_cb interface has to been
>> >> >> >> changed, __rte_pktmbuf_free_extbuf must deliver called
>> >> >> >> mbuf pointer to free_cb, argument opaque can be defined
>> >> >> >> as a custom struct by user, it can includes original mbuf
>> >> >> >> pointer, user-defined free_cb can compare caller mbuf with
>> >> >> >> mbuf in opaque struct, free_cb should free original mbuf
>> >> >> >> if they are not same, this corresponds to rte_pktmbuf_free
>> >> >> >> segmented mbufs case, otherwise, free_cb won't free original
>> >> >> >> mbuf because the caller explicitly called rte_pktmbuf_free
>> >> >> >> to free it.
>> >> >> >> 
>> >> >> >> Here is pseduo code to show two kind of cases.
>> >> >> >> 
>> >> >> >> case 1. rte_pktmbuf_free segmented mbufs
>> >> >> >> 
>> >> >> >> nb_tx = rte_gso_segment(original_mbuf, /* original mbuf */
>> >> >> >>                       &gso_ctx,
>> >> >> >>                       /* segmented mbuf */
>> >> >> >>                       (struct rte_mbuf **)&gso_mbufs,
>> >> >> >>                       MAX_GSO_MBUFS);
>> >> >> >
>> >> >> >I'm sorry but it is not very clear to me what operations are done by
>> >> >> >rte_gso_segment().
>> >> >> >
>> >> >> >In the current code, I only see calls to rte_pktmbuf_attach(),
>> >> >> >which do not deal with external buffers. Am I missing something?
>> >> >> >
>> >> >> >Are you able to show the issue only with mbuf functions? It would
>> >> >> >be helpful to understand what does not work.
>> >> >> >
>> >> >> >
>> >> >> >Thanks,
>> >> >> >Olivier
>> >> >> >
>> >> >> Oliver, thank you for comment, let me show you why it doesn't work for 
>> >> >> my use case.  In OVS DPDK, VM uses vhostuserclient to send large 
>> >> >> packets whose size is about 64K because we enabled TSO & UFO, these 
>> >> >> large packets use rte_mbufs allocated by DPDK virtio_net function 
>> >> >> virtio_dev_pktmbuf_alloc() (in file lib/librte_vhost/virtio_net.c. 
>> >> >> Please refer to [PATCH V1 3/3], I changed free_cb as below, these 
>> >> >> packets use the same allocate function and the same free_cb no matter 
>> >> >> they are TCP packet or UDP packets, in case of VXLAN TSO, most NICs 
>> >> >> can't support inner UDP fragment offload, so OVS DPDK has to do it by 
>> >> >> software, for UDP case, the original rte_mbuf only can be freed by 
>> >> >> segmented rte_mbufs which are output packets of rte_gso_segment, i.e. 
>> >> >> the original rte_mbuf only can freed by free_cb, you can see, it 
>> >> >> explicitly called rte_pktmbuf_free(arg->mbuf), the condition statement 
>> >> >> "if (caller_m != arg->mbuf)" is true for this case, this has no 
>> >> >> problem, but for TCP case, the original mbuf is delivered to 
>> >> >> rte_eth_tx_burst() but not segmented rte_mbufs output by 
>> >> >> rte_gso_segment, PMD driver will call 
>> >> >> rte_pktmbuf_free(original_rte_mbuf) but not 
>> >> >> rte_pktmbuf_free(segmented_rte_mbufs), the same free_cb will be 
>> >> >> called, that means original_rte_mbuf will be freed twice, you know 
>> >> >> what will happen, this is just the issue I'm fixing. I bring in 
>> >> >> caller_m argument, it can help work around this because caller_m is 
>> >> >> arg->mbuf and the condition statement "if (caller_m != arg->mbuf)" is 
>> >> >> false, you can't fix it without the change this patch series did.
>> >> >
>> >> >I'm sill not sure to get your issue. Please, if you have a simple test
>> >> >case using only mbufs functions (without virtio, gso, ...), it would be
>> >> >very helpful because we will be sure that we are talking about the same
>> >> >thing. In case there is an issue, it can easily become a unit test.
>> >> 
>> >> Oliver, I think you don't get the point, free operation can't be 
>> >> controlled by the application itself, 
>> >> it is done by PMD driver and triggered by rte_eth_tx_burst, I have shown 
>> >> pseudo code,
>> >> rte_gso_segment just segments a large mbuf to multiple mbufs, it won't 
>> >> send them, the application
>> >> will call rte_eth_tx_burst to send them finally.
>> >>
>> >> >
>> >> >That said, I looked at vhost mbuf allocation and gso segmentation, and
>> >> >I found some strange things:
>> >> >
>> >> >1/ In virtio_dev_extbuf_alloc(), and I there are 2 paths to create the
>> >> >   ext mbuf.
>> >> >
>> >> >   a/ The first one stores the shinfo struct in the mbuf, basically
>> >> >      like this:
>> >> >
>> >> > pkt = rte_pktmbuf_alloc(mp);
>> >> > shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
>> >> > buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >> > shinfo->free_cb = virtio_dev_extbuf_free;
>> >> > shinfo->fcb_opaque = buf;
>> >> > rte_mbuf_ext_refcnt_set(shinfo, 1);
>> >> >
>> >> >      I don't think it is a good idea, because there is no guarantee that
>> >> >      the mbuf won't be freed before the buffer. For instance, doing
>> >> >      this will probably fail:
>> >> >
>> >> > pkt2 = rte_pktmbuf_alloc(mp);
>> >> > rte_pktmbuf_attach(pkt2, pkt);
>> >> > rte_pktmbuf_free(pkt);  /* pkt is freed, but it contains shinfo ! */
>> >> 
>> >> pkt is created by the application I can control, so I can control it 
>> >> where it will be freed, right?
>> >
>> >This example shows that mbufs allocated like this by the vhost
>> >driver are not constructed correctly. If an application attach a new
>> >packet (pkt2) to it and frees the original one (pkt), it may result in a
>> >memory corruption.
>> >
>> >Of course, to be tested and confirmed.
>> 
>> No, attach will increase refcnt of shinfo, free_cb only is called when  
>> refcnt of shinfo is decreased to
>> 0, isn't it?
>
>When pkt will be freed, it will decrement the shinfo refcnt, and
>after it will be 1. So the buffer won't be freed. After that, the
>mbuf pkt will be detached, and will return to the mbuf pool. It means
>it can be reallocated, and the next user can overwrite shinfo which
>is still stored in the mbuf data.

I think this is an issue of DPDK itself, if external buffer in shinfo is freed, 
shinfo should be set to NULL, if user will
overwrite it, he/she should use the same way as a new external buffer is 
attached. 

>
>I did a test to show it, see:
>http://git.droids-corp.org/?p=dpdk.git;a=commitdiff;h=a617494eeb01ff
>
>If you run the mbuf autotest, it segfaults.

I think your test is wrong, you're changing shinfo (which is being used) in 
wrong way, if free_bc is NULL, it will be invalid.

static inline void
rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
        rte_iova_t buf_iova, uint16_t buf_len,
        struct rte_mbuf_ext_shared_info *shinfo)
{
        /* mbuf should not be read-only */
        RTE_ASSERT(RTE_MBUF_DIRECT(m) && rte_mbuf_refcnt_read(m) == 1);
        RTE_ASSERT(shinfo->free_cb != NULL);

For any shinfo operation, you should do it by rte_pktmbuf_attach_extbuf, you 
can't change it at will after that.

>
>
>> 
>> >
>> >> 
>> >> >
>> >> >      To do this properly, the mbuf refcnt should be increased, and
>> >> >      the mbuf should be freed in the callback. But I don't think it's
>> >> >      worth doing it, given the second path (b/) looks good to me.
>> >> >
>> >> >   b/ The second path stores the shinfo struct at the end of the
>> >> >      allocated buffer, like this:
>> >> >
>> >> > pkt = rte_pktmbuf_alloc(mp);
>> >> > buf_len += sizeof(*shinfo) + sizeof(uintptr_t);
>> >> > buf_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
>> >> > buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
>> >> > shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
>> >> >                                       virtio_dev_extbuf_free, buf);
>> >> >
>> >> >      I think this is correct, because we have the guarantee that shinfo
>> >> >      exists as long as the buffer exists.
>> >> 
>> >> What buffer does the allocated buffer you're saying here? The issue we're 
>> >> discussing how we can
>> >> free original mbuf which owns shinfo buffer.
>> >
>> >I don't get your question.
>> >
>> >I'm just saying that this code path looks correct, compared to
>> >the previous one.
>> 
>> I think you're challenging principle of external mbuf, that isn't the thing 
>> I address.
>
>I'm not challenging anything, I'm saying there is a bug in this code,
>and the unit test above tends to confirm it.

If it is bug, you can post a patch to fix it,  it isn't related with my 
patches. But in my opinion, you don't
use it in correct way, I don't think it is a bug.

>
>
>> 
>> >
>> >> 
>> >> >
>> >> >2/ in rte_gso_segment(), there is a loop like this:
>> >> >
>> >> > while (pkt_seg) {
>> >> >         rte_mbuf_refcnt_update(pkt_seg, -1);
>> >> >         pkt_seg = pkt_seg->next;
>> >> > }
>> >> >
>> >> >   You change it to take in account the refcnt for ext mbufs.
>> >> >
>> >> >   I may have missed something but I wonder why it is not simply:
>> >> >
>> >> > rte_pktmbuf_free(pkt_seg);
>> >> >
>> >> >   It will decrease the proper refcnt, and free the mbufs if they
>> >> >   are not used.
>> >> 
>> >> Again, rte_gso_segment just decreases refcnt by one, this will ensure the 
>> >> last segmented 
>> >> mbuf free will trigger freeing original mbuf (only free_cb can do this).
>> >
>> >rte_pktmbuf_free() will also decerase the refcnt, and free the resources
>> >when the refcnt reaches 0.
>> >
>> >It has some advantages compared to decrease the reference counter of all
>> >segments:
>> >
>> >- no need to iterate the segments, there is only one function call
>> >- no need to have a special case for ext mbufs like you did in your patch
>> >- it may be safer, in case some segments have a refcnt == 1, because
>> >  resources will be freed.
>> 
>> For external mbuf, attach only increases refcnt of shinfo, refcnt of mbuf 
>> won't be touched. For normal
>> mbuf, attach only increase refcnt of mbuf, no shinfo there, no refcnt of 
>> shinfo increased.
>
>I suppose rte_gso_segment() can take any mbuf type as input: standard
>mbuf, indirect mbuf, ext mbuf, or even a mbuf chaing containing segments of
>different types.
>
>For instance, if you pass a chain of 2 mbufs:
>- the first one is a direct mbuf containing the IP/TCP headers (orig_hdr)
>- the second on is a mbuf pointing to an ext buffer (orig_payload)
>
>I expect that the resulting mbuf after calling gso contains a list of mbufs
>like this:
>- a first segment containing the IP/TCP headers (new_hdr)
>- a payload segment pointing on the same ext buffer
>
>In theory, there is no reason that orig_hdr should be referenced by
>another new mbuf, because it only contains headers (no data). If that's
>the case, its refcnt is 1, and decreasing it to 0 without freeing it
>is a bug.

For this user scenario, orig_m is owner of external buffer, small segmented 
mbufs reference
it, you shouldn't free orig_m before all attached segmented mbufs are freed, 
isn't it?

>
>Anyway, there is maybe no issue in that case, but I was just suggesting
>that using rte_pktmbuf_free() is easier to read, and safer than manually
>decreasing the refcnt of each segment.
>
>
>> >> >Again, sorry if this is not the issue your are referring to, but
>> >> >in this case I really think that having a simple example code that
>> >> >shows the issue would help.
>> >> 
>> >> Oliver, my statement in the patch I sent out has pseudo code to show 
>> >> this.  I don't think a simple
>> >> unit test can show it.
>> >
>> >I don't see why. The PMDs and the libraries use the mbuf functions, why
>> >a unit test couldn't call the same functions?
>> >
>> >> Let me summarize it here again. For original mbuf, there are two cases 
>> >> freeing
>> >> it, case one is PMD driver calls free against segmented mbufs, last 
>> >> segmented mbuf free will trigger
>> >> free_cb call which will free original large & extended mbuf.
>> >
>> >OK
>> >
>> >> Case two is PMD driver will call free against
>> >> original mbuf, that also will call free_cb to free attached extended 
>> >> buffer.
>> >
>> >OK
>> >
>> >And what makes that case 1 or case 2 is executed?
>> >
>> >> In case one free_cb must call
>> >> rte_pktmbuf_free otherwise nobody will free original large & extended 
>> >> mbuf, in case two free_cb can't 
>> >> call rte_pktmbuf_free because the caller calling it is just 
>> >> rte_pktmbuf_free we need. That is to say, you
>> >> must use the same free_cb to handle these two cases, this is my issue and 
>> >> the point you don't get.
>> >
>> >I think there is no need to change the free_cb API. It should work like
>> >this:
>> >
>> >- virtio creates the original external mbuf (orig_m)
>> >- gso will create a new mbuf referencing the external buffer (new_m)
>> >
>> >At this point, the shinfo has a refcnt of 2. The large buffer will be
>> >freed as soon as rte_pktmbuf_free() is called on orig_m and new_m,
>> >whatever the order.
>> >
>> >Regards,
>> >Olivier
>> 
>> Oliver, the reason it works is I changed free_cb API, case 1 doesn't know 
>> orig_m, how you make it free orig_m in free_cb.
>> The intention I change free_cb is to let it know orig_m, I saw OVS DPDK ran 
>> out out buffers and orig_m isn't freed, that is why
>> I want to bring in this to fix the issue. Again, in case 1, nobody 
>> explicitly calls ret_pktmbuf_free(orig_m) except free_cb I changed.
>
>If nobody calls ret_pktmbuf_free(orig_m), it is a problem.
>The free_cb is to free the buffer, not the mbuf.
>
>To me, it should work like this:
>
>1- virtio creates a mbuf attached to the ext buffer (the shinfo placement
>   bug should be fixed)
>2- gso create mbufs that reference the the same ext buf (by attaching the
>   new mbuf)
>3- gso must free the original mbuf

This is impossible, segmented mbufs are referencing external buffer in original 
mbuf,
how do you free it? As I said rte_gso_segment has no way to free it, please 
tell me a way if
you know how to do this.

>4- the PMD transmits the new mbufs, and frees them
>
>Whatever 3- or 4- is executed first, at the end we are sure that:
>- all mbufs will be returned to the pool
>- the linear buffer will be freed when the refcnt reaches 0.
>
>If this is still unclear, please, write a unit test like I did
>above to show your issue.
>
>Regards,
>Olivier
>

The issue is in "3- gso must free the original mbuf", 
rte_pktmbuf_free(segmented_mbus) can't do it,
rte_gso_segment is impossible to do it, only feasible point is free_cb, please 
let me know if you have
a better way to free original mbuf and don't impact on segmented mbufs in PMD. 
My point is you must
have a place to call rte_pktmbuf_free(rogin_m) explicitly, otherwise it is 
impossible  to return it to memory
pool, please point out  where it can be called in my user scenario. I don't 
care how it is done, I just care it can
fix my issue, please don't hesitate and send me a patch if you can, thanks a 
lot.

>
>
>> free_cb must handle case 1 and case 2 in the same code, for case 1, caller_m 
>> is segmented new_m, for case 2, caller_m is orig_m.
>> 
>> loop in rte_gso_segement is handling original mbuf (this mbuf is multi-mbuf 
>> and includes multiple mbufs which are linked by next
>> pointer), it isn't a problem at all.
>> 
>> Please show me code how you can fix my issue if you don't change free_cb, 
>> thank you.
>> 
>> struct shinfo_arg {
>>        void *buf;
>>        struct rte_mbuf *mbuf;
>> };
>> 
>> virtio_dev_extbuf_free(struct rte_mbuf *caller_m, void *opaque)
>> {
>>        struct shinfo_arg *arg = (struct shinfo_arg *)opaque;
>> 
>>        rte_free(arg->buf);
>>        if (caller_m != arg->mbuf)
>>                rte_pktmbuf_free(arg->mbuf);
>>        rte_free(arg);
>> }







 

Reply via email to