Re: [PATCH v4 0/3] virtio support cache indirect desc

2021-11-11 Thread Xuan Zhuo
On Thu, 11 Nov 2021 10:02:01 -0500, Michael S. Tsirkin  wrote:
> On Thu, Nov 11, 2021 at 02:52:07PM +0800, Xuan Zhuo wrote:
> > On Wed, 10 Nov 2021 07:53:44 -0500, Michael S. Tsirkin  
> > wrote:
> > > On Mon, Nov 08, 2021 at 10:47:40PM +0800, Xuan Zhuo wrote:
> > > > On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > Hmm a bunch of comments got ignored. See e.g.
> > > > > https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> > > > > if they aren't relevant add code comments or commit log text 
> > > > > explaining the
> > > > > design choice please.
> > > >
> > > > I should have responded to related questions, I am guessing whether 
> > > > some emails
> > > > have been lost.
> > > >
> > > > I have sorted out the following 6 questions, if there are any missing 
> > > > questions,
> > > > please let me know.
> > > >
> > > > 1. use list_head
> > > >   In the earliest version, I used pointers directly. You suggest that I 
> > > > use
> > > >   llist_head, but considering that llist_head has atomic operations. 
> > > > There is no
> > > >   competition problem here, so I used list_head.
> > > >
> > > >   In fact, I did not increase the allocated space for list_head.
> > > >
> > > >   use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
> > > >   use as queue item: | list_head |
> > >
> > > the concern is that you touch many cache lines when removing an entry.
> > >
> > > I suggest something like:
> > >
> > > llist: add a non-atomic list_del_first
> > >
> > > One has to know what one's doing, but if one has locked the list
> > > preventing all accesses, then it's ok to just pop off an entry without
> > > atomics.
> > >
> >
> > Oh, great, but my way of solving the problem is too conservative.
> >
> > > Signed-off-by: Michael S. Tsirkin 
> > >
> > > ---
> > >
> > > diff --git a/include/linux/llist.h b/include/linux/llist.h
> > > index 24f207b0190b..13a47dddb12b 100644
> > > --- a/include/linux/llist.h
> > > +++ b/include/linux/llist.h
> > > @@ -247,6 +247,17 @@ static inline struct llist_node 
> > > *__llist_del_all(struct llist_head *head)
> > >
> > >  extern struct llist_node *llist_del_first(struct llist_head *head);
> > >
> > > +static inline struct llist_node *__llist_del_first(struct llist_head 
> > > *head)
> > > +{
> > > + struct llist_node *first = head->first;
> > > +
> > > + if (!first)
> > > + return NULL;
> > > +
> > > + head->first = first->next;
> > > + return first;
> > > +}
> > > +
> > >  struct llist_node *llist_reverse_order(struct llist_node *head);
> > >
> > >  #endif /* LLIST_H */
> > >
> > >
> > > -
> > >
> > >
> > > > 2.
> > > > > > +   if (vq->use_desc_cache && total_sg <= 
> > > > > > VIRT_QUEUE_CACHE_DESC_NUM) {
> > > > > > +   if (vq->desc_cache_chain) {
> > > > > > +   desc = vq->desc_cache_chain;
> > > > > > +   vq->desc_cache_chain = (void *)desc->addr;
> > > > > > +   goto got;
> > > > > > +   }
> > > > > > +   n = VIRT_QUEUE_CACHE_DESC_NUM;
> > > > >
> > > > > Hmm. This will allocate more entries than actually used. Why do it?
> > > >
> > > >
> > > > This is because the size of each cache item is fixed, and the logic has 
> > > > been
> > > > modified in the latest code. I think this problem no longer exists.
> > > >
> > > >
> > > > 3.
> > > > > What bothers me here is what happens if cache gets
> > > > > filled on one numa node, then used on another?
> > > >
> > > > I'm thinking about another question, how did the cross-numa appear 
> > > > here, and
> > > > virtio desc queue also has the problem of cross-numa. So is it 
> > > > necessary for us
> > > > to deal with the cross-numa scene?
> > >
> > > It's true that desc queue might be cross numa, and people are looking
> > > for ways to improve that. Not a reason to make things worse ...
> > >
> >
> > I will test for it.
> >
> > >
> > > > Indirect desc is used as virtio desc, so as long as it is in the same 
> > > > numa as
> > > > virtio desc. So we can allocate indirect desc cache at the same time 
> > > > when
> > > > allocating virtio desc queue.
> > >
> > > Using it from current node like we do now seems better.
> > >
> > > > 4.
> > > > > So e.g. for rx, we are wasting memory since indirect isn't used.
> > > >
> > > > In the current version, desc cache is set up per queue.
> > > >
> > > > So if the desc cache is not used, we don't need to set the desc cache.
> > > >
> > > > For example, virtio-net, as long as the tx queue and the rx queue in 
> > > > big packet
> > > > mode enable desc cache.
> > >
> > >
> > > I liked how in older versions adding indirect enabled it implicitly
> > > though without the need to hack drivers.
> >
> > I see.
> >
> > >
> > > > 5.
> > > > > Would a better API be a cache size in bytes? This controls how much
> > > > > memory is spent after all.
> > > >
> > > > My design is to set a threshold. When total_sg is greater than this
> > > > threshold, it will fall back to kmalloc/kfree. When total_sg is less
> > > > than or equal to this threshold, use the allocated cache.

Re: [PATCH v4 0/3] virtio support cache indirect desc

2021-11-11 Thread Michael S. Tsirkin
On Thu, Nov 11, 2021 at 02:52:07PM +0800, Xuan Zhuo wrote:
> On Wed, 10 Nov 2021 07:53:44 -0500, Michael S. Tsirkin  
> wrote:
> > On Mon, Nov 08, 2021 at 10:47:40PM +0800, Xuan Zhuo wrote:
> > > On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > Hmm a bunch of comments got ignored. See e.g.
> > > > https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> > > > if they aren't relevant add code comments or commit log text explaining 
> > > > the
> > > > design choice please.
> > >
> > > I should have responded to related questions, I am guessing whether some 
> > > emails
> > > have been lost.
> > >
> > > I have sorted out the following 6 questions, if there are any missing 
> > > questions,
> > > please let me know.
> > >
> > > 1. use list_head
> > >   In the earliest version, I used pointers directly. You suggest that I 
> > > use
> > >   llist_head, but considering that llist_head has atomic operations. 
> > > There is no
> > >   competition problem here, so I used list_head.
> > >
> > >   In fact, I did not increase the allocated space for list_head.
> > >
> > >   use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
> > >   use as queue item: | list_head |
> >
> > the concern is that you touch many cache lines when removing an entry.
> >
> > I suggest something like:
> >
> > llist: add a non-atomic list_del_first
> >
> > One has to know what one's doing, but if one has locked the list
> > preventing all accesses, then it's ok to just pop off an entry without
> > atomics.
> >
> 
> Oh, great, but my way of solving the problem is too conservative.
> 
> > Signed-off-by: Michael S. Tsirkin 
> >
> > ---
> >
> > diff --git a/include/linux/llist.h b/include/linux/llist.h
> > index 24f207b0190b..13a47dddb12b 100644
> > --- a/include/linux/llist.h
> > +++ b/include/linux/llist.h
> > @@ -247,6 +247,17 @@ static inline struct llist_node 
> > *__llist_del_all(struct llist_head *head)
> >
> >  extern struct llist_node *llist_del_first(struct llist_head *head);
> >
> > +static inline struct llist_node *__llist_del_first(struct llist_head *head)
> > +{
> > +   struct llist_node *first = head->first;
> > +
> > +   if (!first)
> > +   return NULL;
> > +
> > +   head->first = first->next;
> > +   return first;
> > +}
> > +
> >  struct llist_node *llist_reverse_order(struct llist_node *head);
> >
> >  #endif /* LLIST_H */
> >
> >
> > -
> >
> >
> > > 2.
> > > > > + if (vq->use_desc_cache && total_sg <= 
> > > > > VIRT_QUEUE_CACHE_DESC_NUM) {
> > > > > + if (vq->desc_cache_chain) {
> > > > > + desc = vq->desc_cache_chain;
> > > > > + vq->desc_cache_chain = (void *)desc->addr;
> > > > > + goto got;
> > > > > + }
> > > > > + n = VIRT_QUEUE_CACHE_DESC_NUM;
> > > >
> > > > Hmm. This will allocate more entries than actually used. Why do it?
> > >
> > >
> > > This is because the size of each cache item is fixed, and the logic has 
> > > been
> > > modified in the latest code. I think this problem no longer exists.
> > >
> > >
> > > 3.
> > > > What bothers me here is what happens if cache gets
> > > > filled on one numa node, then used on another?
> > >
> > > I'm thinking about another question, how did the cross-numa appear here, 
> > > and
> > > virtio desc queue also has the problem of cross-numa. So is it necessary 
> > > for us
> > > to deal with the cross-numa scene?
> >
> > It's true that desc queue might be cross numa, and people are looking
> > for ways to improve that. Not a reason to make things worse ...
> >
> 
> I will test for it.
> 
> >
> > > Indirect desc is used as virtio desc, so as long as it is in the same 
> > > numa as
> > > virtio desc. So we can allocate indirect desc cache at the same time when
> > > allocating virtio desc queue.
> >
> > Using it from current node like we do now seems better.
> >
> > > 4.
> > > > So e.g. for rx, we are wasting memory since indirect isn't used.
> > >
> > > In the current version, desc cache is set up per queue.
> > >
> > > So if the desc cache is not used, we don't need to set the desc cache.
> > >
> > > For example, virtio-net, as long as the tx queue and the rx queue in big 
> > > packet
> > > mode enable desc cache.
> >
> >
> > I liked how in older versions adding indirect enabled it implicitly
> > though without the need to hack drivers.
> 
> I see.
> 
> >
> > > 5.
> > > > Would a better API be a cache size in bytes? This controls how much
> > > > memory is spent after all.
> > >
> > > My design is to set a threshold. When total_sg is greater than this 
> > > threshold,
> > > it will fall back to kmalloc/kfree. When total_sg is less than or equal to
> > > this threshold, use the allocated cache.
> > >
> >
> > I know. My question is this, do devices know what a good threshold is?
> > If yes how do they know?
> 
> I think the driver knows the threshold; for example, MAX_SKB_FRAGS + 2 is a
> suitable threshold for virtio-net.

Re: [PATCH v4 0/3] virtio support cache indirect desc

2021-11-10 Thread Xuan Zhuo
On Wed, 10 Nov 2021 07:53:44 -0500, Michael S. Tsirkin  wrote:
> On Mon, Nov 08, 2021 at 10:47:40PM +0800, Xuan Zhuo wrote:
> > On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin  
> > wrote:
> > >
> > > Hmm a bunch of comments got ignored. See e.g.
> > > https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> > > if they aren't relevant add code comments or commit log text explaining 
> > > the
> > > design choice please.
> >
> > I should have responded to related questions, I am guessing whether some 
> > emails
> > have been lost.
> >
> > I have sorted out the following 6 questions, if there are any missing 
> > questions,
> > please let me know.
> >
> > 1. use list_head
> >   In the earliest version, I used pointers directly. You suggest that I use
> >   llist_head, but considering that llist_head has atomic operations. There 
> > is no
> >   competition problem here, so I used list_head.
> >
> >   In fact, I did not increase the allocated space for list_head.
> >
> >   use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
> >   use as queue item: | list_head |
>
> the concern is that you touch many cache lines when removing an entry.
>
> I suggest something like:
>
> llist: add a non-atomic list_del_first
>
> One has to know what one's doing, but if one has locked the list
> preventing all accesses, then it's ok to just pop off an entry without
> atomics.
>

Oh, great. My way of solving the problem was too conservative.

> Signed-off-by: Michael S. Tsirkin 
>
> ---
>
> diff --git a/include/linux/llist.h b/include/linux/llist.h
> index 24f207b0190b..13a47dddb12b 100644
> --- a/include/linux/llist.h
> +++ b/include/linux/llist.h
> @@ -247,6 +247,17 @@ static inline struct llist_node *__llist_del_all(struct 
> llist_head *head)
>
>  extern struct llist_node *llist_del_first(struct llist_head *head);
>
> +static inline struct llist_node *__llist_del_first(struct llist_head *head)
> +{
> + struct llist_node *first = head->first;
> +
> + if (!first)
> + return NULL;
> +
> + head->first = first->next;
> + return first;
> +}
> +
>  struct llist_node *llist_reverse_order(struct llist_node *head);
>
>  #endif /* LLIST_H */
>
>
> -
>
>
> > 2.
> > > > +   if (vq->use_desc_cache && total_sg <= 
> > > > VIRT_QUEUE_CACHE_DESC_NUM) {
> > > > +   if (vq->desc_cache_chain) {
> > > > +   desc = vq->desc_cache_chain;
> > > > +   vq->desc_cache_chain = (void *)desc->addr;
> > > > +   goto got;
> > > > +   }
> > > > +   n = VIRT_QUEUE_CACHE_DESC_NUM;
> > >
> > > Hmm. This will allocate more entries than actually used. Why do it?
> >
> >
> > This is because the size of each cache item is fixed, and the logic has been
> > modified in the latest code. I think this problem no longer exists.
> >
> >
> > 3.
> > > What bothers me here is what happens if cache gets
> > > filled on one numa node, then used on another?
> >
> > I'm thinking about another question, how did the cross-numa appear here, and
> > virtio desc queue also has the problem of cross-numa. So is it necessary 
> > for us
> > to deal with the cross-numa scene?
>
> It's true that desc queue might be cross numa, and people are looking
> for ways to improve that. Not a reason to make things worse ...
>

I will test for it.

>
> > Indirect desc is used as virtio desc, so as long as it is in the same numa 
> > as
> > virtio desc. So we can allocate indirect desc cache at the same time when
> > allocating virtio desc queue.
>
> Using it from current node like we do now seems better.
>
> > 4.
> > > So e.g. for rx, we are wasting memory since indirect isn't used.
> >
> > In the current version, desc cache is set up per queue.
> >
> > So if the desc cache is not used, we don't need to set the desc cache.
> >
> > For example, virtio-net, as long as the tx queue and the rx queue in big 
> > packet
> > mode enable desc cache.
>
>
> I liked how in older versions adding indirect enabled it implicitly
> though without the need to hack drivers.

I see.

>
> > 5.
> > > Would a better API be a cache size in bytes? This controls how much
> > > memory is spent after all.
> >
> > My design is to set a threshold. When total_sg is greater than this 
> > threshold,
> > it will fall back to kmalloc/kfree. When total_sg is less than or equal to
> > this threshold, use the allocated cache.
> >
>
> I know. My question is this, do devices know what a good threshold is?
> If yes how do they know?

I think the driver knows the threshold; for example, MAX_SKB_FRAGS + 2 is a
suitable threshold for virtio-net.


>
> > 6. kmem_cache_*
> >
> > I have tested these, the performance is not as good as the method used in 
> > this
> > patch.
>
> Do you mean kmem_cache_alloc_bulk/kmem_cache_free_bulk?
> You mentioned just kmem_cache_alloc previously.


I will test kmem_cache_alloc_bulk.

Re: [PATCH v4 0/3] virtio support cache indirect desc

2021-11-10 Thread Michael S. Tsirkin
On Wed, Nov 10, 2021 at 07:53:49AM -0500, Michael S. Tsirkin wrote:
> On Mon, Nov 08, 2021 at 10:47:40PM +0800, Xuan Zhuo wrote:
> > On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin  
> > wrote:
> > >
> > > Hmm a bunch of comments got ignored. See e.g.
> > > https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> > > if they aren't relevant add code comments or commit log text explaining 
> > > the
> > > design choice please.
> > 
> > I should have responded to related questions, I am guessing whether some 
> > emails
> > have been lost.
> > 
> > I have sorted out the following 6 questions, if there are any missing 
> > questions,
> > please let me know.
> > 
> > 1. use list_head
> >   In the earliest version, I used pointers directly. You suggest that I use
> >   llist_head, but considering that llist_head has atomic operations. There 
> > is no
> >   competition problem here, so I used list_head.
> > 
> >   In fact, I did not increase the allocated space for list_head.
> > 
> >   use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
> >   use as queue item: | list_head |
> 
> the concern is that you touch many cache lines when removing an entry.
> 
> I suggest something like:
> 
> llist: add a non-atomic list_del_first
> 
> One has to know what one's doing, but if one has locked the list
> preventing all accesses, then it's ok to just pop off an entry without
> atomics.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
> ---
> 
> diff --git a/include/linux/llist.h b/include/linux/llist.h
> index 24f207b0190b..13a47dddb12b 100644
> --- a/include/linux/llist.h
> +++ b/include/linux/llist.h
> @@ -247,6 +247,17 @@ static inline struct llist_node *__llist_del_all(struct 
> llist_head *head)
>  
>  extern struct llist_node *llist_del_first(struct llist_head *head);
>  
> +static inline struct llist_node *__llist_del_first(struct llist_head *head)
> +{
> + struct llist_node *first = head->first;
> +
> + if (!first)
> + return NULL;
> +
> + head->first = first->next;
> + return first;
> +}
> +
>  struct llist_node *llist_reverse_order(struct llist_node *head);
>  
>  #endif /* LLIST_H */
> 
> 
> -
> 
> 
> > 2.
> > > > +   if (vq->use_desc_cache && total_sg <= 
> > > > VIRT_QUEUE_CACHE_DESC_NUM) {
> > > > +   if (vq->desc_cache_chain) {
> > > > +   desc = vq->desc_cache_chain;
> > > > +   vq->desc_cache_chain = (void *)desc->addr;
> > > > +   goto got;
> > > > +   }
> > > > +   n = VIRT_QUEUE_CACHE_DESC_NUM;
> > >
> > > Hmm. This will allocate more entries than actually used. Why do it?
> > 
> > 
> > This is because the size of each cache item is fixed, and the logic has been
> > modified in the latest code. I think this problem no longer exists.
> > 
> > 
> > 3.
> > > What bothers me here is what happens if cache gets
> > > filled on one numa node, then used on another?
> > 
> > I'm thinking about another question, how did the cross-numa appear here, and
> > virtio desc queue also has the problem of cross-numa. So is it necessary 
> > for us
> > to deal with the cross-numa scene?
> 
> It's true that desc queue might be cross numa, and people are looking
> for ways to improve that. Not a reason to make things worse ...
> 

To add to that, given it's a concern, how about actually
testing performance for this config?

> > Indirect desc is used as virtio desc, so as long as it is in the same numa 
> > as
> virtio desc. So we can allocate indirect desc cache at the same time when
> > allocating virtio desc queue.
> 
> Using it from current node like we do now seems better.
> 
> > 4.
> > > So e.g. for rx, we are wasting memory since indirect isn't used.
> > 
> > In the current version, desc cache is set up per queue.
> > 
> > So if the desc cache is not used, we don't need to set the desc cache.
> > 
> > For example, virtio-net, as long as the tx queue and the rx queue in big 
> > packet
> > mode enable desc cache.
> 
> 
> I liked how in older versions adding indirect enabled it implicitly
> though without the need to hack drivers.
> 
> > 5.
> > > Would a better API be a cache size in bytes? This controls how much
> > > memory is spent after all.
> > 
> > My design is to set a threshold. When total_sg is greater than this 
> > threshold,
> > it will fall back to kmalloc/kfree. When total_sg is less than or equal to
> > this threshold, use the allocated cache.
> > 
> 
> I know. My question is this, do devices know what a good threshold is?
> If yes how do they know?
> 
> > 6. kmem_cache_*
> > 
> > I have tested these, the performance is not as good as the method used in 
> > this
> > patch.
> 
> Do you mean kmem_cache_alloc_bulk/kmem_cache_free_bulk?
> You mentioned just kmem_cache_alloc previously.
> 
> > 
> > Thanks.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v4 0/3] virtio support cache indirect desc

2021-11-10 Thread Michael S. Tsirkin
On Mon, Nov 08, 2021 at 10:47:40PM +0800, Xuan Zhuo wrote:
> On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin  wrote:
> >
> > Hmm a bunch of comments got ignored. See e.g.
> > https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> > if they aren't relevant add code comments or commit log text explaining the
> > design choice please.
> 
> I should have responded to related questions, I am guessing whether some 
> emails
> have been lost.
> 
> I have sorted out the following 6 questions, if there are any missing 
> questions,
> please let me know.
> 
> 1. use list_head
>   In the earliest version, I used pointers directly. You suggest that I use
>   llist_head, but considering that llist_head has atomic operations. There is 
> no
>   competition problem here, so I used list_head.
> 
>   In fact, I did not increase the allocated space for list_head.
> 
>   use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
>   use as queue item: | list_head |

the concern is that you touch many cache lines when removing an entry.

I suggest something like:

llist: add a non-atomic list_del_first

One has to know what one's doing, but if one has locked the list
preventing all accesses, then it's ok to just pop off an entry without
atomics.

Signed-off-by: Michael S. Tsirkin 

---

diff --git a/include/linux/llist.h b/include/linux/llist.h
index 24f207b0190b..13a47dddb12b 100644
--- a/include/linux/llist.h
+++ b/include/linux/llist.h
@@ -247,6 +247,17 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
 
 extern struct llist_node *llist_del_first(struct llist_head *head);
 
+static inline struct llist_node *__llist_del_first(struct llist_head *head)
+{
+   struct llist_node *first = head->first;
+
+   if (!first)
+   return NULL;
+
+   head->first = first->next;
+   return first;
+}
+
 struct llist_node *llist_reverse_order(struct llist_node *head);
 
 #endif /* LLIST_H */
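
As context for the patch above: `__llist_del_first` is just a plain pointer pop, so the whole locked fast path is a couple of pointer updates with no cmpxchg loop. A user-space sketch of the idea (the pop mirrors the patch; the push helper and its name are illustrative, added here only to make the pair self-contained), valid only while the caller holds whatever lock protects the list:

```c
#include <assert.h>
#include <stddef.h>

struct llist_node {
    struct llist_node *next;
};

struct llist_head {
    struct llist_node *first;
};

/* Non-atomic push: safe only when the caller excludes all other accesses,
 * e.g. by holding the virtqueue lock. (Illustrative helper.) */
static void llist_push_locked(struct llist_node *node, struct llist_head *head)
{
    node->next = head->first;
    head->first = node;
}

/* Non-atomic pop, as in __llist_del_first above: one load, one store,
 * and it touches only the list head and the first node. */
static struct llist_node *llist_pop_locked(struct llist_head *head)
{
    struct llist_node *first = head->first;

    if (!first)
        return NULL;

    head->first = first->next;
    return first;
}
```

This also illustrates the cache-line argument: pop reads and writes only the head and the first node, unlike walking a doubly-linked list_head unlink.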


-


> 2.
> > > + if (vq->use_desc_cache && total_sg <= VIRT_QUEUE_CACHE_DESC_NUM) {
> > > + if (vq->desc_cache_chain) {
> > > + desc = vq->desc_cache_chain;
> > > + vq->desc_cache_chain = (void *)desc->addr;
> > > + goto got;
> > > + }
> > > + n = VIRT_QUEUE_CACHE_DESC_NUM;
> >
> > Hmm. This will allocate more entries than actually used. Why do it?
> 
> 
> This is because the size of each cache item is fixed, and the logic has been
> modified in the latest code. I think this problem no longer exists.
> 
> 
> 3.
> > What bothers me here is what happens if cache gets
> > filled on one numa node, then used on another?
> 
> I'm thinking about another question, how did the cross-numa appear here, and
> virtio desc queue also has the problem of cross-numa. So is it necessary for 
> us
> to deal with the cross-numa scene?

It's true that desc queue might be cross numa, and people are looking
for ways to improve that. Not a reason to make things worse ...


> Indirect desc is used as virtio desc, so as long as it is in the same numa as
> virtio desc. So we can allocate indirect desc cache at the same time when
> allocating virtio desc queue.

Using it from current node like we do now seems better.

> 4.
> > So e.g. for rx, we are wasting memory since indirect isn't used.
> 
> In the current version, desc cache is set up per queue.
> 
> So if the desc cache is not used, we don't need to set the desc cache.
> 
> For example, virtio-net, as long as the tx queue and the rx queue in big 
> packet
> mode enable desc cache.


I liked how in older versions adding indirect enabled it implicitly,
without the need to hack drivers.

> 5.
> > Would a better API be a cache size in bytes? This controls how much
> > memory is spent after all.
> 
> My design is to set a threshold. When total_sg is greater than this threshold,
> it will fall back to kmalloc/kfree. When total_sg is less than or equal to
> this threshold, use the allocated cache.
> 

I know. My question is this, do devices know what a good threshold is?
If yes how do they know?

> 6. kmem_cache_*
> 
> I have tested these, the performance is not as good as the method used in this
> patch.

Do you mean kmem_cache_alloc_bulk/kmem_cache_free_bulk?
You mentioned just kmem_cache_alloc previously.

> 
> Thanks.



Re: [PATCH v4 0/3] virtio support cache indirect desc

2021-11-09 Thread Xuan Zhuo
On Tue, 9 Nov 2021 08:03:18 -0500, Michael S. Tsirkin  wrote:
> On Mon, Nov 08, 2021 at 07:49:48PM +0800, Xuan Zhuo wrote:
> > If the VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds and the number
> > of sgs used for sending packets is greater than 1, we must constantly
> > call __kmalloc/kfree to allocate/release desc.
> >
> > In the case of extremely fast packet delivery, the overhead cannot be
> > ignored:
> >
> >   27.46%  [kernel]  [k] virtqueue_add
> >   16.66%  [kernel]  [k] detach_buf_split
> >   16.51%  [kernel]  [k] virtnet_xsk_xmit
> >   14.04%  [kernel]  [k] virtqueue_add_outbuf
> >5.18%  [kernel]  [k] __kmalloc
> >4.08%  [kernel]  [k] kfree
> >2.80%  [kernel]  [k] virtqueue_get_buf_ctx
> >2.22%  [kernel]  [k] xsk_tx_peek_desc
> >2.08%  [kernel]  [k] memset_erms
> >0.83%  [kernel]  [k] virtqueue_kick_prepare
> >0.76%  [kernel]  [k] virtnet_xsk_run
> >0.62%  [kernel]  [k] __free_old_xmit_ptr
> >0.60%  [kernel]  [k] vring_map_one_sg
> >0.53%  [kernel]  [k] native_apic_mem_write
> >0.46%  [kernel]  [k] sg_next
> >0.43%  [kernel]  [k] sg_init_table
> >0.41%  [kernel]  [k] kmalloc_slab
> >
> > This patch adds a cache function to virtio to cache these allocated indirect
> > desc instead of constantly allocating and releasing desc.
> >
> > v4:
> > 1. Only allow desc cache when VIRTIO_RING_F_INDIRECT_DESC negotiation 
> > is successful
> > 2. The desc cache threshold can be set for each virtqueue
> >
> > v3:
> >   pre-allocate per buffer indirect descriptors array
>
> So I'm not sure why we are doing that. Did it improve anything?

1. It avoids calling kmalloc in the data path, though that is not the main point.
2. It allocates memory on the CPU that initializes the device, to ensure that the
indirect desc cache and the vq descs are on the same NUMA node.

Thanks.

>
>
> > v2:
> >   use struct list_head to cache the desc
> >
> > Xuan Zhuo (3):
> >   virtio: cache indirect desc for split
> >   virtio: cache indirect desc for packed
> >   virtio-net: enable virtio desc cache
> >
> >  drivers/net/virtio_net.c |  12 ++-
> >  drivers/virtio/virtio_ring.c | 152 +++
> >  include/linux/virtio.h   |  17 
> >  3 files changed, 163 insertions(+), 18 deletions(-)
> >
> > --
> > 2.31.0
>


Re: [PATCH v4 0/3] virtio support cache indirect desc

2021-11-09 Thread Michael S. Tsirkin
On Mon, Nov 08, 2021 at 07:49:48PM +0800, Xuan Zhuo wrote:
> If the VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds and the number
> of sgs used for sending packets is greater than 1, we must constantly
> call __kmalloc/kfree to allocate/release desc.
> 
> In the case of extremely fast packet delivery, the overhead cannot be
> ignored:
> 
>   27.46%  [kernel]  [k] virtqueue_add
>   16.66%  [kernel]  [k] detach_buf_split
>   16.51%  [kernel]  [k] virtnet_xsk_xmit
>   14.04%  [kernel]  [k] virtqueue_add_outbuf
>5.18%  [kernel]  [k] __kmalloc
>4.08%  [kernel]  [k] kfree
>2.80%  [kernel]  [k] virtqueue_get_buf_ctx
>2.22%  [kernel]  [k] xsk_tx_peek_desc
>2.08%  [kernel]  [k] memset_erms
>0.83%  [kernel]  [k] virtqueue_kick_prepare
>0.76%  [kernel]  [k] virtnet_xsk_run
>0.62%  [kernel]  [k] __free_old_xmit_ptr
>0.60%  [kernel]  [k] vring_map_one_sg
>0.53%  [kernel]  [k] native_apic_mem_write
>0.46%  [kernel]  [k] sg_next
>0.43%  [kernel]  [k] sg_init_table
>0.41%  [kernel]  [k] kmalloc_slab
> 
> This patch adds a cache function to virtio to cache these allocated indirect
> desc instead of constantly allocating and releasing desc.
> 
> v4:
> 1. Only allow desc cache when VIRTIO_RING_F_INDIRECT_DESC negotiation is 
> successful
> 2. The desc cache threshold can be set for each virtqueue
> 
> v3:
>   pre-allocate per buffer indirect descriptors array

So I'm not sure why we are doing that. Did it improve anything?


> v2:
>   use struct list_head to cache the desc
> 
> Xuan Zhuo (3):
>   virtio: cache indirect desc for split
>   virtio: cache indirect desc for packed
>   virtio-net: enable virtio desc cache
> 
>  drivers/net/virtio_net.c |  12 ++-
>  drivers/virtio/virtio_ring.c | 152 +++
>  include/linux/virtio.h   |  17 
>  3 files changed, 163 insertions(+), 18 deletions(-)
> 
> --
> 2.31.0



Re: [PATCH v4 0/3] virtio support cache indirect desc

2021-11-08 Thread Xuan Zhuo
On Mon, 8 Nov 2021 08:49:27 -0500, Michael S. Tsirkin  wrote:
>
> Hmm a bunch of comments got ignored. See e.g.
> https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
> if they aren't relevant add code comments or commit log text explaining the
> design choice please.

I believe I responded to the related questions; I am guessing that some emails
may have been lost.

I have sorted out the following six questions; if any are missing,
please let me know.

1. use list_head
  In the earliest version, I used pointers directly. You suggested that I use
  llist_head, but llist_head relies on atomic operations. There is no
  concurrency problem here, so I used list_head instead.

  In fact, I did not increase the allocated space for list_head.

  use as desc array: | vring_desc | vring_desc | vring_desc | vring_desc |
  use as queue item: | list_head |
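
As a hedged illustration of this overlay (a user-space sketch; `cache_push`, `cache_pop`, and the simplified struct layouts are illustrative, not the patch's actual code): while a desc array sits in the cache, its leading bytes are reinterpreted as the list node, so the cache costs no extra memory as long as the array is at least as large as a list_head.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct vring_desc (16 bytes on the wire). */
struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct list_head {
    struct list_head *next, *prev;
};

/* While cached, the desc array holds no live descriptor data, so its
 * first bytes can double as the list node. */
static void cache_push(struct list_head *cache, struct vring_desc *descs)
{
    struct list_head *item = (struct list_head *)descs;

    item->next = cache->next;
    item->prev = cache;
    cache->next->prev = item;
    cache->next = item;
}

static struct vring_desc *cache_pop(struct list_head *cache)
{
    struct list_head *item = cache->next;

    if (item == cache)
        return NULL;            /* cache is empty */
    cache->next = item->next;
    item->next->prev = cache;
    return (struct vring_desc *)item;
}
```

On a 64-bit build, sizeof(struct list_head) is 16 bytes, i.e. exactly one descriptor, so even a single-entry array has room for the node.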

2.
> > +   if (vq->use_desc_cache && total_sg <= VIRT_QUEUE_CACHE_DESC_NUM) {
> > +   if (vq->desc_cache_chain) {
> > +   desc = vq->desc_cache_chain;
> > +   vq->desc_cache_chain = (void *)desc->addr;
> > +   goto got;
> > +   }
> > +   n = VIRT_QUEUE_CACHE_DESC_NUM;
>
> Hmm. This will allocate more entries than actually used. Why do it?


This is because the size of each cache item is fixed, and the logic has been
modified in the latest code. I think this problem no longer exists.
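
For reference, the snippet quoted in point 2 chains free desc arrays through the first descriptor's `addr` field, which is unused while the array sits in the cache. A user-space sketch of that free chain (the helper names `desc_cache_put`/`desc_cache_get` are illustrative, not the patch's API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct vring_desc {
    uint64_t addr;   /* while cached: holds a pointer to the next free array */
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Push a free desc array: stash the old chain head in desc->addr. */
static void desc_cache_put(struct vring_desc **chain, struct vring_desc *desc)
{
    desc->addr = (uint64_t)(uintptr_t)*chain;
    *chain = desc;
}

/* Pop: the next chain head was stored in the popped entry's addr field,
 * matching the quoted "vq->desc_cache_chain = (void *)desc->addr". */
static struct vring_desc *desc_cache_get(struct vring_desc **chain)
{
    struct vring_desc *desc = *chain;

    if (desc)
        *chain = (struct vring_desc *)(uintptr_t)desc->addr;
    return desc;
}
```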


3.
> What bothers me here is what happens if cache gets
> filled on one numa node, then used on another?

I'm thinking about another question: how does the cross-NUMA case arise here? The
virtio desc queue also has the cross-NUMA problem. So is it necessary for us
to deal with the cross-NUMA case?

Indirect descs are used like virtio descs, so they only need to be on the same
NUMA node as the virtio descs. So we can allocate the indirect desc cache at the
same time as allocating the virtio desc queue.

4.
> So e.g. for rx, we are wasting memory since indirect isn't used.

In the current version, the desc cache is set up per queue.

So if a queue does not use the desc cache, we don't need to set it up for that
queue.

For example, in virtio-net only the tx queues, and the rx queues in big-packet
mode, need to enable the desc cache.

5.
> Would a better API be a cache size in bytes? This controls how much
> memory is spent after all.

My design is to set a threshold. When total_sg is greater than this threshold,
it will fall back to kmalloc/kfree. When total_sg is less than or equal to
this threshold, use the allocated cache.
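
A minimal sketch of that threshold check (struct and function names are illustrative, and the example value is a per-driver choice, not a default from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Per-virtqueue policy: chains of up to desc_cache_thr descriptors come
 * from the pre-allocated cache; longer chains fall back to kmalloc/kfree. */
struct vq_policy {
    bool use_desc_cache;
    unsigned int desc_cache_thr;
};

static bool from_cache(const struct vq_policy *vq, unsigned int total_sg)
{
    return vq->use_desc_cache && total_sg <= vq->desc_cache_thr;
}
```

A driver would pick the threshold from what it knows about its own workloads, e.g. virtio-net's MAX_SKB_FRAGS + 2 for the tx path.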


6. kmem_cache_*

I have tested these; the performance is not as good as the method used in this
patch.


Thanks.


Re: [PATCH v4 0/3] virtio support cache indirect desc

2021-11-08 Thread Michael S. Tsirkin
On Mon, Nov 08, 2021 at 07:49:48PM +0800, Xuan Zhuo wrote:
> If the VIRTIO_RING_F_INDIRECT_DESC negotiation succeeds and the number
> of sgs used for sending packets is greater than 1, we must constantly
> call __kmalloc/kfree to allocate/release desc.
> 
> In the case of extremely fast packet delivery, the overhead cannot be
> ignored:
> 
>   27.46%  [kernel]  [k] virtqueue_add
>   16.66%  [kernel]  [k] detach_buf_split
>   16.51%  [kernel]  [k] virtnet_xsk_xmit
>   14.04%  [kernel]  [k] virtqueue_add_outbuf
>5.18%  [kernel]  [k] __kmalloc
>4.08%  [kernel]  [k] kfree
>2.80%  [kernel]  [k] virtqueue_get_buf_ctx
>2.22%  [kernel]  [k] xsk_tx_peek_desc
>2.08%  [kernel]  [k] memset_erms
>0.83%  [kernel]  [k] virtqueue_kick_prepare
>0.76%  [kernel]  [k] virtnet_xsk_run
>0.62%  [kernel]  [k] __free_old_xmit_ptr
>0.60%  [kernel]  [k] vring_map_one_sg
>0.53%  [kernel]  [k] native_apic_mem_write
>0.46%  [kernel]  [k] sg_next
>0.43%  [kernel]  [k] sg_init_table
>0.41%  [kernel]  [k] kmalloc_slab
> 
> This patch adds a cache function to virtio to cache these allocated indirect
> desc instead of constantly allocating and releasing desc.

Hmm a bunch of comments got ignored. See e.g.
https://lore.kernel.org/r/20211027043851-mutt-send-email-mst%40kernel.org
if they aren't relevant add code comments or commit log text explaining the
design choice please.


> v4:
> 1. Only allow desc cache when VIRTIO_RING_F_INDIRECT_DESC negotiation is 
> successful
> 2. The desc cache threshold can be set for each virtqueue
> 
> v3:
>   pre-allocate per buffer indirect descriptors array
> 
> v2:
>   use struct list_head to cache the desc
> 
> Xuan Zhuo (3):
>   virtio: cache indirect desc for split
>   virtio: cache indirect desc for packed
>   virtio-net: enable virtio desc cache
> 
>  drivers/net/virtio_net.c |  12 ++-
>  drivers/virtio/virtio_ring.c | 152 +++
>  include/linux/virtio.h   |  17 
>  3 files changed, 163 insertions(+), 18 deletions(-)
> 
> --
> 2.31.0
