Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Hou Tao
Hi Vivek,

On 1/6/2024 5:27 AM, Vivek Goyal wrote:
> On Fri, Jan 05, 2024 at 08:57:55PM +, Matthew Wilcox wrote:
>> On Fri, Jan 05, 2024 at 03:41:48PM -0500, Vivek Goyal wrote:
>>> On Fri, Jan 05, 2024 at 08:21:00PM +, Matthew Wilcox wrote:
>>>> On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
>>>>> On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
>>>>>> From: Hou Tao 
>>>>>>
>>>>>> When invoking virtio_fs_enqueue_req() through kworker, both the
>>>>>> allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
>>>>>> Considering the size of both the sg array and the bounce buffer may be
>>>>>> greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
>>>>>> possibility of memory allocation failure.
>>>>>>
>>>>> What's the practical benefit of this patch? Looks like if memory
>>>>> allocation fails, we keep retrying at an interval of 1ms and don't
>>>>> return an error to user space.

The motivation for GFP_NOFS comes from another fix proposed for
virtiofs [1], in which trying to insert a big kernel module kept in a
cache-disabled virtiofs makes the length of the fuse args large (e.g.,
10MB), so the memory allocation in copy_args_to_argbuf() fails forever.
That fix tries to solve the problem by limiting the length of the data
kept in a fuse arg, but because the limit is still large (256KB in that
patch), I think using GFP_NOFS will also be helpful for such memory
allocations.

[1]:
https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/
>>>> You don't deplete the atomic reserves unnecessarily?
>>> Sounds reasonable. 
>>>
>>> With GFP_NOFS specified, can we still get -ENOMEM? Or will this block
>>> indefinitely till memory can be allocated?
>> If you need the "loop indefinitely" behaviour, that's
>> GFP_NOFS | __GFP_NOFAIL.  If you're actually doing something yourself
>> which can free up memory, this is a bad choice.  If you're just sleeping
>> and retrying, you might as well have the MM do that for you.

Even with GFP_NOFS, I think -ENOMEM is still possible, so the retry
logic is still necessary.
> I probably don't want to wait indefinitely. There might be some cases
> where I might want to return an error to user space. For example, if
> the virtiofs device has been hot-unplugged, then there is no point in
> waiting indefinitely for memory allocation. Even if memory was allocated,
> we will soon return an error to user space with -ENOTCONN.
>
> We are currently not doing that check after memory allocation failure but
> we probably could as an optimization.

Yes. It seems virtio_fs_enqueue_req() only checks fsvq->connected before
writing the sg list to the virtqueue, so if the virtio device is
hot-unplugged while free memory is low, it may do unnecessary retries.
Even worse, it may hang. I volunteer to post a patch that checks the
connected status after a memory allocation failure, if you are OK with that.
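
For illustration only, a minimal sketch of such a check (names are taken
from fs/fuse/virtio_fs.c, but the exact placement and error handling are
hypothetical, not a tested patch):

	ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_NOFS);
	if (ret == -ENOMEM || ret == -ENOSPC) {
		spin_lock(&fsvq->lock);
		if (!fsvq->connected) {
			/* Device is gone: fail now instead of retrying. */
			spin_unlock(&fsvq->lock);
			req->out.h.error = -ENOTCONN;
			fuse_request_end(req);
			return;
		}
		/* Otherwise keep the existing 1ms requeue-and-retry. */
		list_add_tail(&req->list, &fsvq->queued_reqs);
		schedule_delayed_work(&fsvq->dispatch_work,
				      msecs_to_jiffies(1));
		spin_unlock(&fsvq->lock);
	}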

>
> So this patch looks good to me as it is. Thanks Hou Tao.
>
> Reviewed-by: Vivek Goyal 
>
> Thanks
> Vivek




Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Hou Tao



On 1/6/2024 4:21 AM, Matthew Wilcox wrote:
> On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
>> On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
>>> From: Hou Tao 
>>>
>>> When invoking virtio_fs_enqueue_req() through kworker, both the
>>> allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
>>> Considering the size of both the sg array and the bounce buffer may be
>>> greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
>>> possibility of memory allocation failure.
>>>
>> What's the practical benefit of this patch? Looks like if memory
>> allocation fails, we keep retrying at an interval of 1ms and don't
>> return an error to user space.
> You don't deplete the atomic reserves unnecessarily?
Besides that, I think the proposed GFP_NOFS may reduce unnecessary
retries. I should have mentioned that in the commit message. Should I
post a v3 to do that?




Re: [PATCH v6 07/12] x86/sgx: Introduce EPC page states

2024-01-05 Thread Haitao Huang
On Fri, 05 Jan 2024 11:57:03 -0600, Dave Hansen wrote:



On 10/30/23 11:20, Haitao Huang wrote:
> @@ -527,16 +530,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
>
>  int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>  {
> 	spin_lock(&sgx_global_lru.lock);
> -	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> -		/* The page is being reclaimed. */
> -		if (list_empty(&page->list)) {
> -			spin_unlock(&sgx_global_lru.lock);
> -			return -EBUSY;
> -		}
> -
> -		list_del(&page->list);
> -		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
> +		spin_unlock(&sgx_global_lru.lock);
> +		return -EBUSY;
> 	}
> +
> +	list_del(&page->list);
> +	sgx_epc_page_reset_state(page);


I want to know how much of this series is basically line-for-line
abstraction shifting like:

-   page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+   sgx_epc_page_reset_state(page);

versus actually adding complexity.  That way, I might be able to offer
some advice on where this can be pared down.  That's really hard to do
with the current series.

Please don't just "introduce new page states".  This should have first
abstracted out the sgx_epc_page_reclaim_in_progress() operation, using
the list_empty() check as the implementation.

Then, in a separate patch, introduce the concept of the "reclaim in
progress" flag and finally flip the implementation over.

Ditto for the sgx_epc_page_reset_state() abstraction.  It should have
been introduced separately as a concept and then have the implementation
changed.

The same goes on into patch 10 (which is much too big), which introduces the
sgx_lru_list() abstraction.
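
To illustrate the suggested first step for the
sgx_epc_page_reclaim_in_progress() helper, a minimal sketch (the
signature is illustrative; the series ultimately passes page->flags
instead):

	/* First patch: same behaviour, new name. The reclaimer takes
	 * pages off the LRU while working on them, so an empty list
	 * node means reclaim is in flight. */
	static inline bool sgx_epc_page_reclaim_in_progress(struct sgx_epc_page *page)
	{
		return list_empty(&page->list);
	}

A later patch can then switch the implementation over to a dedicated
flag without touching the callers.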



Sure. I'll try to refactor according to this plan.
Thanks
Haitao



Re: [PATCH v6 01/12] cgroup/misc: Add per resource callbacks for CSS events

2024-01-05 Thread Haitao Huang

On Fri, 05 Jan 2024 03:45:02 -0600, Michal Koutný  wrote:

On Mon, Oct 30, 2023 at 11:20:02AM -0700, Haitao Huang wrote:

From: Kristen Carlson Accardi 

The misc cgroup controller (subsystem) currently does not perform
resource type specific action for Cgroups Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.

Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.


I notice now that the callbacks are per resource and per cgroup.
I think that is overkill: it complicates the misc_cg allocation code,
which uses the parent's callbacks while freeing is done via a cgroup's
own callbacks. It is probably also too liberal to keep the objects'
lifecycle understandable in the long term.

For some uniformity, I'd suggest the struct blkcg_policy array in
block/blk-cgroup.c as the precedent. (Perhaps indexed by the static enum
misc_res_type instead of the dynamic index assignment that
blkcg_policy_register() does.)

I hope you don't really need per-cgroup callbacks :-)

Michal


Sure, I can do that. IIUC, you are suggesting something similar to how
capacity is set via misc_cg_set_capacity() for each resource type. Here
we would add a misc_cg_set_ops() which stores the ops pointers in a
static array, and then use them indexed by the resource type.
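
For illustration, a rough sketch of that registration scheme (struct
misc_res_ops and misc_cg_set_ops() are hypothetical names used here for
the example; enum misc_res_type and MISC_CG_RES_TYPES come from
include/linux/misc_cgroup.h):

	struct misc_res_ops {
		int (*alloc)(struct misc_cg *cg);	/* css_alloc hook */
		void (*free)(struct misc_cg *cg);	/* css_free hook */
	};

	static const struct misc_res_ops *misc_res_ops_table[MISC_CG_RES_TYPES];

	int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
	{
		if (type >= MISC_CG_RES_TYPES)
			return -EINVAL;
		misc_res_ops_table[type] = ops;
		return 0;
	}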


Thanks
Haitao



Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Vivek Goyal
On Fri, Jan 05, 2024 at 08:57:55PM +, Matthew Wilcox wrote:
> On Fri, Jan 05, 2024 at 03:41:48PM -0500, Vivek Goyal wrote:
> > On Fri, Jan 05, 2024 at 08:21:00PM +, Matthew Wilcox wrote:
> > > On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
> > > > On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
> > > > > From: Hou Tao 
> > > > > 
> > > > > When invoking virtio_fs_enqueue_req() through kworker, both the
> > > > > allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
> > > > > Considering the size of both the sg array and the bounce buffer may be
> > > > > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
> > > > > possibility of memory allocation failure.
> > > > > 
> > > > 
> > > > What's the practical benefit of this patch? Looks like if memory
> > > > allocation fails, we keep retrying at an interval of 1ms and don't
> > > > return an error to user space.
> > > 
> > > You don't deplete the atomic reserves unnecessarily?
> > 
> > Sounds reasonable. 
> > 
> > With GFP_NOFS specified, can we still get -ENOMEM? Or will this block
> > indefinitely till memory can be allocated?
> 
> If you need the "loop indefinitely" behaviour, that's
> GFP_NOFS | __GFP_NOFAIL.  If you're actually doing something yourself
> which can free up memory, this is a bad choice.  If you're just sleeping
> and retrying, you might as well have the MM do that for you.

I probably don't want to wait indefinitely. There might be some cases
where I might want to return an error to user space. For example, if
the virtiofs device has been hot-unplugged, then there is no point in
waiting indefinitely for memory allocation. Even if memory was allocated,
we will soon return an error to user space with -ENOTCONN.

We are currently not doing that check after memory allocation failure but
we probably could as an optimization.

So this patch looks good to me as it is. Thanks Hou Tao.

Reviewed-by: Vivek Goyal 

Thanks
Vivek




Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Matthew Wilcox
On Fri, Jan 05, 2024 at 03:41:48PM -0500, Vivek Goyal wrote:
> On Fri, Jan 05, 2024 at 08:21:00PM +, Matthew Wilcox wrote:
> > On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
> > > On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
> > > > From: Hou Tao 
> > > > 
> > > > When invoking virtio_fs_enqueue_req() through kworker, both the
> > > > allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
> > > > Considering the size of both the sg array and the bounce buffer may be
> > > > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
> > > > possibility of memory allocation failure.
> > > > 
> > > 
> > > What's the practical benefit of this patch? Looks like if memory
> > > allocation fails, we keep retrying at an interval of 1ms and don't
> > > return an error to user space.
> > 
> > You don't deplete the atomic reserves unnecessarily?
> 
> Sounds reasonable. 
> 
> With GFP_NOFS specified, can we still get -ENOMEM? Or will this block
> indefinitely till memory can be allocated?

If you need the "loop indefinitely" behaviour, that's
GFP_NOFS | __GFP_NOFAIL.  If you're actually doing something yourself
which can free up memory, this is a bad choice.  If you're just sleeping
and retrying, you might as well have the MM do that for you.
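
For illustration, the two options side by side (a sketch of the
trade-off being discussed, not code from the patch):

	/* Option 1: let the MM sleep and retry internally; the
	 * allocation cannot fail, but may block for a long time. */
	buf = kmalloc(len, GFP_NOFS | __GFP_NOFAIL);

	/* Option 2: allow failure and drive the retry yourself, which
	 * makes sense if you can free memory in the meantime or want
	 * to bail out early (e.g. on a hot-unplugged device). */
	buf = kmalloc(len, GFP_NOFS);
	if (!buf)
		return -ENOMEM;	/* caller requeues and retries later */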



Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Vivek Goyal
On Fri, Jan 05, 2024 at 08:21:00PM +, Matthew Wilcox wrote:
> On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
> > On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
> > > From: Hou Tao 
> > > 
> > > When invoking virtio_fs_enqueue_req() through kworker, both the
> > > allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
> > > Considering the size of both the sg array and the bounce buffer may be
> > > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
> > > possibility of memory allocation failure.
> > > 
> > 
> > What's the practical benefit of this patch? Looks like if memory
> > allocation fails, we keep retrying at an interval of 1ms and don't
> > return an error to user space.
> 
> You don't deplete the atomic reserves unnecessarily?

Sounds reasonable. 

With GFP_NOFS specified, can we still get -ENOMEM? Or will this block
indefinitely till memory can be allocated?

I am trying to figure out whether, with GFP_NOFS, we still need to check
for -ENOMEM while requeuing the req and asking the worker thread to retry
after 1ms.

Thanks
Vivek




Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Matthew Wilcox
On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
> On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
> > From: Hou Tao 
> > 
> > When invoking virtio_fs_enqueue_req() through kworker, both the
> > allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
> > Considering the size of both the sg array and the bounce buffer may be
> > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
> > possibility of memory allocation failure.
> > 
> 
> What's the practical benefit of this patch? Looks like if memory
> allocation fails, we keep retrying at an interval of 1ms and don't
> return an error to user space.

You don't deplete the atomic reserves unnecessarily?



Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Vivek Goyal
On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
> From: Hou Tao 
> 
> When invoking virtio_fs_enqueue_req() through kworker, both the
> allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
> Considering the size of both the sg array and the bounce buffer may be
> greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
> possibility of memory allocation failure.
> 

What's the practical benefit of this patch? Looks like if memory
allocation fails, we keep retrying at an interval of 1ms and don't
return an error to user space.

Thanks
Vivek

> Signed-off-by: Hou Tao 
> ---
> Change log:
> v2:
>   * pass gfp_t instead of bool to virtio_fs_enqueue_req() (Suggested by Matthew)
> 
> v1: 
> https://lore.kernel.org/linux-fsdevel/20240104015805.2103766-1-hou...@huaweicloud.com
> 
>  fs/fuse/virtio_fs.c | 20 +++-
>  1 file changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 3aac31d451985..8cf518624ce9e 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -87,7 +87,8 @@ struct virtio_fs_req_work {
>  };
>  
>  static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
> -  struct fuse_req *req, bool in_flight);
> +  struct fuse_req *req, bool in_flight,
> +  gfp_t gfp);
>  
>  static const struct constant_table dax_param_enums[] = {
>   {"always",  FUSE_DAX_ALWAYS },
> @@ -383,7 +384,7 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
>   list_del_init(&req->list);
>   spin_unlock(&fsvq->lock);
>  
> - ret = virtio_fs_enqueue_req(fsvq, req, true);
> + ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_NOFS);
>   if (ret < 0) {
>   if (ret == -ENOMEM || ret == -ENOSPC) {
>   spin_lock(&fsvq->lock);
> @@ -488,7 +489,7 @@ static void virtio_fs_hiprio_dispatch_work(struct work_struct *work)
>  }
>  
>  /* Allocate and copy args into req->argbuf */
> -static int copy_args_to_argbuf(struct fuse_req *req)
> +static int copy_args_to_argbuf(struct fuse_req *req, gfp_t gfp)
>  {
>   struct fuse_args *args = req->args;
>   unsigned int offset = 0;
> @@ -502,7 +503,7 @@ static int copy_args_to_argbuf(struct fuse_req *req)
>   len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) +
> fuse_len_args(num_out, args->out_args);
>  
> - req->argbuf = kmalloc(len, GFP_ATOMIC);
> + req->argbuf = kmalloc(len, gfp);
>   if (!req->argbuf)
>   return -ENOMEM;
>  
> @@ -1119,7 +1120,8 @@ static unsigned int sg_init_fuse_args(struct scatterlist *sg,
>  
>  /* Add a request to a virtqueue and kick the device */
>  static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
> -  struct fuse_req *req, bool in_flight)
> +  struct fuse_req *req, bool in_flight,
> +  gfp_t gfp)
>  {
>   /* requests need at least 4 elements */
>   struct scatterlist *stack_sgs[6];
> @@ -1140,8 +1142,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
>   /* Does the sglist fit on the stack? */
>   total_sgs = sg_count_fuse_req(req);
>   if (total_sgs > ARRAY_SIZE(stack_sgs)) {
> - sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
> - sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
> + sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp);
> + sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp);
>   if (!sgs || !sg) {
>   ret = -ENOMEM;
>   goto out;
> @@ -1149,7 +1151,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
>   }
>  
>   /* Use a bounce buffer since stack args cannot be mapped */
> - ret = copy_args_to_argbuf(req);
> + ret = copy_args_to_argbuf(req, gfp);
>   if (ret < 0)
>   goto out;
>  
> @@ -1245,7 +1247,7 @@ __releases(fiq->lock)
>fuse_len_args(req->args->out_numargs, req->args->out_args));
>  
>   fsvq = &fs->vqs[queue_id];
> - ret = virtio_fs_enqueue_req(fsvq, req, false);
> + ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);
>   if (ret < 0) {
>   if (ret == -ENOMEM || ret == -ENOSPC) {
>   /*
> -- 
> 2.29.2
> 




Re: [PATCH v6 00/12] Add Cgroup support for SGX EPC memory

2024-01-05 Thread Haitao Huang
On Fri, 05 Jan 2024 12:29:05 -0600, Dave Hansen wrote:



There's very little about how the LRU design came to be in this cover
letter.  Let's add some details.

How's this?

Writing this up, I'm a lot more convinced that this series is, in
general, taking the right approach.  I honestly don't see any other
alternatives.  As much as I'd love to do something stupidly simple like
just killing enclaves at the moment they hit the limit, that would be a
horrid experience for users _and_ a departure from what the existing
reclaim support does.

That said, there's still a lot of work to do to refactor this series.
It's in need of some love to make it more clear what is going on and to
make the eventual switch over to per-cgroup LRUs more gradual.  Each
patch in the series is still doing way too much, _especially_ in patch 10.


==

The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed.  EPC
allocations can wait for reclaim.  The alternative to waiting would have
been to send a signal and let the enclave die.

This series attempts to implement that same logic for cgroups, for the
same reasons: it's preferable to wait for memory to become available and
let reclaim happen than to do things that are fatal to enclaves.

There is currently a global reclaimable page SGX LRU list.  That list
(and the existing scanning algorithm) is essentially useless for doing
reclaim when a cgroup hits its limit because the cgroup's pages are
scattered around that LRU.  It is unspeakably inefficient to scan a
linked list with millions of entries for what could be dozens of pages
from a cgroup that needs reclaim.

Even if unspeakably slow reclaim were accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU.  It
would either need to hold the list locks for unreasonable amounts of
time, or be taught to scan the list in pieces, which has its own
challenges.


tl;dr: A cgroup hitting its limit should be as similar as possible to
the system running out of EPC memory.  The only two choices to implement
that are nasty changes to the existing LRU scanning algorithm, or to add
new LRUs.  The result: add a new LRU for each cgroup and scan those
instead.  Replace the existing global LRU with the root cgroup's LRU
(only when this new support is compiled in, obviously).



I'll add this to the cover letter as a section justifying the LRU design  
for per-cgroup reclaiming.

Thank you very much.

Haitao



Re: [PATCH v6 00/12] Add Cgroup support for SGX EPC memory

2024-01-05 Thread Dave Hansen
There's very little about how the LRU design came to be in this cover
letter.  Let's add some details.

How's this?

Writing this up, I'm a lot more convinced that this series is, in
general, taking the right approach.  I honestly don't see any other
alternatives.  As much as I'd love to do something stupidly simple like
just killing enclaves at the moment they hit the limit, that would be a
horrid experience for users _and_ a departure from what the existing
reclaim support does.

That said, there's still a lot of work to do to refactor this series.
It's in need of some love to make it more clear what is going on and to
make the eventual switch over to per-cgroup LRUs more gradual.  Each
patch in the series is still doing way too much, _especially_ in patch 10.

==

The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed.  EPC
allocations can wait for reclaim.  The alternative to waiting would have
been to send a signal and let the enclave die.

This series attempts to implement that same logic for cgroups, for the
same reasons: it's preferable to wait for memory to become available and
let reclaim happen than to do things that are fatal to enclaves.

There is currently a global reclaimable page SGX LRU list.  That list
(and the existing scanning algorithm) is essentially useless for doing
reclaim when a cgroup hits its limit because the cgroup's pages are
scattered around that LRU.  It is unspeakably inefficient to scan a
linked list with millions of entries for what could be dozens of pages
from a cgroup that needs reclaim.

Even if unspeakably slow reclaim were accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU.  It
would either need to hold the list locks for unreasonable amounts of
time, or be taught to scan the list in pieces, which has its own challenges.

tl;dr: A cgroup hitting its limit should be as similar as possible to
the system running out of EPC memory.  The only two choices to implement
that are nasty changes to the existing LRU scanning algorithm, or to add
new LRUs.  The result: add a new LRU for each cgroup and scan those
instead.  Replace the existing global LRU with the root cgroup's LRU
(only when this new support is compiled in, obviously).
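
For illustration, the shape being argued for is roughly this (struct
and field names here are illustrative, not necessarily those used in
the series):

	struct sgx_epc_lru_list {
		spinlock_t lock;
		struct list_head reclaimable;
	};

	/* One LRU per cgroup; the root cgroup's LRU replaces the
	 * existing global one when cgroup support is compiled in,
	 * so a limit-hit reclaims from exactly the right pages. */
	struct sgx_cgroup {
		struct misc_cg *cg;
		struct sgx_epc_lru_list lru;
	};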



Re: [PATCH v6 07/12] x86/sgx: Introduce EPC page states

2024-01-05 Thread Dave Hansen
On 10/30/23 11:20, Haitao Huang wrote:
> @@ -527,16 +530,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
>  int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>  {
>   spin_lock(&sgx_global_lru.lock);
> - if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> - /* The page is being reclaimed. */
> - if (list_empty(&page->list)) {
> - spin_unlock(&sgx_global_lru.lock);
> - return -EBUSY;
> - }
> -
> - list_del(&page->list);
> - page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + if (sgx_epc_page_reclaim_in_progress(page->flags)) {
> + spin_unlock(&sgx_global_lru.lock);
> + return -EBUSY;
>   }
> +
> + list_del(&page->list);
> + sgx_epc_page_reset_state(page);

I want to know how much of this series is basically line-for-line
abstraction shifting like:

-   page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+   sgx_epc_page_reset_state(page);

versus actually adding complexity.  That way, I might be able to offer
some advice on where this can be pared down.  That's really hard to do
with the current series.

Please don't just "introduce new page states".  This should have first
abstracted out the sgx_epc_page_reclaim_in_progress() operation, using
the list_empty() check as the implementation.

Then, in a separate patch, introduce the concept of the "reclaim in
progress" flag and finally flip the implementation over.

Ditto for the sgx_epc_page_reset_state() abstraction.  It should have
been introduced separately as a concept and then have the implementation
changed.

The same goes on into patch 10 (which is much too big), which introduces
the sgx_lru_list() abstraction.



Re: [PATCH v5 22/34] tracing: Rename ftrace_regs_return_value to ftrace_regs_get_return_value

2024-01-05 Thread Mark Rutland
On Mon, Dec 18, 2023 at 10:15:59PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> Rename ftrace_regs_return_value to ftrace_regs_get_return_value as same as
> other ftrace_regs_get/set_* APIs.
> 
> Signed-off-by: Masami Hiramatsu (Google) 

Acked-by: Mark Rutland 

Since this is a trivial cleanup, it might make sense to move this to the start
of the series, so that it can be queued even if the later parts need more work.

Mark.

> ---
>  Changes in v3:
>   - Newly added.
> ---
>  arch/loongarch/include/asm/ftrace.h |2 +-
>  arch/powerpc/include/asm/ftrace.h   |2 +-
>  arch/s390/include/asm/ftrace.h  |2 +-
>  arch/x86/include/asm/ftrace.h   |2 +-
>  include/linux/ftrace.h  |2 +-
>  kernel/trace/fgraph.c   |2 +-
>  6 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/loongarch/include/asm/ftrace.h b/arch/loongarch/include/asm/ftrace.h
> index a11996eb5892..a9c3d0f2f941 100644
> --- a/arch/loongarch/include/asm/ftrace.h
> +++ b/arch/loongarch/include/asm/ftrace.h
> @@ -70,7 +70,7 @@ ftrace_regs_set_instruction_pointer(struct ftrace_regs *fregs, unsigned long ip)
>   regs_get_kernel_argument(&(fregs)->regs, n)
>  #define ftrace_regs_get_stack_pointer(fregs) \
>   kernel_stack_pointer(&(fregs)->regs)
> -#define ftrace_regs_return_value(fregs) \
> +#define ftrace_regs_get_return_value(fregs) \
>   regs_return_value(&(fregs)->regs)
>  #define ftrace_regs_set_return_value(fregs, ret) \
>   regs_set_return_value(&(fregs)->regs, ret)
> diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h
> index 9e5a39b6a311..7e138e0e3baf 100644
> --- a/arch/powerpc/include/asm/ftrace.h
> +++ b/arch/powerpc/include/asm/ftrace.h
> @@ -69,7 +69,7 @@ ftrace_regs_get_instruction_pointer(struct ftrace_regs *fregs)
>   regs_get_kernel_argument(&(fregs)->regs, n)
>  #define ftrace_regs_get_stack_pointer(fregs) \
>   kernel_stack_pointer(&(fregs)->regs)
> -#define ftrace_regs_return_value(fregs) \
> +#define ftrace_regs_get_return_value(fregs) \
>   regs_return_value(&(fregs)->regs)
>  #define ftrace_regs_set_return_value(fregs, ret) \
>   regs_set_return_value(&(fregs)->regs, ret)
> diff --git a/arch/s390/include/asm/ftrace.h b/arch/s390/include/asm/ftrace.h
> index 5a82b08f03cd..01e775c98425 100644
> --- a/arch/s390/include/asm/ftrace.h
> +++ b/arch/s390/include/asm/ftrace.h
> @@ -88,7 +88,7 @@ ftrace_regs_set_instruction_pointer(struct ftrace_regs *fregs,
>   regs_get_kernel_argument(&(fregs)->regs, n)
>  #define ftrace_regs_get_stack_pointer(fregs) \
>   kernel_stack_pointer(&(fregs)->regs)
> -#define ftrace_regs_return_value(fregs) \
> +#define ftrace_regs_get_return_value(fregs) \
>   regs_return_value(&(fregs)->regs)
>  #define ftrace_regs_set_return_value(fregs, ret) \
>   regs_set_return_value(&(fregs)->regs, ret)
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index 0b306c82855d..a061f8832b20 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h
> @@ -64,7 +64,7 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs)
>   regs_get_kernel_argument(&(fregs)->regs, n)
>  #define ftrace_regs_get_stack_pointer(fregs) \
>   kernel_stack_pointer(&(fregs)->regs)
> -#define ftrace_regs_return_value(fregs) \
> +#define ftrace_regs_get_return_value(fregs) \
>   regs_return_value(&(fregs)->regs)
>  #define ftrace_regs_set_return_value(fregs, ret) \
>   regs_set_return_value(&(fregs)->regs, ret)
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index 79875a00c02b..da2a23f5a9ed 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -187,7 +187,7 @@ static __always_inline bool ftrace_regs_has_args(struct ftrace_regs *fregs)
>   regs_get_kernel_argument(ftrace_get_regs(fregs), n)
>  #define ftrace_regs_get_stack_pointer(fregs) \
>   kernel_stack_pointer(ftrace_get_regs(fregs))
> -#define ftrace_regs_return_value(fregs) \
> +#define ftrace_regs_get_return_value(fregs) \
>   regs_return_value(ftrace_get_regs(fregs))
>  #define ftrace_regs_set_return_value(fregs, ret) \
>   regs_set_return_value(ftrace_get_regs(fregs), ret)
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index 088432b695a6..9a60acaacc96 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -783,7 +783,7 @@ static void fgraph_call_retfunc(struct ftrace_regs *fregs,
>   trace.rettime = trace_clock_local();
>  #ifdef CONFIG_FUNCTION_GRAPH_RETVAL
>   if (fregs)
> - trace.retval = ftrace_regs_return_value(fregs);
> + trace.retval = ftrace_regs_get_return_value(fregs);
>   else
>   trace.retval = fgraph_ret_regs_return_value(ret_regs);
>  #endif
> 



Re: [PATCH v5 01/34] tracing: Add a comment about ftrace_regs definition

2024-01-05 Thread Mark Rutland
On Mon, Dec 18, 2023 at 10:11:44PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> To clarify what will be expected on ftrace_regs, add a comment to the
> architecture independent definition of the ftrace_regs.
> 
> Signed-off-by: Masami Hiramatsu (Google) 

Acked-by: Mark Rutland 

Mark.

> ---
>  Changes in v3:
>   - Add instruction pointer
>  Changes in v2:
>   - newly added.
> ---
>  include/linux/ftrace.h |   26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index e8921871ef9a..8b48fc621ea0 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -118,6 +118,32 @@ extern int ftrace_enabled;
>  
>  #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
>  
> +/**
> + * ftrace_regs - ftrace partial/optimal register set
> + *
> + * ftrace_regs represents a group of registers which is used at the
> + * function entry and exit. There are three types of registers.
> + *
> + * - Registers for passing the parameters to callee, including the stack
> + *   pointer. (e.g. rcx, rdx, rdi, rsi, r8, r9 and rsp on x86_64)
> + * - Registers for passing the return values to caller.
> + *   (e.g. rax and rdx on x86_64)
> + * - Registers for hooking the function call and return including the
> + *   frame pointer (the frame pointer is architecture/config dependent)
> + *   (e.g. rip, rbp and rsp for x86_64)
> + *
> + * Also, architecture dependent fields can be used for internal process.
> + * (e.g. orig_ax on x86_64)
> + *
> + * On the function entry, those registers will be restored except for
> + * the stack pointer, so that user can change the function parameters
> + * and instruction pointer (e.g. live patching.)
> + * On the function exit, only registers which is used for return values
> + * are restored.
> + *
> + * NOTE: user *must not* access regs directly, only do it via APIs, because
> + * the member can be changed according to the architecture.
> + */
>  struct ftrace_regs {
>   struct pt_regs  regs;
>  };
> 
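
As a usage illustration (not part of the patch), a callback is expected
to go through the accessor APIs rather than touch the members directly;
a sketch using accessors that already exist:

	static void my_callback(unsigned long ip, unsigned long parent_ip,
				struct ftrace_ops *op, struct ftrace_regs *fregs)
	{
		/* OK: architecture-independent accessors. */
		unsigned long pc   = ftrace_regs_get_instruction_pointer(fregs);
		unsigned long arg0 = ftrace_regs_get_argument(fregs, 0);

		pr_info("%pS arg0=%lx\n", (void *)pc, arg0);

		/* Not OK: fregs->regs.ip, since the members can change
		 * per architecture and configuration. */
	}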



Re: [PATCH v5 11/34] function_graph: Have the instances use their own ftrace_ops for filtering

2024-01-05 Thread Mark Rutland
On Mon, Dec 18, 2023 at 10:13:46PM +0900, Masami Hiramatsu (Google) wrote:
> From: Steven Rostedt (VMware) 
> 
> Allow instances to have their own ftrace_ops as part of the fgraph_ops,
> which makes the function_graph tracer filter on the set_ftrace_filter file
> of the instance and not the top instance.
> 
> This also changes how function_graph handles multiple instances on the
> shadow stack. Previously we used ARRAY type entries to record which one
> is enabled, and this makes it a bitmap of the fgraph_array's indexes.
> Previously, function_graph_enter() expected to be called back from the
> prepare_ftrace_return() function, which is called back only once if it is
> enabled. But this introduces a different ftrace_ops for each fgraph
> instance, and those are called from ftrace_graph_func() one by one. Thus
> we can not loop on the fgraph_array(), and need to reuse the ret_stack
> pushed by the previous instance. Finding the ret_stack is easy because
> we can check ret_stack->func. But that is not enough for the self-
> recursive tail-call case. Thus fgraph uses the bitmap entry to find
> whether it is already set (meaning that entry is for a previous tail call).
> 
> Signed-off-by: Steven Rostedt (VMware) 
> Signed-off-by: Masami Hiramatsu (Google) 

As a heads-up, while testing the topic/fprobe-on-fgraph branch on arm64, I get
a warning which bisects down to this commit:

| Testing tracer function_graph: 
| [ cut here ]
| WARNING: CPU: 2 PID: 0 at arch/arm64/kernel/stacktrace.c:84 
arch_stack_walk+0x3c0/0x3d8
| Modules linked in:
| CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.7.0-rc2-00026-gea1e68a341c2 #12
| Hardware name: linux,dummy-virt (DT)
| pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x3c0/0x3d8
| lr : arch_stack_walk+0x260/0x3d8
| sp : 80008318be00
| x29: 80008318be00 x28: 03c0ae80 x27: 
| x26:  x25: 03c0ae80 x24: 
| x23: 8000800234c8 x22: 80008002dc30 x21: 800080035d10
| x20: 80008318bee8 x19: 800080023460 x18: 800083453c68
| x17:  x16: 800083188000 x15: 8ccc5058
| x14: 0004 x13: 800082b8c4f0 x12: 
| x11: 800081fba9b0 x10: 80008318bff0 x9 : 800080010798
| x8 : 80008002dc30 x7 : 03c0ae80 x6 : 
| x5 :  x4 : 8000832a3c18 x3 : 80008318bff0
| x2 : 80008002dc30 x1 : 80008002dc30 x0 : 80008002dc30
| Call trace:
|  arch_stack_walk+0x3c0/0x3d8
|  return_address+0x40/0x80
|  trace_hardirqs_on+0x8c/0x198
|  __do_softirq+0xe8/0x440
| ---[ end trace  ]---

That's a warning in arm64's unwind_recover_return_address() function, which
fires when ftrace_graph_ret_addr() finds return_to_handler:

if (state->task->ret_stack &&
(state->pc == (unsigned long)return_to_handler)) {
unsigned long orig_pc;
orig_pc = ftrace_graph_ret_addr(state->task, NULL, state->pc,
(void *)state->fp);
if (WARN_ON_ONCE(state->pc == orig_pc))
return -EINVAL;
state->pc = orig_pc;
}

The rationale there is that since tail calls are (currently) disabled on arm64,
the only reason for ftrace_graph_ret_addr() to return return_to_handler is when
it fails to find the original return address.

Does this change make it legitimate for ftrace_graph_ret_addr() to return
return_to_handler in other cases, or is that a bug?

Either way, we'll need *some* way to recover the original return address...

Mark.

> ---
>  Changes in v4:
>   - Simplify get_ret_stack() sanity check and use WARN_ON_ONCE() for
> obviously wrong value.
>   - Do not check ret == return_to_handler but always read the previous
> ret_stack in ftrace_push_return_trace() to check it is reusable.
>   - Set the bit 0 of the bitmap entry always in function_graph_enter()
> because it uses bit 0 to check re-usability.
>   - Fix to ensure the ret_stack entry is bitmap type when checking the
> bitmap.
>  Changes in v3:
>   - Pass current fgraph_ops to the new entry handler
>(function_graph_enter_ops) if fgraph use ftrace.
>   - Add fgraph_ops::idx in this patch.
>   - Replace the array type with the bitmap type so that it can record
> which fgraph is called.
>   - Fix some helper function to use passed task_struct instead of current.
>   - Reduce the ret-index size to 1024 words.
>   - Make the ret-index directly points the ret_stack.
>   - Fix ftrace_graph_ret_addr() to handle tail-call case correctly.
>  Changes in v2:
>   - Use ftrace_graph_func and FTRACE_OPS_GRAPH_STUB instead of
> ftrace_stub and FTRACE_OPS_FL_STUB for new ftrace based fgraph.
> ---
>  arch/arm64/kernel/ftrace.c   |   19 ++
>  arch/x86/kernel/ftrace.c |   19 ++
>  include/linux/ftrace.h   |7 +
>  kernel/tra

Re: [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill()

2024-01-05 Thread Alexander H Duyck
On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> The page frag in vhost_net_page_frag_refill() uses the
> 'struct page_frag' from skb_page_frag_refill(), but its
> implementation is similar to page_frag_alloc_align() now.
> 
> This patch removes vhost_net_page_frag_refill() by using
> 'struct page_frag_cache' instead of 'struct page_frag',
> and allocating frag using page_frag_alloc_align().
> 
> The added benefit is that it not only unifies the page frag
> implementation a little, but also gives about a 0.5% performance
> boost when testing with the vhost_net_test introduced in the
> last patch.
> 
> Signed-off-by: Yunsheng Lin 
> Acked-by: Jason Wang 
> ---
>  drivers/vhost/net.c | 93 ++---
>  1 file changed, 29 insertions(+), 64 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index e574e21cc0ca..805e11d598e4 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -141,10 +141,8 @@ struct vhost_net {
>   unsigned tx_zcopy_err;
>   /* Flush in progress. Protected by tx vq lock. */
>   bool tx_flush;
> - /* Private page frag */
> - struct page_frag page_frag;
> - /* Refcount bias of page frag */
> - int refcnt_bias;
> + /* Private page frag cache */
> + struct page_frag_cache pf_cache;
>  };
>  
>  static unsigned vhost_net_zcopy_mask __read_mostly;
> @@ -655,41 +653,6 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
>  !vhost_vq_avail_empty(vq->dev, vq);
>  }
>  
> -static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
> -struct page_frag *pfrag, gfp_t gfp)
> -{
> - if (pfrag->page) {
> - if (pfrag->offset + sz <= pfrag->size)
> - return true;
> - __page_frag_cache_drain(pfrag->page, net->refcnt_bias);
> - }
> -
> - pfrag->offset = 0;
> - net->refcnt_bias = 0;
> - if (SKB_FRAG_PAGE_ORDER) {
> - /* Avoid direct reclaim but allow kswapd to wake */
> - pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> -   __GFP_COMP | __GFP_NOWARN |
> -   __GFP_NORETRY | __GFP_NOMEMALLOC,
> -   SKB_FRAG_PAGE_ORDER);
> - if (likely(pfrag->page)) {
> - pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> - goto done;
> - }
> - }
> - pfrag->page = alloc_page(gfp);
> - if (likely(pfrag->page)) {
> - pfrag->size = PAGE_SIZE;
> - goto done;
> - }
> - return false;
> -
> -done:
> - net->refcnt_bias = USHRT_MAX;
> - page_ref_add(pfrag->page, USHRT_MAX - 1);
> - return true;
> -}
> -
>  #define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
>  
>  static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
> @@ -699,7 +662,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>   struct vhost_net *net = container_of(vq->dev, struct vhost_net,
>dev);
>   struct socket *sock = vhost_vq_get_backend(vq);
> - struct page_frag *alloc_frag = &net->page_frag;
>   struct virtio_net_hdr *gso;
>   struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
>   struct tun_xdp_hdr *hdr;
> @@ -710,6 +672,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>   int sock_hlen = nvq->sock_hlen;
>   void *buf;
>   int copied;
> + int ret;
>  
>   if (unlikely(len < nvq->sock_hlen))
>   return -EFAULT;
> @@ -719,18 +682,17 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>   return -ENOSPC;
>  
>   buflen += SKB_DATA_ALIGN(len + pad);
> - alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
> - if (unlikely(!vhost_net_page_frag_refill(net, buflen,
> -  alloc_frag, GFP_KERNEL)))
> + buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL,
> + SMP_CACHE_BYTES);

If your changes from patch 1 are just to make it fit into this layout
might I suggest just splitting up page_frag_alloc_align into an inline
that accepts the arguments you have here, and adding
__page_frag_alloc_align which is passed the mask the original function
expected. By doing that you should be able to maintain the same level
of performance and still get most of the code cleanup.
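
For illustration, that split might look like this sketch (modeled on
the existing page_frag_alloc_align(); the exact signatures are
illustrative):

	void *__page_frag_alloc_align(struct page_frag_cache *nc,
				      unsigned int fragsz, gfp_t gfp_mask,
				      unsigned int align_mask);

	/* Inline wrapper keeping today's "align" calling convention:
	 * a power-of-2 alignment is converted to the mask the slow
	 * path expects, so existing callers need no other change. */
	static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
						  unsigned int fragsz,
						  gfp_t gfp_mask,
						  unsigned int align)
	{
		WARN_ON_ONCE(!is_power_of_2(align));
		return __page_frag_alloc_align(nc, fragsz, gfp_mask, -align);
	}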

> + if (unlikely(!buf))
>   return -ENOMEM;
>  
> - buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> - copied = copy_page_from_iter(alloc_frag->page,
> -  alloc_frag->offset +
> -  offsetof(struct tun_xdp_hdr, gso),
> -  sock_hlen, from);
> - if (copied != soc

[PATCH v2 5/5] media: venus: Convert to dev_pm_domain_attach|detach_list() for vcodec

2024-01-05 Thread Ulf Hansson
Let's avoid some of the boilerplate code to manage the vcodec PM domains,
by converting to dev_pm_domain_attach|detach_list().

Cc: Mauro Carvalho Chehab 
Cc: Stanimir Varbanov 
Cc: Vikash Garodia 
Cc: "Bryan O'Donoghue" 
Cc: Bjorn Andersson 
Cc: Konrad Dybcio 
Cc: 
Tested-by: Bryan O'Donoghue 
Reviewed-by: Bryan O'Donoghue 
Signed-off-by: Ulf Hansson 
---

Changes in v2:
- Added Bryan's tags.

---
 drivers/media/platform/qcom/venus/core.c  | 12 +++--
 drivers/media/platform/qcom/venus/core.h  |  7 ++-
 .../media/platform/qcom/venus/pm_helpers.c| 48 +++
 3 files changed, 26 insertions(+), 41 deletions(-)

diff --git a/drivers/media/platform/qcom/venus/core.c b/drivers/media/platform/qcom/venus/core.c
index 9cffe975581b..bd9b474280e4 100644
--- a/drivers/media/platform/qcom/venus/core.c
+++ b/drivers/media/platform/qcom/venus/core.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -114,7 +115,8 @@ static void venus_sys_error_handler(struct work_struct *work)
pm_runtime_put_sync(core->dev);
 
for (i = 0; i < max_attempts; i++) {
-   if (!core->pmdomains[0] || !pm_runtime_active(core->pmdomains[0]))
+   if (!core->pmdomains ||
+   !pm_runtime_active(core->pmdomains->pd_devs[0]))
break;
usleep_range(1000, 1500);
}
@@ -705,7 +707,7 @@ static const struct venus_resources sdm845_res_v2 = {
.vcodec0_clks = { "vcodec0_core", "vcodec0_bus" },
.vcodec1_clks = { "vcodec1_core", "vcodec1_bus" },
.vcodec_clks_num = 2,
-   .vcodec_pmdomains = { "venus", "vcodec0", "vcodec1" },
+   .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0", "vcodec1" },
.vcodec_pmdomains_num = 3,
.opp_pmdomain = (const char *[]) { "cx", NULL },
.vcodec_num = 2,
@@ -754,7 +756,7 @@ static const struct venus_resources sc7180_res = {
.clks_num = 3,
.vcodec0_clks = { "vcodec0_core", "vcodec0_bus" },
.vcodec_clks_num = 2,
-   .vcodec_pmdomains = { "venus", "vcodec0" },
+   .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0" },
.vcodec_pmdomains_num = 2,
.opp_pmdomain = (const char *[]) { "cx", NULL },
.vcodec_num = 1,
@@ -811,7 +813,7 @@ static const struct venus_resources sm8250_res = {
.resets_num = 2,
.vcodec0_clks = { "vcodec0_core" },
.vcodec_clks_num = 1,
-   .vcodec_pmdomains = { "venus", "vcodec0" },
+   .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0" },
.vcodec_pmdomains_num = 2,
.opp_pmdomain = (const char *[]) { "mx", NULL },
.vcodec_num = 1,
@@ -870,7 +872,7 @@ static const struct venus_resources sc7280_res = {
.clks_num = 3,
.vcodec0_clks = {"vcodec_core", "vcodec_bus"},
.vcodec_clks_num = 2,
-   .vcodec_pmdomains = { "venus", "vcodec0" },
+   .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0" },
.vcodec_pmdomains_num = 2,
.opp_pmdomain = (const char *[]) { "cx", NULL },
.vcodec_num = 1,
diff --git a/drivers/media/platform/qcom/venus/core.h b/drivers/media/platform/qcom/venus/core.h
index 4a633261ece4..7ef341bf21cc 100644
--- a/drivers/media/platform/qcom/venus/core.h
+++ b/drivers/media/platform/qcom/venus/core.h
@@ -25,7 +25,6 @@
 
 #define VIDC_CLKS_NUM_MAX  4
 #define VIDC_VCODEC_CLKS_NUM_MAX   2
-#define VIDC_PMDOMAINS_NUM_MAX 3
 #define VIDC_RESETS_NUM_MAX2
 
 extern int venus_fw_debug;
@@ -72,7 +71,7 @@ struct venus_resources {
const char * const vcodec0_clks[VIDC_VCODEC_CLKS_NUM_MAX];
const char * const vcodec1_clks[VIDC_VCODEC_CLKS_NUM_MAX];
unsigned int vcodec_clks_num;
-   const char * const vcodec_pmdomains[VIDC_PMDOMAINS_NUM_MAX];
+   const char **vcodec_pmdomains;
unsigned int vcodec_pmdomains_num;
const char **opp_pmdomain;
unsigned int vcodec_num;
@@ -134,7 +133,7 @@ struct venus_format {
  * @video_path: an interconnect handle to video to/from memory path
  * @cpucfg_path: an interconnect handle to cpu configuration path
  * @has_opp_table: does OPP table exist
- * @pmdomains: an array of pmdomains struct device pointers
+ * @pmdomains: a pointer to a list of pmdomains
  * @opp_dl_venus: an device-link for device OPP
  * @opp_pmdomain: an OPP power-domain
  * @resets: an array of reset signals
@@ -187,7 +186,7 @@ struct venus_core {
struct icc_path *video_path;
struct icc_path *cpucfg_path;
bool has_opp_table;
-   struct device *pmdomains[VIDC_PMDOMAINS_NUM_MAX];
+   struct dev_pm_domain_list *pmdomains;
struct device_link *opp_dl_venus;
struct device *opp_pmdomain;
struct reset_control *resets[VIDC_RESETS_NUM_MAX];
diff --git a/drivers/media/platform/qcom/venus/pm_helpers.c b/drivers/media/platform/qcom/venus/pm_helpers.c
i

[PATCH v2 4/5] remoteproc: qcom_q6v5_adsp: Convert to dev_pm_domain_attach|detach_list()

2024-01-05 Thread Ulf Hansson
Let's avoid some of the boilerplate code to manage the various PM domain
cases, by converting to dev_pm_domain_attach|detach_list().

As a part of the conversion, we are moving over to use device_links, which
simplifies the runtime PM support too. Moreover, while attaching let's
trust that an already attached single PM domain is the correct one.

Cc: Mathieu Poirier 
Cc: Bjorn Andersson 
Cc: Konrad Dybcio 
Cc: 
Signed-off-by: Ulf Hansson 
---

Changes in v2:
- None.

Kind regards
Ulf Hansson
---
 drivers/remoteproc/qcom_q6v5_adsp.c | 160 +---
 1 file changed, 73 insertions(+), 87 deletions(-)

diff --git a/drivers/remoteproc/qcom_q6v5_adsp.c b/drivers/remoteproc/qcom_q6v5_adsp.c
index 6c67514cc493..93f9a1537ec6 100644
--- a/drivers/remoteproc/qcom_q6v5_adsp.c
+++ b/drivers/remoteproc/qcom_q6v5_adsp.c
@@ -55,8 +55,6 @@
 #define QDSP6SS_CORE_CBCR  0x20
 #define QDSP6SS_SLEEP_CBCR 0x3c
 
-#define QCOM_Q6V5_RPROC_PROXY_PD_MAX   3
-
 #define LPASS_BOOT_CORE_START  BIT(0)
 #define LPASS_BOOT_CMD_START   BIT(0)
 #define LPASS_EFUSE_Q6SS_EVB_SEL 0x0
@@ -74,7 +72,8 @@ struct adsp_pil_data {
 
const char **clk_ids;
int num_clks;
-   const char **proxy_pd_names;
+   const char **pd_names;
+   unsigned int num_pds;
const char *load_state;
 };
 
@@ -110,8 +109,7 @@ struct qcom_adsp {
size_t mem_size;
bool has_iommu;
 
-   struct device *proxy_pds[QCOM_Q6V5_RPROC_PROXY_PD_MAX];
-   size_t proxy_pd_count;
+   struct dev_pm_domain_list *pd_list;
 
struct qcom_rproc_glink glink_subdev;
struct qcom_rproc_ssr ssr_subdev;
@@ -120,98 +118,92 @@ struct qcom_adsp {
int (*shutdown)(struct qcom_adsp *adsp);
 };
 
-static int qcom_rproc_pds_attach(struct device *dev, struct qcom_adsp *adsp,
-const char **pd_names)
+static int qcom_rproc_pds_attach(struct qcom_adsp *adsp, const char **pd_names,
+unsigned int num_pds)
 {
-   struct device **devs = adsp->proxy_pds;
-   size_t num_pds = 0;
+   struct device *dev = adsp->dev;
+   struct dev_pm_domain_attach_data pd_data = {
+   .pd_names = pd_names,
+   .num_pd_names = num_pds,
+   };
int ret;
-   int i;
-
-   if (!pd_names)
-   return 0;
 
/* Handle single power domain */
-   if (dev->pm_domain) {
-   devs[0] = dev;
-   pm_runtime_enable(dev);
-   return 1;
-   }
+   if (dev->pm_domain)
+   goto out;
 
-   while (pd_names[num_pds])
-   num_pds++;
+   if (!pd_names)
+   return 0;
 
-   if (num_pds > ARRAY_SIZE(adsp->proxy_pds))
-   return -E2BIG;
+   ret = dev_pm_domain_attach_list(dev, &pd_data, &adsp->pd_list);
+   if (ret < 0)
+   return ret;
 
-   for (i = 0; i < num_pds; i++) {
-   devs[i] = dev_pm_domain_attach_by_name(dev, pd_names[i]);
-   if (IS_ERR_OR_NULL(devs[i])) {
-   ret = PTR_ERR(devs[i]) ? : -ENODATA;
-   goto unroll_attach;
-   }
-   }
+out:
+   pm_runtime_enable(dev);
+   return 0;
+}
 
-   return num_pds;
+static void qcom_rproc_pds_detach(struct qcom_adsp *adsp)
+{
+   struct device *dev = adsp->dev;
+   struct dev_pm_domain_list *pds = adsp->pd_list;
 
-unroll_attach:
-   for (i--; i >= 0; i--)
-   dev_pm_domain_detach(devs[i], false);
+   dev_pm_domain_detach_list(pds);
 
-   return ret;
+   if (dev->pm_domain || pds)
+   pm_runtime_disable(adsp->dev);
 }
 
-static void qcom_rproc_pds_detach(struct qcom_adsp *adsp, struct device **pds,
- size_t pd_count)
+static int qcom_rproc_pds_enable(struct qcom_adsp *adsp)
 {
struct device *dev = adsp->dev;
-   int i;
+   struct dev_pm_domain_list *pds = adsp->pd_list;
+   int ret, i = 0;
 
-   /* Handle single power domain */
-   if (dev->pm_domain && pd_count) {
-   pm_runtime_disable(dev);
-   return;
-   }
+   if (!dev->pm_domain && !pds)
+   return 0;
 
-   for (i = 0; i < pd_count; i++)
-   dev_pm_domain_detach(pds[i], false);
-}
+   if (dev->pm_domain)
+   dev_pm_genpd_set_performance_state(dev, INT_MAX);
 
-static int qcom_rproc_pds_enable(struct qcom_adsp *adsp, struct device **pds,
-size_t pd_count)
-{
-   int ret;
-   int i;
-
-   for (i = 0; i < pd_count; i++) {
-   dev_pm_genpd_set_performance_state(pds[i], INT_MAX);
-   ret = pm_runtime_resume_and_get(pds[i]);
-   if (ret < 0) {
-   dev_pm_genpd_set_performance_state(pds[i], 0);
-   goto unroll_pd_votes;
-   }
+   while (pds && i < pds->num_pds) {
+

[PATCH v2 3/5] remoteproc: imx_rproc: Convert to dev_pm_domain_attach|detach_list()

2024-01-05 Thread Ulf Hansson
Let's avoid the boilerplate code to manage the multiple PM domain case, by
converting to dev_pm_domain_attach|detach_list().

Cc: Mathieu Poirier 
Cc: Bjorn Andersson 
Cc: Shawn Guo 
Cc: Sascha Hauer 
Cc: Iuliana Prodan 
Cc: Daniel Baluta 
Cc: 
Signed-off-by: Ulf Hansson 
---

Changes in v2:
- None.

Iuliana/Daniel I am ccing you to request help with test/review of this change.
Note that, you will need patch 1/5 in the series too, to be able to test this.

Kind regards
Ulf Hansson

---
 drivers/remoteproc/imx_rproc.c | 73 +-
 1 file changed, 9 insertions(+), 64 deletions(-)

diff --git a/drivers/remoteproc/imx_rproc.c b/drivers/remoteproc/imx_rproc.c
index 8bb293b9f327..3161f14442bc 100644
--- a/drivers/remoteproc/imx_rproc.c
+++ b/drivers/remoteproc/imx_rproc.c
@@ -92,7 +92,6 @@ struct imx_rproc_mem {
 
 static int imx_rproc_xtr_mbox_init(struct rproc *rproc);
 static void imx_rproc_free_mbox(struct rproc *rproc);
-static int imx_rproc_detach_pd(struct rproc *rproc);
 
 struct imx_rproc {
struct device   *dev;
@@ -113,10 +112,8 @@ struct imx_rproc {
u32 rproc_pt;   /* partition id */
u32 rsrc_id;/* resource id */
u32 entry;  /* cpu start address */
-   int num_pd;
u32 core_index;
-   struct device   **pd_dev;
-   struct device_link  **pd_dev_link;
+   struct dev_pm_domain_list   *pd_list;
 };
 
 static const struct imx_rproc_att imx_rproc_att_imx93[] = {
@@ -853,7 +850,7 @@ static void imx_rproc_put_scu(struct rproc *rproc)
return;
 
if (imx_sc_rm_is_resource_owned(priv->ipc_handle, priv->rsrc_id)) {
-   imx_rproc_detach_pd(rproc);
+   dev_pm_domain_detach_list(priv->pd_list);
return;
}
 
@@ -880,72 +877,20 @@ static int imx_rproc_partition_notify(struct notifier_block *nb,
 static int imx_rproc_attach_pd(struct imx_rproc *priv)
 {
struct device *dev = priv->dev;
-   int ret, i;
-
-   /*
-* If there is only one power-domain entry, the platform driver framework
-* will handle it, no need handle it in this driver.
-*/
-   priv->num_pd = of_count_phandle_with_args(dev->of_node, "power-domains",
- "#power-domain-cells");
-   if (priv->num_pd <= 1)
-   return 0;
-
-   priv->pd_dev = devm_kmalloc_array(dev, priv->num_pd, sizeof(*priv->pd_dev), GFP_KERNEL);
-   if (!priv->pd_dev)
-   return -ENOMEM;
-
-   priv->pd_dev_link = devm_kmalloc_array(dev, priv->num_pd, 
sizeof(*priv->pd_dev_link),
-  GFP_KERNEL);
-
-   if (!priv->pd_dev_link)
-   return -ENOMEM;
-
-   for (i = 0; i < priv->num_pd; i++) {
-   priv->pd_dev[i] = dev_pm_domain_attach_by_id(dev, i);
-   if (IS_ERR(priv->pd_dev[i])) {
-   ret = PTR_ERR(priv->pd_dev[i]);
-   goto detach_pd;
-   }
-
-   priv->pd_dev_link[i] = device_link_add(dev, priv->pd_dev[i], DL_FLAG_STATELESS |
-  DL_FLAG_PM_RUNTIME | DL_FLAG_RPM_ACTIVE);
-   if (!priv->pd_dev_link[i]) {
-   dev_pm_domain_detach(priv->pd_dev[i], false);
-   ret = -EINVAL;
-   goto detach_pd;
-   }
-   }
-
-   return 0;
-
-detach_pd:
-   while (--i >= 0) {
-   device_link_del(priv->pd_dev_link[i]);
-   dev_pm_domain_detach(priv->pd_dev[i], false);
-   }
-
-   return ret;
-}
-
-static int imx_rproc_detach_pd(struct rproc *rproc)
-{
-   struct imx_rproc *priv = rproc->priv;
-   int i;
+   int ret;
+   struct dev_pm_domain_attach_data pd_data = {
+   .pd_flags = PD_FLAG_DEV_LINK_ON,
+   };
 
/*
 * If there is only one power-domain entry, the platform driver framework
 * will handle it, no need handle it in this driver.
 */
-   if (priv->num_pd <= 1)
+   if (dev->pm_domain)
return 0;
 
-   for (i = 0; i < priv->num_pd; i++) {
-   device_link_del(priv->pd_dev_link[i]);
-   dev_pm_domain_detach(priv->pd_dev[i], false);
-   }
-
-   return 0;
+   ret = dev_pm_domain_attach_list(dev, &pd_data, &priv->pd_list);
+   return ret < 0 ? ret : 0;
 }
 
 static int imx_rproc_detect_mode(struct imx_rproc *priv)
-- 
2.34.1




[PATCH v2 2/5] remoteproc: imx_dsp_rproc: Convert to dev_pm_domain_attach|detach_list()

2024-01-05 Thread Ulf Hansson
Let's avoid the boilerplate code to manage the multiple PM domain case, by
converting to dev_pm_domain_attach|detach_list().

Cc: Mathieu Poirier 
Cc: Bjorn Andersson 
Cc: Shawn Guo 
Cc: Sascha Hauer 
Cc: Iuliana Prodan 
Cc: Daniel Baluta 
Cc: 
Signed-off-by: Ulf Hansson 
---

Changes in v2:
- None.

Iuliana/Daniel I am ccing you to request help with test/review of this change.
Note that, you will need patch 1/5 in the series too, to be able to test this.

Kind regards
Ulf Hansson

---
 drivers/remoteproc/imx_dsp_rproc.c | 82 --
 1 file changed, 9 insertions(+), 73 deletions(-)

diff --git a/drivers/remoteproc/imx_dsp_rproc.c b/drivers/remoteproc/imx_dsp_rproc.c
index 8fcda9b74545..0409b7c47d5c 100644
--- a/drivers/remoteproc/imx_dsp_rproc.c
+++ b/drivers/remoteproc/imx_dsp_rproc.c
@@ -103,12 +103,10 @@ enum imx_dsp_rp_mbox_messages {
  * @tx_ch: mailbox tx channel handle
  * @rx_ch: mailbox rx channel handle
  * @rxdb_ch: mailbox rx doorbell channel handle
- * @pd_dev: power domain device
- * @pd_dev_link: power domain device link
+ * @pd_list: power domain list
  * @ipc_handle: System Control Unit ipc handle
  * @rproc_work: work for processing virtio interrupts
  * @pm_comp: completion primitive to sync for suspend response
- * @num_domains: power domain number
  * @flags: control flags
  */
 struct imx_dsp_rproc {
@@ -121,12 +119,10 @@ struct imx_dsp_rproc {
struct mbox_chan*tx_ch;
struct mbox_chan*rx_ch;
struct mbox_chan*rxdb_ch;
-   struct device   **pd_dev;
-   struct device_link  **pd_dev_link;
+   struct dev_pm_domain_list   *pd_list;
struct imx_sc_ipc   *ipc_handle;
struct work_struct  rproc_work;
struct completion   pm_comp;
-   int num_domains;
u32 flags;
 };
 
@@ -954,74 +950,14 @@ static const struct rproc_ops imx_dsp_rproc_ops = {
 static int imx_dsp_attach_pm_domains(struct imx_dsp_rproc *priv)
 {
struct device *dev = priv->rproc->dev.parent;
-   int ret, i;
-
-   priv->num_domains = of_count_phandle_with_args(dev->of_node,
-  "power-domains",
-  "#power-domain-cells");
-
-   /* If only one domain, then no need to link the device */
-   if (priv->num_domains <= 1)
-   return 0;
-
-   priv->pd_dev = devm_kmalloc_array(dev, priv->num_domains,
- sizeof(*priv->pd_dev),
- GFP_KERNEL);
-   if (!priv->pd_dev)
-   return -ENOMEM;
-
-   priv->pd_dev_link = devm_kmalloc_array(dev, priv->num_domains,
-  sizeof(*priv->pd_dev_link),
-  GFP_KERNEL);
-   if (!priv->pd_dev_link)
-   return -ENOMEM;
-
-   for (i = 0; i < priv->num_domains; i++) {
-   priv->pd_dev[i] = dev_pm_domain_attach_by_id(dev, i);
-   if (IS_ERR(priv->pd_dev[i])) {
-   ret = PTR_ERR(priv->pd_dev[i]);
-   goto detach_pm;
-   }
-
-   /*
-* device_link_add will check priv->pd_dev[i], if it is
-* NULL, then will break.
-*/
-   priv->pd_dev_link[i] = device_link_add(dev,
-  priv->pd_dev[i],
-  DL_FLAG_STATELESS |
-  DL_FLAG_PM_RUNTIME);
-   if (!priv->pd_dev_link[i]) {
-   dev_pm_domain_detach(priv->pd_dev[i], false);
-   ret = -EINVAL;
-   goto detach_pm;
-   }
-   }
-
-   return 0;
-
-detach_pm:
-   while (--i >= 0) {
-   device_link_del(priv->pd_dev_link[i]);
-   dev_pm_domain_detach(priv->pd_dev[i], false);
-   }
-
-   return ret;
-}
-
-static int imx_dsp_detach_pm_domains(struct imx_dsp_rproc *priv)
-{
-   int i;
+   int ret;
 
-   if (priv->num_domains <= 1)
+   /* A single PM domain is already attached. */
+   if (dev->pm_domain)
return 0;
 
-   for (i = 0; i < priv->num_domains; i++) {
-   device_link_del(priv->pd_dev_link[i]);
-   dev_pm_domain_detach(priv->pd_dev[i], false);
-   }
-
-   return 0;
+   ret = dev_pm_domain_attach_list(dev, NULL, &priv->pd_list);
+   return ret < 0 ? ret : 0;
 }
 
 /**
@@ -1153,7 +1089,7 @@ static int imx_dsp_rproc_probe(struct platform_device *pdev)
return 0;
 
 err_detach_

[PATCH v2 1/5] PM: domains: Add helper functions to attach/detach multiple PM domains

2024-01-05 Thread Ulf Hansson
Attaching/detaching of a device to multiple PM domains has started to
become a common operation for many drivers, typically during ->probe() and
->remove(). In most cases, this has lead to lots of boilerplate code in the
drivers.

To fix up the situation, let's introduce a pair of helper functions,
dev_pm_domain_attach|detach_list(), that drivers can use instead of the
open-coding. Note that it seems reasonable to limit the support for these
helpers to DT-based platforms, as it's the only valid use case for now.
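
As a rough illustration, a converted driver's probe/remove pair could then
shrink to something like the sketch below. The priv struct, the detach
helper's exact signature and the error handling are assumptions based on this
series, not code from a converted driver:

struct my_priv {
	struct dev_pm_domain_list *pd_list;
};

static int my_probe(struct platform_device *pdev)
{
	struct my_priv *priv;
	int ret;

	priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
	if (!priv)
		return -ENOMEM;

	/* NULL data: attach to all power-domains listed in DT. */
	ret = dev_pm_domain_attach_list(&pdev->dev, NULL, &priv->pd_list);
	if (ret < 0)
		return ret;

	/* ret is the number of attached PM domains (0 if none). */
	platform_set_drvdata(pdev, priv);
	return 0;
}

static void my_remove(struct platform_device *pdev)
{
	struct my_priv *priv = platform_get_drvdata(pdev);

	dev_pm_domain_detach_list(priv->pd_list);
}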

Signed-off-by: Ulf Hansson 
---

Changes in v2:
- Fix NULL pointer bug pointed out by Nikunj.

---
 drivers/base/power/common.c | 134 
 include/linux/pm_domain.h   |  38 ++
 2 files changed, 172 insertions(+)

diff --git a/drivers/base/power/common.c b/drivers/base/power/common.c
index 44ec20918a4d..327d168dd37a 100644
--- a/drivers/base/power/common.c
+++ b/drivers/base/power/common.c
@@ -167,6 +167,115 @@ struct device *dev_pm_domain_attach_by_name(struct device *dev,
 }
 EXPORT_SYMBOL_GPL(dev_pm_domain_attach_by_name);
 
+/**
+ * dev_pm_domain_attach_list - Associate a device with its PM domains.
+ * @dev: The device used to lookup the PM domains for.
+ * @data: The data used for attaching to the PM domains.
+ * @list: An out-parameter with an allocated list of attached PM domains.
+ *
+ * This function helps to attach a device to its multiple PM domains. The
+ * caller, which is typically a driver's probe function, may provide a list of
+ * names for the PM domains that we should try to attach the device to, but it
+ * may also provide an empty list, in case the attach should be done for all of
+ * the available PM domains.
+ *
+ * Callers must ensure proper synchronization of this function with power
+ * management callbacks.
+ *
+ * Returns the number of attached PM domains or a negative error code in case of
+ * a failure. Note that, to detach the list of PM domains, the driver shall call
+ * dev_pm_domain_detach_list(), typically during the remove phase.
+ */
+int dev_pm_domain_attach_list(struct device *dev,
+ const struct dev_pm_domain_attach_data *data,
+ struct dev_pm_domain_list **list)
+{
+   struct device_node *np = dev->of_node;
+   struct dev_pm_domain_list *pds;
+   struct device *pd_dev = NULL;
+   int ret, i, num_pds = 0;
+   bool by_id = true;
+   u32 pd_flags = data ? data->pd_flags : 0;
+   u32 link_flags = pd_flags & PD_FLAG_NO_DEV_LINK ? 0 :
+   DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME;
+
+   if (dev->pm_domain)
+   return -EEXIST;
+
+   /* For now this is limited to OF based platforms. */
+   if (!np)
+   return 0;
+
+   if (data && data->pd_names) {
+   num_pds = data->num_pd_names;
+   by_id = false;
+   } else {
+   num_pds = of_count_phandle_with_args(np, "power-domains",
+"#power-domain-cells");
+   }
+
+   if (num_pds <= 0)
+   return 0;
+
+   pds = devm_kzalloc(dev, sizeof(*pds), GFP_KERNEL);
+   if (!pds)
+   return -ENOMEM;
+
+   pds->pd_devs = devm_kcalloc(dev, num_pds, sizeof(*pds->pd_devs),
+   GFP_KERNEL);
+   if (!pds->pd_devs)
+   return -ENOMEM;
+
+   pds->pd_links = devm_kcalloc(dev, num_pds, sizeof(*pds->pd_links),
+GFP_KERNEL);
+   if (!pds->pd_links)
+   return -ENOMEM;
+
+   if (link_flags && pd_flags & PD_FLAG_DEV_LINK_ON)
+   link_flags |= DL_FLAG_RPM_ACTIVE;
+
+   for (i = 0; i < num_pds; i++) {
+   if (by_id)
+   pd_dev = dev_pm_domain_attach_by_id(dev, i);
+   else
+   pd_dev = dev_pm_domain_attach_by_name(dev,
+   data->pd_names[i]);
+   if (IS_ERR_OR_NULL(pd_dev)) {
+   ret = pd_dev ? PTR_ERR(pd_dev) : -ENODEV;
+   goto err_attach;
+   }
+
+   if (link_flags) {
+   struct device_link *link;
+
+   link = device_link_add(dev, pd_dev, link_flags);
+   if (!link) {
+   ret = -ENODEV;
+   goto err_link;
+   }
+
+   pds->pd_links[i] = link;
+   }
+
+   pds->pd_devs[i] = pd_dev;
+   }
+
+   pds->num_pds = num_pds;
+   *list = pds;
+   return num_pds;
+
+err_link:
+   dev_pm_domain_detach(pd_dev, true);
+err_attach:
+   while (--i >= 0) {
+   if (pds->pd_links[i])
+   device_link_del(pds->pd_links[i]);
+   dev_pm_domain_detach(pds->pd_devs[i], true);
+   }
+   re

[PATCH v2 0/5] PM: domains: Add helpers for multi PM domains to avoid open-coding

2024-01-05 Thread Ulf Hansson
Updates in v2:
- Ccing Daniel Baluta and Iuliana Prodan the NXP remoteproc patches to
requests help with testing.
- Fixed NULL pointer bug in patch1, pointed out by Nikunj.
- Added some tested/reviewed-by tags.


Attaching/detaching of a device to multiple PM domains has started to become a
common operation for many drivers, typically during ->probe() and ->remove().
In most cases, this has led to lots of boilerplate code in the drivers.

This series adds a pair of helper functions to manage the attach/detach of a
device to its multiple PM domains. Moreover, a couple of drivers have been
converted to use the new helpers as a proof of concept.

Note 1)
The changes in the drivers have only been compile tested, while the helpers
have been tested along with a couple of local dummy drivers that I have hacked
up to model both genpd providers and genpd consumers.

Note 2)
I was struggling to make up my mind if we should have a separate helper to attach
all available power-domains described in DT, rather than providing "NULL" to the
dev_pm_domain_attach_list(). I decided not to, but please let me know if you
prefer the other option.

Note 3)
For OPP integration, as a follow up I am striving to make the
dev_pm_opp_attach_genpd() redundant. Instead I think we should move towards
using dev_pm_opp_set_config()->_opp_set_required_devs(), which would allow us to
use the helpers that $subject series is adding.

Kind regards
Ulf Hansson


Ulf Hansson (5):
  PM: domains: Add helper functions to attach/detach multiple PM domains
  remoteproc: imx_dsp_rproc: Convert to
dev_pm_domain_attach|detach_list()
  remoteproc: imx_rproc: Convert to dev_pm_domain_attach|detach_list()
  remoteproc: qcom_q6v5_adsp: Convert to
dev_pm_domain_attach|detach_list()
  media: venus: Convert to dev_pm_domain_attach|detach_list() for vcodec

 drivers/base/power/common.c   | 134 +++
 drivers/media/platform/qcom/venus/core.c  |  12 +-
 drivers/media/platform/qcom/venus/core.h  |   7 +-
 .../media/platform/qcom/venus/pm_helpers.c|  48 ++
 drivers/remoteproc/imx_dsp_rproc.c|  82 +
 drivers/remoteproc/imx_rproc.c|  73 +---
 drivers/remoteproc/qcom_q6v5_adsp.c   | 160 --
 include/linux/pm_domain.h |  38 +
 8 files changed, 289 insertions(+), 265 deletions(-)

-- 
2.34.1



Re: [PATCH net-next 5/6] net: introduce page_frag_cache_drain()

2024-01-05 Thread Alexander H Duyck
On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> When draining a page_frag_cache, most users are doing
> similar steps, so introduce an API to avoid code
> duplication.
> 
> Signed-off-by: Yunsheng Lin 
> Acked-by: Jason Wang 

Looks good to me.

Reviewed-by: Alexander Duyck 




Re: [PATCH net-next 2/6] page_frag: unify gfp bits for order 3 page allocation

2024-01-05 Thread Alexander H Duyck
On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> Currently there seem to be three page frag implementations
> which all try to allocate an order-3 page; if that fails, they
> then fall back to allocating an order-0 page, and each of them
> allows the order-3 page allocation to fail under certain
> conditions by using specific gfp bits.
> 
> The gfp bits for order-3 page allocation differ
> between the implementations: __GFP_NOMEMALLOC is
> or'd to forbid access to emergency reserve memory for
> __page_frag_cache_refill(), but it is not or'd in the other
> implementations; __GFP_DIRECT_RECLAIM is masked off to avoid
> direct reclaim in skb_page_frag_refill(), but it is not
> masked off in __page_frag_cache_refill().
> 
> This patch unifies the gfp bits used between the different
> implementations by or'ing __GFP_NOMEMALLOC and masking off
> __GFP_DIRECT_RECLAIM for order-3 page allocation to avoid
> possible pressure on mm.
> 
> Signed-off-by: Yunsheng Lin 
> CC: Alexander Duyck 
> ---
>  drivers/vhost/net.c | 2 +-
>  mm/page_alloc.c | 4 ++--
>  net/core/sock.c | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index f2ed7167c848..e574e21cc0ca 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz,
>   /* Avoid direct reclaim but allow kswapd to wake */
>   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> __GFP_COMP | __GFP_NOWARN |
> -   __GFP_NORETRY,
> +   __GFP_NORETRY | __GFP_NOMEMALLOC,
> SKB_FRAG_PAGE_ORDER);
>   if (likely(pfrag->page)) {
>   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9a16305cf985..1f0b36dd81b5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4693,8 +4693,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
>   gfp_t gfp = gfp_mask;
>  
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> - gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
> - __GFP_NOMEMALLOC;
> + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> +__GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>   page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>   PAGE_FRAG_CACHE_MAX_ORDER);
>   nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 446e945f736b..d643332c3ee5 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2900,7 +2900,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>   /* Avoid direct reclaim but allow kswapd to wake */
>   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> __GFP_COMP | __GFP_NOWARN |
> -   __GFP_NORETRY,
> +   __GFP_NORETRY | __GFP_NOMEMALLOC,
> SKB_FRAG_PAGE_ORDER);
>   if (likely(pfrag->page)) {
>   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;

Looks fine to me.

One thing you may want to consider would be to place this all in an
inline function that could just consolidate all the code.
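
Something like the following sketch, say in a common header. The helper name
is hypothetical and this is only meant to show the shape of the
consolidation, not a definitive implementation:

/* Hypothetical consolidated helper for the order-3 page frag allocation:
 * avoid direct reclaim, retries and emergency reserves, but still allow
 * kswapd to be woken.
 */
static inline gfp_t page_frag_gfp_mask(gfp_t gfp)
{
	return (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
	       __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
}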

Reviewed-by: Alexander Duyck 




Re: [RFC PATCH 4/5] regulator: add 88pm88x regulators driver

2024-01-05 Thread Mark Brown
On Thu, Dec 28, 2023 at 10:39:13AM +0100, Karel Balej wrote:

> @@ -68,6 +68,21 @@ static struct mfd_cell pm886_devs[] = {
>   .num_resources = ARRAY_SIZE(pm88x_onkey_resources),
>   .resources = pm88x_onkey_resources,
>   },
> + {
> + .name = "88pm88x-regulator",
> + .id = PM88X_REGULATOR_ID_LDO2,
> + .of_compatible = "marvell,88pm88x-regulator",
> + },

Why are we adding an of_compatible here?  It's redundant, the MFD split
is a feature of Linux internals not of the hardware, and the existing
88pm8xx MFD doesn't use them.




Re: [PATCH v1 2/5] livepatch: Add klp-convert tool

2024-01-05 Thread Petr Mladek
On Mon 2023-11-06 17:25:10, Lukas Hruska wrote:
> Livepatches need to access external symbols which can't be handled
> by the normal relocation mechanism. It is needed for two types
> of symbols:
> 
> --- /dev/null
> +++ b/scripts/livepatch/klp-convert.c
> @@ -0,0 +1,283 @@
[...]
> +/*
> + * Formats name of klp rela symbol based on another given section (@oldsec)
> + * and object (@obj_name) name, then returns it
> + */
> +static char *alloc_klp_rela_name(struct section *oldsec,
> + char *target_objname, struct elf *klp_elf)
> +{

Nit: Please, use @lp_obj_name instead of @target_objname.
 It would make it consistent with the caller:

klp_rela_name = alloc_klp_rela_name(oldsec, lp_obj_name, klp_elf);

and also with

#define KLP_RELOC_SYMBOL_POS(LP_OBJ_NAME, SYM_OBJ_NAME, SYM_NAME, SYM_POS)


> + char *klp_rela_name;
> + unsigned int length;
> + int err;

Just for the record. The name "lp_obj_name" came from a discussion between
me and Lukas about the KLP_RELOC_SYMBOL_POS() parameter names.

The macro has two object name parameters. And LP_OBJ_NAME aka
LivePatched_OBJ_NAME looked better than TARGET_OBJ_NAME.
The name "TARGET" is too ambiguous to me.

Best Regards,
Petr



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-05 Thread Steven Rostedt
On Fri, 5 Jan 2024 15:26:28 +0100
Christian Brauner  wrote:

> On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > Instead of walking the dentries on mount/remount to update the gid values of
> > all the dentries if a gid option is specified on mount, just update the root
> > inode. Add .getattr, .setattr, and .permissions on the tracefs inode
> > operations to update the permissions of the files and directories.
> > 
> > For all files and directories in the top level instance:
> > 
> >  /sys/kernel/tracing/*
> > 
> > It will use the root inode as the default permissions. The inode that
> > represents: /sys/kernel/tracing (or wherever it is mounted).
> > 
> > When an instance is created:
> > 
> >  mkdir /sys/kernel/tracing/instance/foo
> > 
> > The directory "foo" and all its files and directories underneath will use
> > the default of what foo is when it was created. A remount of tracefs will
> > not affect it.  
> 
> That kinda sounds like eventfs should actually be a separate filesystem.
> But I don't know enough about the relationship between the two concepts.

It may someday become one, as eventfs is used by perf where the rest of the
tracefs system is not.

> 
> > 
> > If a user were to modify the permissions of any file or directory in
> > tracefs, it will also no longer be modified by a change in ownership of a
> > remount.  
> 
> Very odd semantics and I would recommend to avoid that. It's just plain
> weird imo.
> 
> > 
> > The events directory, if it is in the top level instance, will use the
> > tracefs root inode as the default ownership for itself and all the files and
> > directories below it.
> > 
> > For the events directory in an instance ("foo"), it will keep the ownership
> > of what it was when it was created, and that will be used as the default
> > ownership for the files and directories beneath it.
> > 
> > Link: 
> > https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/
> > 
> > Signed-off-by: Steven Rostedt (Google) 
> > ---  
> 
> So tracefs supports remounting with different uid/gid mount options and
> then actually wades through _all_ of the inodes and changes their
> ownership internally? What's the use-case for this? Containers?

No, in fact tracing doesn't work well with containers as tracing is global
to the entire machine. It can work with privileged containers though.

The reason for this is because tracefs was based off of debugfs, where the
files and directories are created at boot up and mounted later. The reason
to do this was to allow users to mount with gid=GID to allow a given group
to have access to tracing. Without this update, tracefs would ignore it
like debugfs and proc do today.

I think its time I explain the purpose of tracefs and how it came to be.

The tracing system required a way to control tracing and read the traces.
It could have just used a new system like perf (although
/sys/kernel/debug/tracing predates perf), where it created a single ioctl()
like system call do do everything.

As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev
tracer, which both have an embedded background, I chose an interface that
could work with just an unmodified version of busybox. That is, I wanted it
to work with just cat and echo.

The main difference with tracefs compared to other file systems is that it
is a control interface, where writes happen as much as reads. The data read
is controlled. The closest thing I can think of is how cgroups work.

As tracing is a privileged operation, but something that could be changed
to allow a group to have access to, I wanted to make it easy for an admin
to decide who gets to do what at boot up via the /etc/fstab file.

> 
> Aside from optimizing this and the special semantics for this eventfs
> stuff that you really should think twice of doing, here's one idea for
> an extension that might alleviate some of the pain:
> 
> If you need flexible dynamic ownership change to e.g., be able to
> delegate (all, a directory, a single file of) tracefs to
> unprivileged/containers/whatever then you might want to consider
> supporting idmapped mounts for tracefs. Because then you can do stuff
> like:

I'm a novice here and have no idea on how id maps work ;-)

> 
> user1@localhost:~/data/scripts$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt
> user1@localhost:~/data/scripts$ ls -ln /run/
> total 12
> drwxr-xr-x  2 0  0   40 Jan  5 12:12 credentials
> drwx--  2 0  0   40 Jan  5 11:57 cryptsetup
> drwxr-xr-x  2 0  0   60 Jan  5 11:57 dbus
> drwx--  6 0  0  280 Jan  5 11:57 incus_agent
> prw---  1 0  00 Jan  5 11:57 initctl
> drwxrwxrwt  4 0  0   80 Jan  5 11:57 lock
> drwxr-xr-x  3 0  0   60 Jan  5 11:57 log
> drwx--  2 0  0   40 Jan  5 11:57 lvm
> -r--r--r--  1 0  0   33 Jan  5 11:57 machine-id
> -rw-r--r--  1 0  0  101 Jan  5 11:58 motd.dynamic
> drwxr-xr-x  

[PATCH 2/2] arm64: dts: qcom: sm7225-fairphone-fp4: Add PM6150L thermals

2024-01-05 Thread Luca Weiss
Configure the thermals for the PA_THERM1, MSM_THERM, PA_THERM0,
RFC_CAM_THERM, CAM_FLASH_THERM and QUIET_THERM thermistors connected to
PM6150L.

Due to hardware constraints we can only register 4 zones with
pm6150l_adc_tm; the other 2 we can register via generic-adc-thermal.

The trip points can really only be considered placeholders; more
configuration with cooling etc. can be added later.

Signed-off-by: Luca Weiss 
---
 arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts | 191 ++
 1 file changed, 191 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts b/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts
index b7ccfe4011bb..6f435a7ed855 100644
--- a/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts
+++ b/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts
@@ -84,6 +84,20 @@ memory@efe01000 {
};
};
 
+   msm_therm_sensor: thermal-sensor-msm {
+   compatible = "generic-adc-thermal";
+   #thermal-sensor-cells = <0>;
+   io-channels = <&pm6150l_adc ADC5_AMUX_THM2_100K_PU>;
+   io-channel-names = "sensor-channel";
+   };
+
+   rear_cam_sensor: thermal-sensor-rear-cam {
+   compatible = "generic-adc-thermal";
+   #thermal-sensor-cells = <0>;
+   io-channels = <&pm6150l_adc ADC5_GPIO2_100K_PU>;
+   io-channel-names = "sensor-channel";
+   };
+
thermal-zones {
chg-skin-thermal {
polling-delay-passive = <0>;
@@ -113,6 +127,90 @@ active-config0 {
};
};
 
+   pa0-thermal {
+   polling-delay-passive = <0>;
+   polling-delay = <0>;
+   thermal-sensors = <&pm6150l_adc_tm 1>;
+
+   trips {
+   active-config0 {
+   temperature = <125000>;
+   hysteresis = <1000>;
+   type = "passive";
+   };
+   };
+   };
+
+   pa1-thermal {
+   polling-delay-passive = <0>;
+   polling-delay = <0>;
+   thermal-sensors = <&pm6150l_adc_tm 0>;
+
+   trips {
+   active-config0 {
+   temperature = <125000>;
+   hysteresis = <1000>;
+   type = "passive";
+   };
+   };
+   };
+
+   quiet-thermal {
+   polling-delay-passive = <0>;
+   polling-delay = <0>;
+   thermal-sensors = <&pm6150l_adc_tm 3>;
+
+   trips {
+   active-config0 {
+   temperature = <125000>;
+   hysteresis = <1000>;
+   type = "passive";
+   };
+   };
+   };
+
+   rear-cam-thermal {
+   polling-delay-passive = <1000>;
+   polling-delay = <5000>;
+   thermal-sensors = <&rear_cam_sensor>;
+
+   trips {
+   active-config0 {
+   temperature = <125000>;
+   hysteresis = <1000>;
+   type = "passive";
+   };
+   };
+   };
+
+   rfc-flash-thermal {
+   polling-delay-passive = <0>;
+   polling-delay = <0>;
+   thermal-sensors = <&pm6150l_adc_tm 2>;
+
+   trips {
+   active-config0 {
+   temperature = <125000>;
+   hysteresis = <1000>;
+   type = "passive";
+   };
+   };
+   };
+
+   sdm-skin-thermal {
+   polling-delay-passive = <1000>;
+   polling-delay = <5000>;
+   thermal-sensors = <&msm_therm_sensor>;
+
+   trips {
+   active-config0 {
+   temperature = <125000>;
+   hysteresis = <1000>;
+   type = "passive";
+   };
+   };
+   };
+
xo-thermal {
polling-delay-passive = <0>;
 

[PATCH 1/2] arm64: dts: qcom: sm7225-fairphone-fp4: Add PMK8003 thermals

2024-01-05 Thread Luca Weiss
Configure the thermals for the XO_THERM thermistor connected to the
PMK8003 (which is called PMK8350 in software).

The ADC configuration for PMK8350_ADC7_AMUX_THM1_100K_PU has already
been added in the past.

The trip points can really only be considered placeholders; more
configuration with cooling etc. can be added later.

Signed-off-by: Luca Weiss 
---
 arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts | 25 +++
 1 file changed, 25 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts b/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts
index ade619805519..b7ccfe4011bb 100644
--- a/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts
+++ b/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts
@@ -112,6 +112,20 @@ active-config0 {
};
};
};
+
+   xo-thermal {
+   polling-delay-passive = <0>;
+   polling-delay = <0>;
+   thermal-sensors = <&pmk8350_adc_tm 0>;
+
+   trips {
+   active-config0 {
+   temperature = <125000>;
+   hysteresis = <1000>;
+   type = "passive";
+   };
+   };
+   };
};
 };
 
@@ -490,6 +504,17 @@ conn-therm@1 {
};
 };
 
+&pmk8350_adc_tm {
+   status = "okay";
+
+   xo-therm@0 {
+   reg = <0>;
+   io-channels = <&pmk8350_vadc PMK8350_ADC7_AMUX_THM1_100K_PU>;
+   qcom,ratiometric;
+   qcom,hw-settle-time-us = <200>;
+   };
+};
+
 &pmk8350_rtc {
status = "okay";
 };

-- 
2.43.0




[PATCH 0/2] More thermal configuration for Fairphone 4

2024-01-05 Thread Luca Weiss
Add the thermal configuration for the thermistors connected to PMK8003
and PM6150L. With that all the external thermistors on the phone should
be present in the dts.

Signed-off-by: Luca Weiss 
---
Luca Weiss (2):
  arm64: dts: qcom: sm7225-fairphone-fp4: Add PMK8003 thermals
  arm64: dts: qcom: sm7225-fairphone-fp4: Add PM6150L thermals

 arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts | 216 ++
 1 file changed, 216 insertions(+)
---
base-commit: 50c107ee601a3d57e4fb13a9ecd9cf6c8df187f2
change-id: 20240105-fp4-thermals-f3c8e0417b70

Best regards,
-- 
Luca Weiss 




Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2024-01-05 Thread Mikko Ylinen
On Thu, Jan 04, 2024 at 01:11:15PM -0600, Haitao Huang wrote:
> Hi Dave,
> 
> On Wed, 03 Jan 2024 10:37:35 -0600, Dave Hansen 
> wrote:
> 
> > On 12/18/23 13:24, Haitao Huang wrote:> @Dave and @Michal, Your
> > thoughts? Or could you confirm we should not
> > > do reclaim per cgroup at all?
> > What's the benefit of doing reclaim per cgroup?  Is that worth the extra
> > complexity?
> > 
> 
> Without reclaiming per cgroup, we have to always set the limit to the
> enclave's peak usage. This may not be efficient utilization, as in many cases
> each enclave can perform fine with the EPC limit set below its peak. Basically
> each group cannot give up some pages for the greater good without dying :-)

+1. this is exactly my thinking too. The per cgroup reclaiming is
important for the containers use case we are working on. I also think
it makes the limit more meaningful: the per-container pool of EPC pages
to use (which is independent of the enclave size).

> 
> Also, with enclaves enabled with EDMM, the peak usage is not static, so it is
> hard to determine upfront. Hence it might be an operation/deployment
> inconvenience.
> 
> In case of over-committing (sum of limits > total capacity), one cgroup at
> peak usage may require swapping pages out in a different cgroup if system is
> overloaded at that time.
> 
> > The key question here is whether we want the SGX VM to be complex and
> > more like the real VM or simple when a cgroup hits its limit.  Right?
> > 
> 
> Although it's fair to say the majority of the complexity of this series is in
> support for reclaiming per cgroup, I think it's manageable and much less
> than the real VM after we removed the enclave-killing parts: the only extra
> effort is to track pages in separate lists and reclaim them separately, as
> opposed to tracking them on one global list and reclaiming together. The main
> reclaiming loop code is still pretty much the same as before.
> 
> 
> > If stopping at patch 5 and having less code is even remotely an option,
> > why not do _that_?
> > 
> I hope I described the limitations clearly enough above.
> If those are OK with users and also make it acceptable for merge quickly,

You explained the gaps very well already. I don't think the simple
version without per-cgroup reclaiming is enough for the container case.

Mikko



Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership

2024-01-05 Thread Christian Brauner
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> Instead of walking the dentries on mount/remount to update the gid values of
> all the dentries if a gid option is specified on mount, just update the root
> inode. Add .getattr, .setattr, and .permissions on the tracefs inode
> operations to update the permissions of the files and directories.
> 
> For all files and directories in the top level instance:
> 
>  /sys/kernel/tracing/*
> 
> It will use the root inode as the default permissions. The inode that
> represents: /sys/kernel/tracing (or wherever it is mounted).
> 
> When an instance is created:
> 
>  mkdir /sys/kernel/tracing/instance/foo
> 
> The directory "foo" and all its files and directories underneath will use
> the default of what foo is when it was created. A remount of tracefs will
> not affect it.

That kinda sounds like eventfs should actually be a separate filesystem.
But I don't know enough about the relationship between the two concepts.

> 
> If a user were to modify the permissions of any file or directory in
> tracefs, it will also no longer be modified by a change in ownership of a
> remount.

Very odd semantics and I would recommend to avoid that. It's just plain
weird imo.

> 
> The events directory, if it is in the top level instance, will use the
> tracefs root inode as the default ownership for itself and all the files and
> directories below it.
> 
> For the events directory in an instance ("foo"), it will keep the ownership
> of what it was when it was created, and that will be used as the default
> ownership for the files and directories beneath it.
> 
> Link: 
> https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/
> 
> Signed-off-by: Steven Rostedt (Google) 
> ---

So tracefs supports remounting with different uid/gid mount options and
then actually wades through _all_ of the inodes and changes their
ownership internally? What's the use-case for this? Containers?

Aside from optimizing this and the special semantics for this eventfs
stuff that you really should think twice of doing, here's one idea for
an extension that might alleviate some of the pain:

If you need flexible dynamic ownership change to e.g., be able to
delegate (all, a directory, a single file of) tracefs to
unprivileged/containers/whatever then you might want to consider
supporting idmapped mounts for tracefs. Because then you can do stuff
like:

user1@localhost:~/data/scripts$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt
user1@localhost:~/data/scripts$ ls -ln /run/
total 12
drwxr-xr-x  2 0  0   40 Jan  5 12:12 credentials
drwx--  2 0  0   40 Jan  5 11:57 cryptsetup
drwxr-xr-x  2 0  0   60 Jan  5 11:57 dbus
drwx--  6 0  0  280 Jan  5 11:57 incus_agent
prw---  1 0  00 Jan  5 11:57 initctl
drwxrwxrwt  4 0  0   80 Jan  5 11:57 lock
drwxr-xr-x  3 0  0   60 Jan  5 11:57 log
drwx--  2 0  0   40 Jan  5 11:57 lvm
-r--r--r--  1 0  0   33 Jan  5 11:57 machine-id
-rw-r--r--  1 0  0  101 Jan  5 11:58 motd.dynamic
drwxr-xr-x  2 0  0   40 Jan  5 11:57 mount
drwx--  2 0  0   40 Jan  5 11:57 multipath
drwxr-xr-x  2 0  0   40 Jan  5 11:57 sendsigs.omit.d
lrwxrwxrwx  1 0  08 Jan  5 11:57 shm -> /dev/shm
drwx--x--x  2 0  0   40 Jan  5 11:57 sudo
drwxr-xr-x 24 0  0  660 Jan  5 14:30 systemd
drwxr-xr-x  6 0  0  140 Jan  5 14:30 udev
drwxr-xr-x  4 0  0   80 Jan  5 11:58 user
-rw-rw-r--  1 0 43 2304 Jan  5 15:15 utmp

user1@localhost:~/data/scripts$ ls -ln /mnt/
total 12
drwxr-xr-x  2 1234  1000   40 Jan  5 12:12 credentials
drwx--  2 1234  1000   40 Jan  5 11:57 cryptsetup
drwxr-xr-x  2 1234  1000   60 Jan  5 11:57 dbus
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 incus_agent
prw---  1 1234  10000 Jan  5 11:57 initctl
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 lock
drwxr-xr-x  3 1234  1000   60 Jan  5 11:57 log
drwx--  2 1234  1000   40 Jan  5 11:57 lvm
-r--r--r--  1 1234  1000   33 Jan  5 11:57 machine-id
-rw-r--r--  1 1234  1000  101 Jan  5 11:58 motd.dynamic
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 mount
drwx--  2 1234  1000   40 Jan  5 11:57 multipath
drwxr-xr-x  2 1234  1000   40 Jan  5 11:57 sendsigs.omit.d
lrwxrwxrwx  1 1234  10008 Jan  5 11:57 shm -> /dev/shm
drwx--x--x  2 1234  1000   40 Jan  5 11:57 sudo
drwxr-xr-x 24 1234  1000  660 Jan  5 14:30 systemd
drwxr-xr-x  6 1234  1000  140 Jan  5 14:30 udev
drwxr-xr-x  4 1234  1000   80 Jan  5 11:58 user
-rw-rw-r--  1 1234 65534 2304 Jan  5 15:15 utmp

Where you can see that ownership of this tmpfs instance in this example
is changed. I'm not trying to advocate here but this will probably
ultimately be nicer for your users because it means that a container
manager or whatever can be handed a part of tracefs (or all of it) and
the ownership and access rights for that thing is correct. And you can
get rid of that gid based access completely.

You can change uids, gids, or both. You can spe
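
For completeness, here is a rough user-space sketch of creating an idmapped
mount programmatically, equivalent in spirit to the mount command above. It
assumes a kernel with mount_setattr(2) (5.12+), libc headers that expose the
syscall numbers, and an existing user namespace fd carrying the desired
mapping (the pid 1234 below is a placeholder); untested, illustration only:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* Clone /run into a detached mount. */
	int tree = syscall(SYS_open_tree, AT_FDCWD, "/run",
			   OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (tree < 0)
		return 1;

	/* A user namespace whose mapping defines the idmapping. */
	int userns = open("/proc/1234/ns/user", O_RDONLY | O_CLOEXEC);
	if (userns < 0)
		return 1;

	struct mount_attr attr = {
		.attr_set = MOUNT_ATTR_IDMAP,
		.userns_fd = userns,
	};
	if (syscall(SYS_mount_setattr, tree, "", AT_EMPTY_PATH,
		    &attr, sizeof(attr)) < 0)
		return 1;

	/* Attach the idmapped mount at /mnt. */
	if (syscall(SYS_move_mount, tree, "", AT_FDCWD, "/mnt",
		    MOVE_MOUNT_F_EMPTY_PATH) < 0)
		return 1;

	return 0;
}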

Re: [PATCH v1 5/5] documentation: Update on livepatch elf format

2024-01-05 Thread Petr Mladek
On Mon 2023-11-06 17:25:13, Lukas Hruska wrote:
> Add a section to Documentation/livepatch/module-elf-format.rst
> describing how klp-convert works for fixing relocations.
> 
> Signed-off-by: Lukas Hruska 

Looks good to me:

Reviewed-by: Petr Mladek 

Best Regards,
Petr



Re: [PATCH v1 4/5] livepatch: Add sample livepatch module

2024-01-05 Thread Petr Mladek
On Mon 2023-11-06 17:25:12, Lukas Hruska wrote:
> From: Josh Poimboeuf 
> 
> Add a new livepatch sample in samples/livepatch/ to make use of symbols
> that must be post-processed to enable load-time relocation resolution.
> As the new sample is to be used as an example, it is annotated with
> KLP_RELOC_SYMBOL macro.
> 
> The livepatch sample updates the function cmdline_proc_show to print the
> string referenced by the symbol saved_command_line appended by the
> string "livepatch=1".
> 
> Signed-off-by: Josh Poimboeuf 
> Signed-off-by: Lukas Hruska 
> ---
>  samples/livepatch/Makefile|  1 +
>  .../livepatch/livepatch-annotated-sample.c| 84 +++

The name is ambiguous. I would use something like livepatch-extern-symbol.c

Also it would be great to prepare a selftest. In this case, I would
suggest to livepatch a symbol from another test module so that
it does not modify the running system and the result is predictable.

Otherwise it looks good. With a better module name:

Reviewed-by: Petr Mladek 

Best Regards,
Petr



Re: [PATCH v1 3/5] kbuild/modpost: integrate klp-convert

2024-01-05 Thread Petr Mladek
On Mon 2023-11-06 17:25:11, Lukas Hruska wrote:
> From: Josh Poimboeuf 
> 
> Update the modpost program so that it does not warn about unresolved
> symbols matching the expected format which will be then resolved by
> klp-convert.

This is only one part.

The main change is that klp-convert is called for livepatch modules
after the final linking.

I would write something like:


Call klp-convert for the livepatch modules after the final linking.

Also update the modpost tool so that it does not warn about unresolved
symbols matching the expected format which will be then resolved by
klp-convert.


> Signed-off-by: Josh Poimboeuf 
> Signed-off-by: Lukas Hruska 

Otherwise the code looks good. With the updated commit message:

Reviewed-by: Petr Mladek 

Best Regards,
Petr



Re: [PATCH v1 2/5] livepatch: Add klp-convert tool

2024-01-05 Thread Petr Mladek
On Mon 2023-11-06 17:25:10, Lukas Hruska wrote:
> Livepatches need to access external symbols which can't be handled
> by the normal relocation mechanism. It is needed for two types
> of symbols:
> 
>   + Symbols which can be local for the original livepatched function.
> The alternative implementation in the livepatch sees them
> as external symbols.
> 
>   + Symbols in modules which are exported via EXPORT_SYMBOL*(). They
> must be handled special way otherwise the livepatch module would
> depend on the livepatched one. Loading such livepatch would cause
> loading the other module as well.
> 
> The address of these symbols can be found via kallsyms. Or they can 

Please, remove the extra space at the end of the line.

> be relocated using livepatch specific relocation sections as specified
> in Documentation/livepatch/module-elf-format.txt.
> 
> Currently, there is no trivial way to embed the required information as
> requested in the final livepatch elf object. klp-convert solves this
> problem by using annotations in the elf object to convert the relocation
> according to the specification, enabling it to be handled by the
> livepatch loader.
> 
> Given the above, create scripts/livepatch to hold tools developed for
> livepatches and add source files for klp-convert there.
> 
> Allow to annotate such external symbols in the livepatch by a macro
> KLP_RELOC_SYMBOL(). It will create symbol with all needed
> metadata. For example:
> 
>   extern char *saved_command_line \
>  KLP_RELOC_SYMBOL(vmlinux, vmlinux, saved_command_line, 0);
> 
> would create symbol
> 
> $>readelf -r -W :
> Relocation section '.rela.text' at offset 0x32e60 contains 10 entries:
> Offset Info Type   Symbol's Value  Symbol's Name + Addend
> [...]
> 0068  003c0002 R_X86_64_PC32   .klp.sym.rela.vmlinux.vmlinux.saved_command_line,0 - 4
> [...]
> 
> 
> Also add scripts/livepatch/klp-convert. The tool transforms symbols
> created by KLP_RELOC_SYMBOL() to object specific rela sections
> and rela entries which would later be proceed when the livepatch
> or the livepatched object is loaded.
> 
> For example, klp-convert would replace the above symbols with:

s/above symbols/above symbol/

> $> readelf -r -W 
> Relocation section '.klp.rela.vmlinux.text' at offset 0x5cb60 contains 1 entry:
> Offset Info Type   Symbol's Value  Symbol's Name + Addend
> 0068  003c0002 R_X86_64_PC32   .klp.sym.vmlinux.saved_command_line,0 - 4
> 
> klp-convert relies on libelf and on a list implementation. Add files
> scripts/livepatch/elf.c and scripts/livepatch/elf.h, which are a libelf
> interfacing layer and scripts/livepatch/list.h, which is a list
> implementation.
> 
> Update Makefiles to correctly support the compilation of the new tool,
> update MAINTAINERS file and add a .gitignore file.
>
> ---
>  MAINTAINERS |   1 +
>  include/linux/livepatch.h   |  19 +
>  scripts/Makefile|   1 +
>  scripts/livepatch/.gitignore|   1 +
>  scripts/livepatch/Makefile  |   5 +
>  scripts/livepatch/elf.c | 817 
>  scripts/livepatch/elf.h |  73 +++

I see a similar code in

tools/objtool/elf.c
tools/objtool/include/objtool/elf.h

Both variants have been written by Josh. I wonder if we could share
one implementation. Josh?

>  scripts/livepatch/klp-convert.c | 283 +++
>  scripts/livepatch/klp-convert.h |  42 ++
>  scripts/livepatch/list.h| 391 +++

And probably also the list.h

>  10 files changed, 1633 insertions(+)
>  create mode 100644 scripts/livepatch/.gitignore
>  create mode 100644 scripts/livepatch/Makefile
>  create mode 100644 scripts/livepatch/elf.c
>  create mode 100644 scripts/livepatch/elf.h
>  create mode 100644 scripts/livepatch/klp-convert.c
>  create mode 100644 scripts/livepatch/klp-convert.h
>  create mode 100644 scripts/livepatch/list.h
> 
> --- /dev/null
> +++ b/scripts/livepatch/klp-convert.c
> @@ -0,0 +1,283 @@
[...]
> +/* Converts rela symbol names */
> +static bool convert_symbol(struct symbol *s)
> +{
> + char lp_obj_name[MODULE_NAME_LEN];
> + char sym_obj_name[MODULE_NAME_LEN];
> + char sym_name[KSYM_NAME_LEN];
> + char *klp_sym_name;
> + unsigned long sym_pos;
> + int poslen;
> + unsigned int length;
> +
> + static_assert(MODULE_NAME_LEN <= 56, "Update limit in the below sscanf()");

IMHO, there should be "< 56" instead of "<= 56". The sscanf is limited by %55.

Also we should check KSYM_NAME_LEN. Similar to the check in klp_resolve_symbols()

static_assert(MODULE_NAME_LEN < 56 || KSYM_NAME_LEN != 512,
  "Update limit in the below sscanf()");

> +
> + if (sscanf(s->name, KLP_SYM_RELA_PREFIX "%55[^.].%55[^.].%511[^,],%lu",
> +   

[PATCH v9] bus: mhi: host: Add tracing support

2024-01-05 Thread Krishna chaitanya chundru
This change adds ftrace support for the following functions, which
helps in debugging issues when there is a channel state & MHI
state change and also when we receive data and control events:
1. mhi_intvec_mhi_states
2. mhi_process_data_event_ring
3. mhi_process_ctrl_ev_ring
4. mhi_gen_tre
5. mhi_update_channel_state
6. mhi_tryset_pm_state
7. mhi_pm_st_worker

Change the implementation of the arrays which map enums to strings
to make it consistent between the trace header file and other files.

Wherever the trace events are added, debug messages are removed.

Signed-off-by: Krishna chaitanya chundru 
Reviewed-by: "Steven Rostedt (Google)" 
---
Changes in v9:
- Change the implementation of some arrays so that the string-to-enum mapping
  is the same in both the trace header and other files, as suggested by Steve.
- Link to v8: https://lore.kernel.org/r/20231207-ftrace_support-v8-1-7f62d4558...@quicinc.com

Changes in v8:
- Pass the structure and dereference the variables in TP_fast_assign, as suggested by Steve.
- Link to v7: https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com

Changes in v7:
- Change log format as pointed out by Mani.
- Link to v6: https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com

Changes in v6:
- Use 'rp' directly as suggested by Jeffrey.
- Link to v5: https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com

Changes in v5:
- Use DECLARE_EVENT_CLASS for multiple events as suggested by Steve.
- Instead of converting to u64 to print the address, use %px to print the
  address to avoid warnings on some platforms.
- Link to v4: https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com

Changes in v4:
- Fix compilation issues in the previous patch which happened due to rebasing.
- In the defconfig the FTRACE config is not enabled, due to which the
  compilation issue was not seen in my workspace.
- Link to v3: https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com

Changes in v3:
- Move the trace header file from include/trace/events to
  drivers/bus/mhi/host/ so that we can include driver header files.
- Use macros directly in the trace events as suggested by Jeffrey Hugo.
- Reorder the structures in the events as suggested by Steve to avoid holes
  in the buffer.
- Removed the mhi_to_physical function as this can cause security issues.
- Removed macros to define strings as we can get those from driver headers.
- Link to v2: https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com

Changes in v2:
- Passing the raw state into the trace event and using __print_symbolic() as
  suggested by Bjorn.
- Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by Bjorn.
- Fixed the kernel test robot issues.
- Link to v1: https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com
---
 drivers/bus/mhi/common.h|  38 +++---
 drivers/bus/mhi/host/init.c |  63 +
 drivers/bus/mhi/host/internal.h |  40 ++
 drivers/bus/mhi/host/main.c |  19 ++-
 drivers/bus/mhi/host/pm.c   |   7 +-
 drivers/bus/mhi/host/trace.h| 275 
 6 files changed, 378 insertions(+), 64 deletions(-)

diff --git a/drivers/bus/mhi/common.h b/drivers/bus/mhi/common.h
index f794b9c8049e..dda340aaed95 100644
--- a/drivers/bus/mhi/common.h
+++ b/drivers/bus/mhi/common.h
@@ -297,30 +297,30 @@ struct mhi_ring_element {
__le32 dword[2];
 };
 
+#define MHI_STATE_LIST \
+   mhi_state(RESET,"RESET")\
+   mhi_state(READY,"READY")\
+   mhi_state(M0,   "M0")   \
+   mhi_state(M1,   "M1")   \
+   mhi_state(M2,   "M2")   \
+   mhi_state(M3,   "M3")   \
+   mhi_state(M3_FAST,  "M3_FAST")  \
+   mhi_state(BHI,  "BHI")  \
+   mhi_state_end(SYS_ERR,  "SYS ERROR")
+
+#undef mhi_state
+#undef mhi_state_end
+
+#define mhi_state(a, b)case MHI_STATE_##a: return b;
+#define mhi_state_end(a, b)case MHI_STATE_##a: return b;
+
 static inline const char *mhi_state_str(enum mhi_state state)
 {
switch (state) {
-   case MHI_STATE_RESET:
-   return "RESET";
-   case MHI_STATE_READY:
-   return "READY";
-   case MHI_STATE_M0:
-   return "M0";
-   case MHI_STATE_M1:
-   return "M1";
-   case MHI_STATE_M2:
-   return "M2";
-   case MHI_STATE_M3:
-   return "M3";
-   case MHI_STATE_M3_FAST:
-   return "M3 FAST";
-   case MHI_STATE_BHI:
-   return "BHI";
-   case MHI_STATE_SYS_ERR:
-   return "SYS ERROR";
+   MHI_STATE_LIST
default:
return "Unknown state";
}
-};
+}
 
 #endif /* _MHI_COMMON_H */
diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
in

Re: [PATCH v8] bus: mhi: host: Add tracing support

2024-01-05 Thread Krishna Chaitanya Chundru



On 1/5/2024 10:47 AM, Baochen Qiang wrote:



On 1/4/2024 12:47 PM, Krishna Chaitanya Chundru wrote:

Hi Steven,

Can you please review this.

Thanks & Regards,

Krishna Chaitanya.

On 12/7/2023 10:00 AM, Krishna chaitanya chundru wrote:

This change adds ftrace support for the following functions, which
helps in debugging issues when there is a channel state & MHI
state change and also when we receive data and control events:
1. mhi_intvec_mhi_states
2. mhi_process_data_event_ring
3. mhi_process_ctrl_ev_ring
4. mhi_gen_tre
5. mhi_update_channel_state
6. mhi_tryset_pm_state
7. mhi_pm_st_worker

Wherever the trace events are added, debug messages are removed.
Is there a reason why debug messages have to be removed? From the view
of the MHI controller, we often need MHI logs to debug, so is it possible
to preserve those logs?


The debug messages are being replaced by trace events, and it would be
preferred to have one mechanism or the other, not both.


you will still see these logs in traces.

- Krishna Chaitanya.



Signed-off-by: Krishna chaitanya chundru 
---
Changes in v8:
- Pass the structure and dereference the variables in TP_fast_assign
as suggested by Steve
- Link to v7: 
https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com


Changes in v7:
- change log format as pointed by mani.
- Link to v6: 
https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com


Changes in v6:
- use 'rp' directly as suggested by jeffrey.
- Link to v5: 
https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com


Changes in v5:
- Use DECLARE_EVENT_CLASS for multiple events as suggested by steve.
- Instead of converting to u64 to print address, use %px to print 
the address to avoid

- warnings in some platforms.
- Link to v4: 
https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com


Changes in v4:
- Fix compilation issues in the previous patch which happened due to rebasing.
- In the defconfig the FTRACE config is not enabled, due to which the
compilation issue was not seen in my workspace.
- Link to v3: 
https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com


Changes in v3:
- move trace header file from include/trace/events to 
drivers/bus/mhi/host/ so that

- we can include driver header files.
- Use macros directly in the trace events as suggested Jeffrey Hugo.
- Reorder the structure in the events as suggested by steve to avoid 
holes in the buffer.
- removed the mhi_to_physical function as this can give security 
issues.
- removed macros to define strings as we can get those from driver 
headers.
- Link to v2: 
https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com


Changes in v2:
- Passing the raw state into the trace event and using 
__print_symbolic() as suggested by bjorn.
- Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by 
bjorn.

- Fixed the kernel test rebot issues.
- Link to v1: 
https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com

---
  drivers/bus/mhi/host/init.c  |   3 +
  drivers/bus/mhi/host/main.c  |  19 ++--
  drivers/bus/mhi/host/pm.c    |   7 +-
  drivers/bus/mhi/host/trace.h | 205 
+++

  4 files changed, 221 insertions(+), 13 deletions(-)

diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c
index f78aefd2d7a3..6acb85f4c5f8 100644
--- a/drivers/bus/mhi/host/init.c
+++ b/drivers/bus/mhi/host/init.c
@@ -20,6 +20,9 @@
  #include 
  #include "internal.h"
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
  static DEFINE_IDA(mhi_controller_ida);
  const char * const mhi_ee_str[MHI_EE_MAX] = {
diff --git a/drivers/bus/mhi/host/main.c b/drivers/bus/mhi/host/main.c
index dcf627b36e82..189f4786403e 100644
--- a/drivers/bus/mhi/host/main.c
+++ b/drivers/bus/mhi/host/main.c
@@ -15,6 +15,7 @@
  #include 
  #include 
  #include "internal.h"
+#include "trace.h"
  int __must_check mhi_read_reg(struct mhi_controller *mhi_cntrl,
    void __iomem *base, u32 offset, u32 *out)
@@ -491,11 +492,8 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)

  state = mhi_get_mhi_state(mhi_cntrl);
  ee = mhi_get_exec_env(mhi_cntrl);
-    dev_dbg(dev, "local ee: %s state: %s device ee: %s state: %s\n",
-    TO_MHI_EXEC_STR(mhi_cntrl->ee),
-    mhi_state_str(mhi_cntrl->dev_state),
-    TO_MHI_EXEC_STR(ee), mhi_state_str(state));
+    trace_mhi_intvec_states(mhi_cntrl, ee, state);
  if (state == MHI_STATE_SYS_ERR) {
  dev_dbg(dev, "System error detected\n");
  pm_state = mhi_tryset_pm_state(mhi_cntrl,
@@ -832,6 +830,8 @@ int mhi_process_ctrl_ev_ring(struct mhi_controller *mhi_cntrl,

  while (dev_rp != local_rp) {
  enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
+    trace_mhi_ctrl_event(mhi_cntrl, local_rp);
+
  switch (type) {
  case MHI_PKT_TYPE_BW_REQ_EVENT:
  {
@@ -997,6 +

[PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker

2024-01-05 Thread Hou Tao
From: Hou Tao 

When invoking virtio_fs_enqueue_req() through kworker, both the
allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
Considering the size of both the sg array and the bounce buffer may be
greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
possibility of memory allocation failure.

Signed-off-by: Hou Tao 
---
Change log:
v2:
  * pass gfp_t instead of bool to virtio_fs_enqueue_req() (Suggested by Matthew)

v1: 
https://lore.kernel.org/linux-fsdevel/20240104015805.2103766-1-hou...@huaweicloud.com

 fs/fuse/virtio_fs.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 3aac31d451985..8cf518624ce9e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -87,7 +87,8 @@ struct virtio_fs_req_work {
 };
 
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
-struct fuse_req *req, bool in_flight);
+struct fuse_req *req, bool in_flight,
+gfp_t gfp);
 
 static const struct constant_table dax_param_enums[] = {
{"always",  FUSE_DAX_ALWAYS },
@@ -383,7 +384,7 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
list_del_init(&req->list);
spin_unlock(&fsvq->lock);
 
-   ret = virtio_fs_enqueue_req(fsvq, req, true);
+   ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_NOFS);
if (ret < 0) {
if (ret == -ENOMEM || ret == -ENOSPC) {
spin_lock(&fsvq->lock);
@@ -488,7 +489,7 @@ static void virtio_fs_hiprio_dispatch_work(struct work_struct *work)
 }
 
 /* Allocate and copy args into req->argbuf */
-static int copy_args_to_argbuf(struct fuse_req *req)
+static int copy_args_to_argbuf(struct fuse_req *req, gfp_t gfp)
 {
struct fuse_args *args = req->args;
unsigned int offset = 0;
@@ -502,7 +503,7 @@ static int copy_args_to_argbuf(struct fuse_req *req)
len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) +
  fuse_len_args(num_out, args->out_args);
 
-   req->argbuf = kmalloc(len, GFP_ATOMIC);
+   req->argbuf = kmalloc(len, gfp);
if (!req->argbuf)
return -ENOMEM;
 
@@ -1119,7 +1120,8 @@ static unsigned int sg_init_fuse_args(struct scatterlist *sg,
 
 /* Add a request to a virtqueue and kick the device */
 static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
-struct fuse_req *req, bool in_flight)
+struct fuse_req *req, bool in_flight,
+gfp_t gfp)
 {
/* requests need at least 4 elements */
struct scatterlist *stack_sgs[6];
@@ -1140,8 +1142,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
/* Does the sglist fit on the stack? */
total_sgs = sg_count_fuse_req(req);
if (total_sgs > ARRAY_SIZE(stack_sgs)) {
-   sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC);
-   sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC);
+   sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp);
+   sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp);
if (!sgs || !sg) {
ret = -ENOMEM;
goto out;
@@ -1149,7 +1151,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
}
 
/* Use a bounce buffer since stack args cannot be mapped */
-   ret = copy_args_to_argbuf(req);
+   ret = copy_args_to_argbuf(req, gfp);
if (ret < 0)
goto out;
 
@@ -1245,7 +1247,7 @@ __releases(fiq->lock)
 fuse_len_args(req->args->out_numargs, req->args->out_args));
 
fsvq = &fs->vqs[queue_id];
-   ret = virtio_fs_enqueue_req(fsvq, req, false);
+   ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC);
if (ret < 0) {
if (ret == -ENOMEM || ret == -ENOSPC) {
/*
-- 
2.29.2




Re: [PATCH v1 1/5] livepatch: Create and include UAPI headers

2024-01-05 Thread Petr Mladek
On Mon 2023-11-06 17:25:09, Lukas Hruska wrote:
> From: Josh Poimboeuf 
> 
> Define klp prefixes in include/uapi/linux/livepatch.h, and use them for
> replacing hard-coded values in kernel/livepatch/core.c.
> 
> Signed-off-by: Josh Poimboeuf 
> Signed-off-by: Lukas Hruska 

Reviewed-by: Petr Mladek 

Best Regards,
Petr



Re: [PATCH v6 2/3] vduse: Temporarily fail if control queue features requested

2024-01-05 Thread Maxime Coquelin




On 1/5/24 10:59, Eugenio Perez Martin wrote:

On Fri, Jan 5, 2024 at 9:12 AM Maxime Coquelin
 wrote:




On 1/5/24 03:45, Jason Wang wrote:

On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin
 wrote:


Virtio-net driver control queue implementation is not safe
when used with VDUSE. If the VDUSE application does not
reply to control queue messages, it currently ends up
hanging the kernel thread sending this command.

Some work is on-going to make the control queue
implementation robust with VDUSE. Until it is completed,
let's fail features check if any control-queue related
feature is requested.

Signed-off-by: Maxime Coquelin 
---
   drivers/vdpa/vdpa_user/vduse_dev.c | 13 +
   1 file changed, 13 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0486ff672408..94f54ea2eb06 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -28,6 +28,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 

   #include "iova_domain.h"
@@ -46,6 +47,15 @@

   #define IRQ_UNBOUND -1

+#define VDUSE_NET_INVALID_FEATURES_MASK \
+   (BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |\
+BIT_ULL(VIRTIO_NET_F_CTRL_RX)   |  \
+BIT_ULL(VIRTIO_NET_F_CTRL_VLAN) |  \
+BIT_ULL(VIRTIO_NET_F_GUEST_ANNOUNCE) | \
+BIT_ULL(VIRTIO_NET_F_MQ) | \
+BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) |  \
+BIT_ULL(VIRTIO_NET_F_RSS))


We need to make this as well:

VIRTIO_NET_F_CTRL_GUEST_OFFLOADS


I missed it, and see others have been added in the Virtio spec
repository (BTW, I see this specific one is missing in the dependency
list [0], I will submit a patch).

I wonder if it is not simpler to just check whether
VIRTIO_NET_F_CTRL_VQ is requested. As we fail instead of masking out,
the VDUSE driver won't be the one violating the spec, so it should be
good?

It will avoid having to update the mask if new features depending on it
are added (or forgetting to update it).

WDYT?



I think it is safer to work with a whitelist instead of a blacklist,
as any new feature might require code changes in QEMU. Is that
possible?


Well, that's how it was done in previous revision. :)

I changed to a blacklist for consistency with the block device's WCE feature
check after Jason's comment.

I'm not sure moving back to a whitelist brings much advantage when
compared to the effort of keeping it up to date. Just blacklisting
VIRTIO_NET_F_CTRL_VQ is enough in my opinion.
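
That is, something along the lines of this sketch. The helper name is
hypothetical; it relies on the spec rule that every control-queue feature
depends on VIRTIO_NET_F_CTRL_VQ:

static bool vduse_net_features_ok(u64 features)
{
	/* Rejecting the base feature rejects the whole CVQ-dependent group. */
	return !(features & BIT_ULL(VIRTIO_NET_F_CTRL_VQ));
}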

Thanks,
Maxime


Thanks,
Maxime

[0]:
https://github.com/oasis-tcs/virtio-spec/blob/5fc35a7efb903fc352da81a6d2be5c01810b68d3/device-types/net/description.tex#L129

Other than this,

Acked-by: Jason Wang 

Thanks


+
   struct vduse_virtqueue {
  u16 index;
  u16 num_max;
@@ -1680,6 +1690,9 @@ static bool features_is_valid(struct vduse_dev_config *config)
  if ((config->device_id == VIRTIO_ID_BLOCK) &&
(config->features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE)))
  return false;
+   else if ((config->device_id == VIRTIO_ID_NET) &&
+   (config->features & VDUSE_NET_INVALID_FEATURES_MASK))
+   return false;

  return true;
   }
--
2.43.0













Re: [PATCH v6 3/3] vduse: enable Virtio-net device type

2024-01-05 Thread Eugenio Perez Martin
On Thu, Jan 4, 2024 at 4:39 PM Maxime Coquelin
 wrote:
>
> This patch adds Virtio-net device type to the supported
> devices types.
>
> Initialization fails if the device does not support the
> VIRTIO_F_VERSION_1 feature, in order to guarantee the
> configuration space is read-only. It also fails with
> -EPERM if the CAP_NET_ADMIN is missing.
>
> Signed-off-by: Maxime Coquelin 

Reviewed-by: Eugenio Pérez 

> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 94f54ea2eb06..4057b34ff995 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -151,6 +151,7 @@ static struct workqueue_struct *vduse_irq_bound_wq;
>
>  static u32 allowed_device_id[] = {
> VIRTIO_ID_BLOCK,
> +   VIRTIO_ID_NET,
>  };
>
>  static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> @@ -1694,6 +1695,10 @@ static bool features_is_valid(struct vduse_dev_config *config)
> (config->features & VDUSE_NET_INVALID_FEATURES_MASK))
> return false;
>
> +   if ((config->device_id == VIRTIO_ID_NET) &&
> +   !(config->features & (1ULL << VIRTIO_F_VERSION_1)))
> +   return false;
> +
> return true;
>  }
>
> @@ -1801,6 +1806,10 @@ static int vduse_create_dev(struct vduse_dev_config *config,
> int ret;
> struct vduse_dev *dev;
>
> +   ret = -EPERM;
> +   if ((config->device_id == VIRTIO_ID_NET) && !capable(CAP_NET_ADMIN))
> +   goto err;
> +
> ret = -EEXIST;
> if (vduse_find_dev(config->name))
> goto err;
> @@ -2044,6 +2053,7 @@ static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
>
>  static struct virtio_device_id id_table[] = {
> { VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
> +   { VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID },
> { 0 },
>  };
>
> --
> 2.43.0
>
>




Re: [PATCH v6 2/3] vduse: Temporarily fail if control queue features requested

2024-01-05 Thread Eugenio Perez Martin
On Fri, Jan 5, 2024 at 9:12 AM Maxime Coquelin
 wrote:
>
>
>
> On 1/5/24 03:45, Jason Wang wrote:
> > On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin
> >  wrote:
> >>
> >> Virtio-net driver control queue implementation is not safe
> >> when used with VDUSE. If the VDUSE application does not
> >> reply to control queue messages, it currently ends up
> >> hanging the kernel thread sending this command.
> >>
> >> Some work is on-going to make the control queue
> >> implementation robust with VDUSE. Until it is completed,
> >> let's fail features check if any control-queue related
> >> feature is requested.
> >>
> >> Signed-off-by: Maxime Coquelin 
> >> ---
> >>   drivers/vdpa/vdpa_user/vduse_dev.c | 13 +
> >>   1 file changed, 13 insertions(+)
> >>
> >> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> >> index 0486ff672408..94f54ea2eb06 100644
> >> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> >> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >> @@ -28,6 +28,7 @@
> >>   #include 
> >>   #include 
> >>   #include 
> >> +#include 
> >>   #include 
> >>
> >>   #include "iova_domain.h"
> >> @@ -46,6 +47,15 @@
> >>
> >>   #define IRQ_UNBOUND -1
> >>
> >> +#define VDUSE_NET_INVALID_FEATURES_MASK \
> >> +   (BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |\
> >> +BIT_ULL(VIRTIO_NET_F_CTRL_RX)   |  \
> >> +BIT_ULL(VIRTIO_NET_F_CTRL_VLAN) |  \
> >> +BIT_ULL(VIRTIO_NET_F_GUEST_ANNOUNCE) | \
> >> +BIT_ULL(VIRTIO_NET_F_MQ) | \
> >> +BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) |  \
> >> +BIT_ULL(VIRTIO_NET_F_RSS))
> >
> > We need to make this as well:
> >
> > VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
>
> I missed it, and I see others have been added in the Virtio spec
> repository (BTW, I see this specific one is missing from the dependency
> list [0]; I will submit a patch).
>
> I wonder if it would not be simpler to just check whether
> VIRTIO_NET_F_CTRL_VQ is requested. As we fail instead of masking out,
> the VDUSE driver won't be the one violating the spec, so it should be
> good?
>
> It will avoid having to update the mask when new features depending on it
> are added (and the risk of forgetting to update it).
>
> WDYT?
>

I think it is safer to work with a whitelist instead of a blacklist,
as any new feature might require code changes in QEMU. Is that
possible?
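
To make the trade-off concrete, below is a minimal sketch of a
whitelist-style check, written as self-contained userspace C. The
helper name and the allowed set are purely illustrative, not a vetted
list; only the uapi feature-bit definitions are real:

/*
 * Reject any feature outside an explicitly allowed set, so that a new
 * control-queue-dependent feature can never slip through an
 * out-of-date blacklist.
 */
#include <stdbool.h>
#include <stdint.h>
#include <linux/virtio_config.h>
#include <linux/virtio_net.h>

#define BIT_ULL(nr) (1ULL << (nr))

/* Illustrative whitelist: only features VDUSE is known to handle. */
#define VDUSE_NET_ALLOWED_FEATURES          \
    (BIT_ULL(VIRTIO_F_VERSION_1) |          \
     BIT_ULL(VIRTIO_F_ACCESS_PLATFORM) |    \
     BIT_ULL(VIRTIO_NET_F_MAC) |            \
     BIT_ULL(VIRTIO_NET_F_MTU))

static bool vduse_net_features_allowed(uint64_t features)
{
    return (features & ~VDUSE_NET_ALLOWED_FEATURES) == 0;
}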

> Thanks,
> Maxime
>
> [0]:
> https://github.com/oasis-tcs/virtio-spec/blob/5fc35a7efb903fc352da81a6d2be5c01810b68d3/device-types/net/description.tex#L129
> > Other than this,
> >
> > Acked-by: Jason Wang 
> >
> > Thanks
> >
> >> +
> >>   struct vduse_virtqueue {
> >>  u16 index;
> >>  u16 num_max;
> >> @@ -1680,6 +1690,9 @@ static bool features_is_valid(struct 
> >> vduse_dev_config *config)
> >>  if ((config->device_id == VIRTIO_ID_BLOCK) &&
> >>  (config->features & (1ULL << 
> >> VIRTIO_BLK_F_CONFIG_WCE)))
> >>  return false;
> >> +   else if ((config->device_id == VIRTIO_ID_NET) &&
> >> +   (config->features & 
> >> VDUSE_NET_INVALID_FEATURES_MASK))
> >> +   return false;
> >>
> >>  return true;
> >>   }
> >> --
> >> 2.43.0
> >>
> >
>
>




[PATCH v10 2/2] tracing: Allow user-space mapping of the ring-buffer

2024-01-05 Thread Vincent Donnefort
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by the
subbuffers constituting the ring-buffer, ordered by their unique page IDs:

  * Meta-page -- see include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader.id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the meta-page reader ID to
point to the reader subbuf containing unread data.
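
For illustration, a minimal user-space consumer built on this series
could look like the sketch below. It assumes the trace_mmap.h uapi
header from this series is installed, that the meta-page fits in a
single page, and it elides most error handling and the actual decoding
of the subbuf contents:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/trace_mmap.h>

int main(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    int fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw",
                  O_RDONLY);
    if (fd < 0)
        return 1;

    /* Map the meta-page alone first, to learn the full mapping size. */
    struct trace_buffer_meta *meta =
        mmap(NULL, psz, PROT_READ, MAP_SHARED, fd, 0);
    if (meta == MAP_FAILED)
        return 1;
    size_t len = meta->meta_page_size +
                 (size_t)meta->nr_subbufs * meta->subbuf_size;
    munmap(meta, psz);

    /* Map the meta-page plus every subbuf, ordered by subbuf ID. */
    char *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return 1;
    meta = (struct trace_buffer_meta *)map;

    /* Point the reader at unread data (waits unless O_NONBLOCK). */
    if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
        return 1;

    char *subbuf = map + meta->meta_page_size +
                   (size_t)meta->reader.id * meta->subbuf_size;
    printf("reader subbuf %u at %p, reader.read=%u\n",
           meta->reader.id, (void *)subbuf, meta->reader.read);

    munmap(map, len);
    close(fd);
    return 0;
}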

Signed-off-by: Vincent Donnefort 

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index f950648b0ba9..8c49489c5867 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -26,4 +26,6 @@ struct trace_buffer_meta {
__u32   meta_struct_len;/* Len of this struct */
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER_IO('T', 0x1)
+
 #endif /* _UAPI_TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 46dbe22121e9..cfeaf2cd204e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8583,15 +8583,31 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
struct trace_iterator *iter = &info->iter;
+   int err;
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent);
+   if (err)
+   return err;
+   }
 
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
+
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
mutex_lock(&trace_types_lock);
 
iter->wait_index++;
@@ -8604,6 +8620,62 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+   struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
+   struct page *page;
+
+   page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file,
+vmf->pgoff);
+   if (!page)
+   return ret;
+
+   get_page(page);
+   vmf->page = page;
+   vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+   vmf->page->index = vmf->pgoff;
+
+   return 0;
+}
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file);
+}
+
+static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file));
+}
+
+static const struct vm_operations_struct tracing_buffers_vmops = {
+   .open   = tracing_buffers_mmap_open,
+   .close  = tracing_buffers_mmap_close,
+   .fault  = tracing_buffers_mmap_fault,
+};
+
+static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = filp->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   if (vma->vm_flags & VM_WRITE)
+   return -EPERM;
+
+   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE);
+   vma->vm_ops = &tracing_buffers_vmops;
+
+   return ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file);
+}
+
 static const struct file_operations tracing_buffers_fops = {
.open   = tracing_buffers_open,
.rea

[PATCH v10 1/2] ring-buffer: Introducing ring-buffer mapping functions

2024-01-05 Thread Vincent Donnefort
In preparation for allowing user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer; currently, subbufs are
  only identified through their in-kernel VA.

  A meta-page, where ring-buffer statistics and a description of the
  current reader are stored.

The linear mapping exposes the meta-page and each subbuf of the
ring-buffer, ordered by their unique IDs, which are assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice-enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+  unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..f950648b0ba9
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_TRACE_MMAP_H_
+#define _UAPI_TRACE_MMAP_H_
+
+#include 
+
+struct trace_buffer_meta {
+   unsigned long   entries;
+   unsigned long   overrun;
+   unsigned long   read;
+
+   unsigned long   subbufs_touched;
+   unsigned long   subbufs_lost;
+   unsigned long   subbufs_read;
+
+   struct {
+   unsigned long   lost_events;/* Events lost at the time of 
the reader swap */
+   __u32   id; /* Reader subbuf ID from 0 to 
nr_subbufs - 1 */
+   __u32   read;   /* Number of bytes read on the 
reader subbuf */
+   } reader;
+
+   __u32   subbuf_size;/* Size of each subbuf 
including the header */
+   __u32   nr_subbufs; /* Number of subbufs in the 
ring-buffer */
+
+   __u32   meta_page_size; /* Size of the meta-page */
+   __u32   meta_struct_len;/* Len of this struct */
+};
+
+#endif /* _UAPI_TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 173d2595ce2d..63426e525781 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -338,6 +338,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -388,6 +389,7 @@ struct rb_irq_work {
boolwaiters_pending;
boolfull_waiters_pending;
boolwakeup_full;
+   boolis_cpu_buffer;
 };
 
 /*
@@ -484,6 +486,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   int mapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to addr */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -739,6 +747,33 @@ static __always_inline bool full_hit(struct trace_buffer 
*buffer, int cpu, int f
return (dirty * 100) > (full * nr_pages);
 }
 
+static void rb_update_meta_page(struct rb_irq_work *rbwork)
+{
+   struct ring_buffer_per_cpu *cpu_buffer;
+
+   if (!rbwork->is_cpu_buffer)
+   return;
+   /*
+* If the waiter is a cpu_buffer, this might be due to a
+* userspace mapping. Let's update the meta-page.
+*/
+   cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu,
+

[PATCH v10 0/2] Introducing trace buffer mapping by user-space

2024-01-05 Thread Vincent Donnefort
The tracing ring-buffers can be stored on disk or sent to the network
without any copy via splice. However, the latter doesn't allow real-time
processing of the traces. A solution is to give userspace direct access
to the ring-buffer pages via a mapping. An application can now become a
consumer of the ring-buffer, in a similar fashion to what trace_pipe
offers.

Support for this new feature in libtracefs can be found here:

  https://lore.kernel.org/all/20231228201100.78aae...@rorschach.local.home

Vincent

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
   - pages_lost / pages_read (allows computing how full the
 ring-buffer is)
   - read (allows computing how many entries can be read)
   - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.

---

Vincent Donnefort (2):
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Allow user-space mapping of the ring-buffer

 include/linux/ring_buffer.h |   7 +
 include/uapi/linux/trace_mmap.h |  31 +++
 kernel/trace/ring_buffer.c  | 384 +++-
 kernel/trace/trace.c|  79 ++-
 4 files changed, 497 insertions(+), 4 deletions(-)
 create mode 100644 include/uapi/linux/trace_mmap.h


base-commit: 3cb3091138ca0921c4569bcf7ffa062519639b6a
-- 
2.43.0.472.g3155946c3a-goog




Re: [PATCH v6 01/12] cgroup/misc: Add per resource callbacks for CSS events

2024-01-05 Thread Michal Koutný
On Mon, Oct 30, 2023 at 11:20:02AM -0700, Haitao Huang 
 wrote:
> From: Kristen Carlson Accardi 
> 
> The misc cgroup controller (subsystem) currently does not perform
> resource type specific action for Cgroups Subsystem State (CSS) events:
> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> when a cgroup is destroyed.
> 
> Define callbacks for those events and allow resource providers to
> register the callbacks per resource type as needed. This will be
> utilized later by the EPC misc cgroup support implemented in the SGX
> driver.

I notice now that the callbacks are per resource and per cgroup.
I think that is overkill: it complicates the misc_cg allocation code,
which uses the parent's callbacks, while freeing is done through the
cgroup's own callbacks. It is possibly also too liberal a design to keep
the objects' lifecycle understandable in the long term.

For some uniformity, I'd suggest the struct blkcg_policy array in
block/blk-cgroup.c as a precedent. (Perhaps indexed with the static enum
misc_res_type instead of the dynamic index assignment that
blkcg_policy_register() does.)
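
A rough sketch of that shape, with hypothetical names (struct
misc_res_ops, misc_cg_set_ops()) since no such API exists today, could
be:

#include <linux/errno.h>
#include <linux/misc_cgroup.h>

struct misc_res_ops {
    int  (*alloc)(struct misc_cg *cg);  /* called from css_alloc */
    void (*free)(struct misc_cg *cg);   /* called from css_free */
};

/* One static slot per resource type, like the blkcg_policy array. */
static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];

/* A resource provider (e.g. the SGX EPC driver) registers once. */
int misc_cg_set_ops(enum misc_res_type type,
                    const struct misc_res_ops *ops)
{
    if (type >= MISC_CG_RES_TYPES)
        return -EINVAL;
    misc_res_ops[type] = ops;
    return 0;
}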

I hope you don't really need per-cgroup callbacks :-)

Michal




Re: [PATCH v6 1/3] vduse: validate block features only with block devices

2024-01-05 Thread Eugenio Perez Martin
On Thu, Jan 4, 2024 at 4:38 PM Maxime Coquelin
 wrote:
>
> This patch is preliminary work to enable network device
> type support in VDUSE.
>
> As VIRTIO_BLK_F_CONFIG_WCE shares the same value as
> VIRTIO_NET_F_HOST_TSO4, we need to restrict its check
> to Virtio-blk device type.
>
> Acked-by: Jason Wang 
> Reviewed-by: Xie Yongji 
> Signed-off-by: Maxime Coquelin 

Reviewed-by: Eugenio Pérez 

> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 0ddd4b8abecb..0486ff672408 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1671,13 +1671,14 @@ static bool device_is_allowed(u32 device_id)
> return false;
>  }
>
> -static bool features_is_valid(u64 features)
> +static bool features_is_valid(struct vduse_dev_config *config)
>  {
> -   if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> +   if (!(config->features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> return false;
>
> /* Now we only support read-only configuration space */
> -   if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
> +   if ((config->device_id == VIRTIO_ID_BLOCK) &&
> +   (config->features & (1ULL << 
> VIRTIO_BLK_F_CONFIG_WCE)))
> return false;
>
> return true;
> @@ -1704,7 +1705,7 @@ static bool vduse_validate_config(struct 
> vduse_dev_config *config)
> if (!device_is_allowed(config->device_id))
> return false;
>
> -   if (!features_is_valid(config->features))
> +   if (!features_is_valid(config))
> return false;
>
> return true;
> --
> 2.43.0
>
>




Re: [PATCH v6 2/3] vduse: Temporarily fail if control queue features requested

2024-01-05 Thread Maxime Coquelin




On 1/5/24 03:45, Jason Wang wrote:

On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin
 wrote:


Virtio-net driver control queue implementation is not safe
when used with VDUSE. If the VDUSE application does not
reply to control queue messages, it currently ends up
hanging the kernel thread sending this command.

Some work is on-going to make the control queue
implementation robust with VDUSE. Until it is completed,
let's fail features check if any control-queue related
feature is requested.

Signed-off-by: Maxime Coquelin 
---
  drivers/vdpa/vdpa_user/vduse_dev.c | 13 +
  1 file changed, 13 insertions(+)

diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0486ff672408..94f54ea2eb06 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -28,6 +28,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 

  #include "iova_domain.h"
@@ -46,6 +47,15 @@

  #define IRQ_UNBOUND -1

+#define VDUSE_NET_INVALID_FEATURES_MASK \
+   (BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |\
+BIT_ULL(VIRTIO_NET_F_CTRL_RX)   |  \
+BIT_ULL(VIRTIO_NET_F_CTRL_VLAN) |  \
+BIT_ULL(VIRTIO_NET_F_GUEST_ANNOUNCE) | \
+BIT_ULL(VIRTIO_NET_F_MQ) | \
+BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) |  \
+BIT_ULL(VIRTIO_NET_F_RSS))


We need to make this as well:

VIRTIO_NET_F_CTRL_GUEST_OFFLOADS


I missed it, and I see others have been added in the Virtio spec
repository (BTW, I see this specific one is missing from the dependency
list [0]; I will submit a patch).

I wonder if it would not be simpler to just check whether
VIRTIO_NET_F_CTRL_VQ is requested. As we fail instead of masking out,
the VDUSE driver won't be the one violating the spec, so it should be
good?

It will avoid having to update the mask when new features depending on it
are added (and the risk of forgetting to update it).

WDYT?

Thanks,
Maxime

[0]: 
https://github.com/oasis-tcs/virtio-spec/blob/5fc35a7efb903fc352da81a6d2be5c01810b68d3/device-types/net/description.tex#L129

Other than this,

Acked-by: Jason Wang 

Thanks


+
  struct vduse_virtqueue {
 u16 index;
 u16 num_max;
@@ -1680,6 +1690,9 @@ static bool features_is_valid(struct vduse_dev_config 
*config)
 if ((config->device_id == VIRTIO_ID_BLOCK) &&
 (config->features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE)))
 return false;
+   else if ((config->device_id == VIRTIO_ID_NET) &&
+   (config->features & VDUSE_NET_INVALID_FEATURES_MASK))
+   return false;

 return true;
  }
--
2.43.0