Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker
Hi Vivek,

On 1/6/2024 5:27 AM, Vivek Goyal wrote:
> On Fri, Jan 05, 2024 at 08:57:55PM +, Matthew Wilcox wrote:
>> On Fri, Jan 05, 2024 at 03:41:48PM -0500, Vivek Goyal wrote:
>>> On Fri, Jan 05, 2024 at 08:21:00PM +, Matthew Wilcox wrote: On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
> On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
>> From: Hou Tao
>>
>> When invoking virtio_fs_enqueue_req() through kworker, both the
>> allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
>> Considering the size of both the sg array and the bounce buffer may be
>> greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
>> possibility of memory allocation failure.
>>
> What's the practical benefit of this patch? Looks like if memory
> allocation fails, we keep retrying at interval of 1ms and don't
> return error to user space.

Motivation for GFP_NOFS comes from another fix proposed for virtiofs [1], in which, when trying to insert a big kernel module kept in a cache-disabled virtiofs, the length of the fuse args will be large (e.g., 10MB), and the memory allocation in copy_args_to_argbuf() will fail forever. The proposed fix tries to solve the problem by limiting the length of the data kept in a fuse arg, but because the limit is still large (256KB in that patch), I think using GFP_NOFS will also be helpful for such memory allocations.

[1]: https://lore.kernel.org/linux-fsdevel/20240103105929.1902658-1-hou...@huaweicloud.com/

You don't deplete the atomic reserves unnecessarily?
>>> Sounds reasonable.
>>>
>>> With GFP_NOFS specified, can we still get -ENOMEM? Or this will block
>>> indefinitely till memory can be allocated.
>> If you need the "loop indefinitely" behaviour, that's
>> GFP_NOFS | __GFP_NOFAIL. If you're actually doing something yourself
>> which can free up memory, this is a bad choice. If you're just sleeping
>> and retrying, you might as well have the MM do that for you.

Even with GFP_NOFS, I think -ENOMEM is still possible, so the retry logic is still necessary.

> I probably don't want to wait indefinitely. There might be some cases
> where I might want to return error to user space. For example, if
> virtiofs device has been hot-unplugged, then there is no point in
> waiting indefinitely for memory allocation. Even if memory was allocated,
> soon we will return error to user space with -ENOTCONN.
>
> We are currently not doing that check after memory allocation failure but
> we probably could as an optimization.

Yes. It seems virtio_fs_enqueue_req() only checks fsvq->connected before writing the sg list to the virtqueue, so if the virtio device is hot-unplugged and free memory is low, it may do unnecessary retries. Even worse, it may hang. I volunteer to post a patch to check the connected status after memory allocation fails, if you are OK with that.

> So this patch looks good to me as it is. Thanks Hou Tao.
>
> Reviewed-by: Vivek Goyal
>
> Thanks
> Vivek
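A minimal sketch of the follow-up Hou Tao volunteers above, assuming the existing fsvq->lock, fsvq->connected, queued_reqs and dispatch_work fields of virtio_fs_vq; the placement and the error path are illustrative, not the eventually posted patch:

	ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_NOFS);
	if (ret == -ENOMEM || ret == -ENOSPC) {
		spin_lock(&fsvq->lock);
		if (!fsvq->connected) {
			/* Device was hot-unplugged: fail instead of retrying. */
			spin_unlock(&fsvq->lock);
			req->out.h.error = -ENOTCONN;
			fuse_request_end(req);
			return;
		}
		/* Still connected: requeue and let the worker retry in 1ms. */
		list_add_tail(&req->list, &fsvq->queued_reqs);
		schedule_delayed_work(&fsvq->dispatch_work, msecs_to_jiffies(1));
		spin_unlock(&fsvq->lock);
	}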
Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker
On 1/6/2024 4:21 AM, Matthew Wilcox wrote:
> On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
>> On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
>>> From: Hou Tao
>>>
>>> When invoking virtio_fs_enqueue_req() through kworker, both the
>>> allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
>>> Considering the size of both the sg array and the bounce buffer may be
>>> greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
>>> possibility of memory allocation failure.
>>>
>> What's the practical benefit of this patch? Looks like if memory
>> allocation fails, we keep retrying at interval of 1ms and don't
>> return error to user space.
> You don't deplete the atomic reserves unnecessarily?

Besides that, I think the proposed GFP_NOFS may reduce unnecessary retries. I should mention that in the commit message. Should I post a v3 to do that?
Re: [PATCH v6 07/12] x86/sgx: Introduce EPC page states
On Fri, 05 Jan 2024 11:57:03 -0600, Dave Hansen wrote:

On 10/30/23 11:20, Haitao Huang wrote:

@@ -527,16 +530,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
 	spin_lock(&sgx_global_lru.lock);
-	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
-		/* The page is being reclaimed. */
-		if (list_empty(&page->list)) {
-			spin_unlock(&sgx_global_lru.lock);
-			return -EBUSY;
-		}
-
-		list_del(&page->list);
-		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
+		spin_unlock(&sgx_global_lru.lock);
+		return -EBUSY;
 	}
+
+	list_del(&page->list);
+	sgx_epc_page_reset_state(page);

I want to know how much of this series is basically line-for-line abstraction shifting like:

-	page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+	sgx_epc_page_reset_state(page);

versus actually adding complexity. That way, I might be able to offer some advice on where this can be pared down. That's really hard to do with the current series.

Please don't just "introduce new page states". This should have first abstracted out the sgx_epc_page_reclaim_in_progress() operation, using the list_empty() check as the implementation. Then, in a separate patch, introduce the concept of the "reclaim in progress" flag and finally flip the implementation over. Ditto for the sgx_epc_page_reset_state() abstraction. It should have been introduced separately as a concept and then have the implementation changed.

On to patch 10 (which is much too big), which introduces the sgx_lru_list() abstraction.

Sure. I'll try to refactor according to this plan.

Thanks
Haitao
Re: [PATCH v6 01/12] cgroup/misc: Add per resource callbacks for CSS events
On Fri, 05 Jan 2024 03:45:02 -0600, Michal Koutný wrote:

On Mon, Oct 30, 2023 at 11:20:02AM -0700, Haitao Huang wrote:
From: Kristen Carlson Accardi

The misc cgroup controller (subsystem) currently does not perform resource-type-specific actions for Cgroups Subsystem State (CSS) events: the 'css_alloc' event when a cgroup is created and the 'css_free' event when a cgroup is destroyed. Define callbacks for those events and allow resource providers to register the callbacks per resource type as needed. This will be utilized later by the EPC misc cgroup support implemented in the SGX driver.

I notice now that the callbacks are per resource and per cgroup. I think that is overkill: it complicates the misc_cg allocation code with the parent's callbacks while freeing is done by a cgroup's own callbacks. It's possibly too liberal to keep the objects' lifecycles understandable in the long term too.

For some uniformity, I'd suggest the struct blkcg_policy array in block/blk-cgroup.c as the precedent. (Perhaps indexed with the static enum misc_res_type instead of the dynamic index assignment that blkcg_policy_register() does.) I hope you don't really need per-cgroup callbacks :-)

Michal

Sure, I can do that. IIUC, you are suggesting something similar to how capacity is set via misc_cg_set_capacity() for each resource type. Here we would add a misc_cg_set_ops(), which stores the ops pointers in a static array and then uses them indexed by resource type.

Thanks
Haitao
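A hedged sketch of what Haitao describes, modeled on how misc_cg_set_capacity() works today; the misc_res_ops structure and misc_cg_set_ops() are assumptions about the future patch, not existing kernel API:

	/* Illustrative only: per-resource-type CSS callbacks in a static array. */
	struct misc_res_ops {
		int (*alloc)(struct misc_cg *cg);
		void (*free)(struct misc_cg *cg);
	};

	static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];

	int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
	{
		if (!valid_type(type))
			return -EINVAL;

		misc_res_ops[type] = ops;
		return 0;
	}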
Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker
On Fri, Jan 05, 2024 at 08:57:55PM +, Matthew Wilcox wrote: > On Fri, Jan 05, 2024 at 03:41:48PM -0500, Vivek Goyal wrote: > > On Fri, Jan 05, 2024 at 08:21:00PM +, Matthew Wilcox wrote: > > > On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote: > > > > On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote: > > > > > From: Hou Tao > > > > > > > > > > When invoking virtio_fs_enqueue_req() through kworker, both the > > > > > allocation of the sg array and the bounce buffer still use GFP_ATOMIC. > > > > > Considering the size of both the sg array and the bounce buffer may be > > > > > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower > > > > > the > > > > > possibility of memory allocation failure. > > > > > > > > > > > > > What's the practical benefit of this patch. Looks like if memory > > > > allocation fails, we keep retrying at interval of 1ms and don't > > > > return error to user space. > > > > > > You don't deplete the atomic reserves unnecessarily? > > > > Sounds reasonable. > > > > With GFP_NOFS specificed, can we still get -ENOMEM? Or this will block > > indefinitely till memory can be allocated. > > If you need the "loop indefinitely" behaviour, that's > GFP_NOFS | __GFP_NOFAIL. If you're actually doing something yourself > which can free up memory, this is a bad choice. If you're just sleeping > and retrying, you might as well have the MM do that for you. I probably don't want to wait indefinitely. There might be some cases where I might want to return error to user space. For example, if virtiofs device has been hot-unplugged, then there is no point in waiting indefinitely for memory allocation. Even if memory was allocated, soon we will return error to user space with -ENOTCONN. We are currently not doing that check after memory allocation failure but we probably could as an optimization. So this patch looks good to me as it is. Thanks Hou Tao. Reviewed-by: Vivek Goyal Thanks Vivek
Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker
On Fri, Jan 05, 2024 at 03:41:48PM -0500, Vivek Goyal wrote: > On Fri, Jan 05, 2024 at 08:21:00PM +, Matthew Wilcox wrote: > > On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote: > > > On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote: > > > > From: Hou Tao > > > > > > > > When invoking virtio_fs_enqueue_req() through kworker, both the > > > > allocation of the sg array and the bounce buffer still use GFP_ATOMIC. > > > > Considering the size of both the sg array and the bounce buffer may be > > > > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the > > > > possibility of memory allocation failure. > > > > > > > > > > What's the practical benefit of this patch. Looks like if memory > > > allocation fails, we keep retrying at interval of 1ms and don't > > > return error to user space. > > > > You don't deplete the atomic reserves unnecessarily? > > Sounds reasonable. > > With GFP_NOFS specificed, can we still get -ENOMEM? Or this will block > indefinitely till memory can be allocated. If you need the "loop indefinitely" behaviour, that's GFP_NOFS | __GFP_NOFAIL. If you're actually doing something yourself which can free up memory, this is a bad choice. If you're just sleeping and retrying, you might as well have the MM do that for you.
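Illustratively, the distinction Matthew draws (these semantics hold for any GFP_NOFS user, not just virtiofs):

	buf = kmalloc(len, GFP_NOFS);                 /* may still return NULL */
	buf = kmalloc(len, GFP_NOFS | __GFP_NOFAIL);  /* MM sleeps and retries internally; never NULL */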
Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker
On Fri, Jan 05, 2024 at 08:21:00PM +, Matthew Wilcox wrote:
> On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote:
> > On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote:
> > > From: Hou Tao
> > >
> > > When invoking virtio_fs_enqueue_req() through kworker, both the
> > > allocation of the sg array and the bounce buffer still use GFP_ATOMIC.
> > > Considering the size of both the sg array and the bounce buffer may be
> > > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the
> > > possibility of memory allocation failure.
> > >
> > What's the practical benefit of this patch? Looks like if memory
> > allocation fails, we keep retrying at interval of 1ms and don't
> > return error to user space.
>
> You don't deplete the atomic reserves unnecessarily?

Sounds reasonable.

With GFP_NOFS specified, can we still get -ENOMEM? Or will this block indefinitely till memory can be allocated?

I am trying to figure out whether, with GFP_NOFS, we still need to check for -ENOMEM while requeueing the req and asking the worker thread to retry after 1ms.

Thanks
Vivek
Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker
On Fri, Jan 05, 2024 at 03:17:19PM -0500, Vivek Goyal wrote: > On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote: > > From: Hou Tao > > > > When invoking virtio_fs_enqueue_req() through kworker, both the > > allocation of the sg array and the bounce buffer still use GFP_ATOMIC. > > Considering the size of both the sg array and the bounce buffer may be > > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the > > possibility of memory allocation failure. > > > > What's the practical benefit of this patch. Looks like if memory > allocation fails, we keep retrying at interval of 1ms and don't > return error to user space. You don't deplete the atomic reserves unnecessarily?
Re: [PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker
On Fri, Jan 05, 2024 at 06:53:05PM +0800, Hou Tao wrote: > From: Hou Tao > > When invoking virtio_fs_enqueue_req() through kworker, both the > allocation of the sg array and the bounce buffer still use GFP_ATOMIC. > Considering the size of both the sg array and the bounce buffer may be > greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the > possibility of memory allocation failure. > What's the practical benefit of this patch. Looks like if memory allocation fails, we keep retrying at interval of 1ms and don't return error to user space. Thanks Vivek > Signed-off-by: Hou Tao > --- > Change log: > v2: > * pass gfp_t instead of bool to virtio_fs_enqueue_req() (Suggested by > Matthew) > > v1: > https://lore.kernel.org/linux-fsdevel/20240104015805.2103766-1-hou...@huaweicloud.com > > fs/fuse/virtio_fs.c | 20 +++- > 1 file changed, 11 insertions(+), 9 deletions(-) > > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c > index 3aac31d451985..8cf518624ce9e 100644 > --- a/fs/fuse/virtio_fs.c > +++ b/fs/fuse/virtio_fs.c > @@ -87,7 +87,8 @@ struct virtio_fs_req_work { > }; > > static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq, > - struct fuse_req *req, bool in_flight); > + struct fuse_req *req, bool in_flight, > + gfp_t gfp); > > static const struct constant_table dax_param_enums[] = { > {"always", FUSE_DAX_ALWAYS }, > @@ -383,7 +384,7 @@ static void virtio_fs_request_dispatch_work(struct > work_struct *work) > list_del_init(&req->list); > spin_unlock(&fsvq->lock); > > - ret = virtio_fs_enqueue_req(fsvq, req, true); > + ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_NOFS); > if (ret < 0) { > if (ret == -ENOMEM || ret == -ENOSPC) { > spin_lock(&fsvq->lock); > @@ -488,7 +489,7 @@ static void virtio_fs_hiprio_dispatch_work(struct > work_struct *work) > } > > /* Allocate and copy args into req->argbuf */ > -static int copy_args_to_argbuf(struct fuse_req *req) > +static int copy_args_to_argbuf(struct fuse_req *req, gfp_t gfp) > { > struct fuse_args *args = req->args; > unsigned int offset = 0; > @@ -502,7 +503,7 @@ static int copy_args_to_argbuf(struct fuse_req *req) > len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) + > fuse_len_args(num_out, args->out_args); > > - req->argbuf = kmalloc(len, GFP_ATOMIC); > + req->argbuf = kmalloc(len, gfp); > if (!req->argbuf) > return -ENOMEM; > > @@ -1119,7 +1120,8 @@ static unsigned int sg_init_fuse_args(struct > scatterlist *sg, > > /* Add a request to a virtqueue and kick the device */ > static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq, > - struct fuse_req *req, bool in_flight) > + struct fuse_req *req, bool in_flight, > + gfp_t gfp) > { > /* requests need at least 4 elements */ > struct scatterlist *stack_sgs[6]; > @@ -1140,8 +1142,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq > *fsvq, > /* Does the sglist fit on the stack? 
*/ > total_sgs = sg_count_fuse_req(req); > if (total_sgs > ARRAY_SIZE(stack_sgs)) { > - sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC); > - sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC); > + sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp); > + sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp); > if (!sgs || !sg) { > ret = -ENOMEM; > goto out; > @@ -1149,7 +1151,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq > *fsvq, > } > > /* Use a bounce buffer since stack args cannot be mapped */ > - ret = copy_args_to_argbuf(req); > + ret = copy_args_to_argbuf(req, gfp); > if (ret < 0) > goto out; > > @@ -1245,7 +1247,7 @@ __releases(fiq->lock) >fuse_len_args(req->args->out_numargs, req->args->out_args)); > > fsvq = &fs->vqs[queue_id]; > - ret = virtio_fs_enqueue_req(fsvq, req, false); > + ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC); > if (ret < 0) { > if (ret == -ENOMEM || ret == -ENOSPC) { > /* > -- > 2.29.2 >
Re: [PATCH v6 00/12] Add Cgroup support for SGX EPC memory
On Fri, 05 Jan 2024 12:29:05 -0600, Dave Hansen wrote:

There's very little about how the LRU design came to be in this cover letter. Let's add some details. How's this?

Writing this up, I'm a lot more convinced that this series is, in general, taking the right approach. I honestly don't see any other alternatives. As much as I'd love to do something stupidly simple like just killing enclaves at the moment they hit the limit, that would be a horrid experience for users _and_ a departure from what the existing reclaim support does.

That said, there's still a lot of work to do to refactor this series. It's in need of some love to make it more clear what is going on and to make the eventual switch over to per-cgroup LRUs more gradual. Each patch in the series is still doing way too much, _especially_ patch 10.

==

The existing EPC memory management aims to be a miniature version of the core VM where EPC memory can be overcommitted and reclaimed. EPC allocations can wait for reclaim. The alternative to waiting would have been to send a signal and let the enclave die. This series attempts to implement that same logic for cgroups, for the same reasons: it's preferable to wait for memory to become available and let reclaim happen than to do things that are fatal to enclaves.

There is currently a global reclaimable page SGX LRU list. That list (and the existing scanning algorithm) is essentially useless for doing reclaim when a cgroup hits its limit because the cgroup's pages are scattered around that LRU. It is unspeakably inefficient to scan a linked list with millions of entries for what could be dozens of pages from a cgroup that needs reclaim.

Even if unspeakably slow reclaim was accepted, the existing scanning algorithm only picks a few pages off the head of the global LRU. It would either need to hold the list locks for unreasonable amounts of time, or be taught to scan the list in pieces, which has its own challenges.

tl;dr: A cgroup hitting its limit should be as similar as possible to the system running out of EPC memory. The only two choices to implement that are nasty changes to the existing LRU scanning algorithm, or adding new LRUs.

The result: Add a new LRU for each cgroup and scan those instead. Replace the existing global LRU with the root cgroup's LRU (only when this new support is compiled in, obviously).

I'll add this to the cover letter as a section justifying the LRU design for per-cgroup reclaiming. Thank you very much.

Haitao
Re: [PATCH v6 00/12] Add Cgroup support for SGX EPC memory
There's very little about how the LRU design came to be in this cover letter. Let's add some details. How's this?

Writing this up, I'm a lot more convinced that this series is, in general, taking the right approach. I honestly don't see any other alternatives. As much as I'd love to do something stupidly simple like just killing enclaves at the moment they hit the limit, that would be a horrid experience for users _and_ a departure from what the existing reclaim support does.

That said, there's still a lot of work to do to refactor this series. It's in need of some love to make it more clear what is going on and to make the eventual switch over to per-cgroup LRUs more gradual. Each patch in the series is still doing way too much, _especially_ patch 10.

==

The existing EPC memory management aims to be a miniature version of the core VM where EPC memory can be overcommitted and reclaimed. EPC allocations can wait for reclaim. The alternative to waiting would have been to send a signal and let the enclave die. This series attempts to implement that same logic for cgroups, for the same reasons: it's preferable to wait for memory to become available and let reclaim happen than to do things that are fatal to enclaves.

There is currently a global reclaimable page SGX LRU list. That list (and the existing scanning algorithm) is essentially useless for doing reclaim when a cgroup hits its limit because the cgroup's pages are scattered around that LRU. It is unspeakably inefficient to scan a linked list with millions of entries for what could be dozens of pages from a cgroup that needs reclaim.

Even if unspeakably slow reclaim was accepted, the existing scanning algorithm only picks a few pages off the head of the global LRU. It would either need to hold the list locks for unreasonable amounts of time, or be taught to scan the list in pieces, which has its own challenges.

tl;dr: A cgroup hitting its limit should be as similar as possible to the system running out of EPC memory. The only two choices to implement that are nasty changes to the existing LRU scanning algorithm, or adding new LRUs.

The result: Add a new LRU for each cgroup and scan those instead. Replace the existing global LRU with the root cgroup's LRU (only when this new support is compiled in, obviously).
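A hedged sketch of the per-cgroup LRU selection described above; the epc_cg field and the root-LRU symbol are guesses at the series' eventual shape, not the actual patch 10:

	static struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *page)
	{
	#ifdef CONFIG_CGROUP_SGX_EPC
		/* Each cgroup owns an LRU; uncharged pages use the root cgroup's. */
		if (page->epc_cg)
			return &page->epc_cg->lru;
		return &sgx_epc_cg_root.lru;
	#else
		return &sgx_global_lru;
	#endif
	}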
Re: [PATCH v6 07/12] x86/sgx: Introduce EPC page states
On 10/30/23 11:20, Haitao Huang wrote:
> @@ -527,16 +530,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
> int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
> {
> 	spin_lock(&sgx_global_lru.lock);
> -	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> -		/* The page is being reclaimed. */
> -		if (list_empty(&page->list)) {
> -			spin_unlock(&sgx_global_lru.lock);
> -			return -EBUSY;
> -		}
> -
> -		list_del(&page->list);
> -		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> +	if (sgx_epc_page_reclaim_in_progress(page->flags)) {
> +		spin_unlock(&sgx_global_lru.lock);
> +		return -EBUSY;
> 	}
> +
> +	list_del(&page->list);
> +	sgx_epc_page_reset_state(page);

I want to know how much of this series is basically line-for-line abstraction shifting like:

-	page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+	sgx_epc_page_reset_state(page);

versus actually adding complexity. That way, I might be able to offer some advice on where this can be pared down. That's really hard to do with the current series.

Please don't just "introduce new page states". This should have first abstracted out the sgx_epc_page_reclaim_in_progress() operation, using the list_empty() check as the implementation. Then, in a separate patch, introduce the concept of the "reclaim in progress" flag and finally flip the implementation over. Ditto for the sgx_epc_page_reset_state() abstraction. It should have been introduced separately as a concept and then have the implementation changed.

On to patch 10 (which is much too big), which introduces the sgx_lru_list() abstraction.
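A sketch of the first refactoring step Dave asks for, keeping today's list_empty() semantics behind the new helper before any flag is introduced:

	/* Today, "reclaim in progress" means tracked but already taken off the LRU. */
	static inline bool sgx_epc_page_reclaim_in_progress(struct sgx_epc_page *page)
	{
		return (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) &&
		       list_empty(&page->list);
	}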
Re: [PATCH v5 22/34] tracing: Rename ftrace_regs_return_value to ftrace_regs_get_return_value
On Mon, Dec 18, 2023 at 10:15:59PM +0900, Masami Hiramatsu (Google) wrote: > From: Masami Hiramatsu (Google) > > Rename ftrace_regs_return_value to ftrace_regs_get_return_value as same as > other ftrace_regs_get/set_* APIs. > > Signed-off-by: Masami Hiramatsu (Google) Acked-by: Mark Rutland Since this is a trivial cleanup, it might make sense to move this to the start of the series, so that it can be queued even if the later parts need more work. Mark. > --- > Changes in v3: > - Newly added. > --- > arch/loongarch/include/asm/ftrace.h |2 +- > arch/powerpc/include/asm/ftrace.h |2 +- > arch/s390/include/asm/ftrace.h |2 +- > arch/x86/include/asm/ftrace.h |2 +- > include/linux/ftrace.h |2 +- > kernel/trace/fgraph.c |2 +- > 6 files changed, 6 insertions(+), 6 deletions(-) > > diff --git a/arch/loongarch/include/asm/ftrace.h > b/arch/loongarch/include/asm/ftrace.h > index a11996eb5892..a9c3d0f2f941 100644 > --- a/arch/loongarch/include/asm/ftrace.h > +++ b/arch/loongarch/include/asm/ftrace.h > @@ -70,7 +70,7 @@ ftrace_regs_set_instruction_pointer(struct ftrace_regs > *fregs, unsigned long ip) > regs_get_kernel_argument(&(fregs)->regs, n) > #define ftrace_regs_get_stack_pointer(fregs) \ > kernel_stack_pointer(&(fregs)->regs) > -#define ftrace_regs_return_value(fregs) \ > +#define ftrace_regs_get_return_value(fregs) \ > regs_return_value(&(fregs)->regs) > #define ftrace_regs_set_return_value(fregs, ret) \ > regs_set_return_value(&(fregs)->regs, ret) > diff --git a/arch/powerpc/include/asm/ftrace.h > b/arch/powerpc/include/asm/ftrace.h > index 9e5a39b6a311..7e138e0e3baf 100644 > --- a/arch/powerpc/include/asm/ftrace.h > +++ b/arch/powerpc/include/asm/ftrace.h > @@ -69,7 +69,7 @@ ftrace_regs_get_instruction_pointer(struct ftrace_regs > *fregs) > regs_get_kernel_argument(&(fregs)->regs, n) > #define ftrace_regs_get_stack_pointer(fregs) \ > kernel_stack_pointer(&(fregs)->regs) > -#define ftrace_regs_return_value(fregs) \ > +#define ftrace_regs_get_return_value(fregs) \ > regs_return_value(&(fregs)->regs) > #define ftrace_regs_set_return_value(fregs, ret) \ > regs_set_return_value(&(fregs)->regs, ret) > diff --git a/arch/s390/include/asm/ftrace.h b/arch/s390/include/asm/ftrace.h > index 5a82b08f03cd..01e775c98425 100644 > --- a/arch/s390/include/asm/ftrace.h > +++ b/arch/s390/include/asm/ftrace.h > @@ -88,7 +88,7 @@ ftrace_regs_set_instruction_pointer(struct ftrace_regs > *fregs, > regs_get_kernel_argument(&(fregs)->regs, n) > #define ftrace_regs_get_stack_pointer(fregs) \ > kernel_stack_pointer(&(fregs)->regs) > -#define ftrace_regs_return_value(fregs) \ > +#define ftrace_regs_get_return_value(fregs) \ > regs_return_value(&(fregs)->regs) > #define ftrace_regs_set_return_value(fregs, ret) \ > regs_set_return_value(&(fregs)->regs, ret) > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h > index 0b306c82855d..a061f8832b20 100644 > --- a/arch/x86/include/asm/ftrace.h > +++ b/arch/x86/include/asm/ftrace.h > @@ -64,7 +64,7 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs) > regs_get_kernel_argument(&(fregs)->regs, n) > #define ftrace_regs_get_stack_pointer(fregs) \ > kernel_stack_pointer(&(fregs)->regs) > -#define ftrace_regs_return_value(fregs) \ > +#define ftrace_regs_get_return_value(fregs) \ > regs_return_value(&(fregs)->regs) > #define ftrace_regs_set_return_value(fregs, ret) \ > regs_set_return_value(&(fregs)->regs, ret) > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h > index 79875a00c02b..da2a23f5a9ed 100644 > --- a/include/linux/ftrace.h > +++ 
b/include/linux/ftrace.h > @@ -187,7 +187,7 @@ static __always_inline bool ftrace_regs_has_args(struct > ftrace_regs *fregs) > regs_get_kernel_argument(ftrace_get_regs(fregs), n) > #define ftrace_regs_get_stack_pointer(fregs) \ > kernel_stack_pointer(ftrace_get_regs(fregs)) > -#define ftrace_regs_return_value(fregs) \ > +#define ftrace_regs_get_return_value(fregs) \ > regs_return_value(ftrace_get_regs(fregs)) > #define ftrace_regs_set_return_value(fregs, ret) \ > regs_set_return_value(ftrace_get_regs(fregs), ret) > diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c > index 088432b695a6..9a60acaacc96 100644 > --- a/kernel/trace/fgraph.c > +++ b/kernel/trace/fgraph.c > @@ -783,7 +783,7 @@ static void fgraph_call_retfunc(struct ftrace_regs *fregs, > trace.rettime = trace_clock_local(); > #ifdef CONFIG_FUNCTION_GRAPH_RETVAL > if (fregs) > - trace.retval = ftrace_regs_return_value(fregs); > + trace.retval = ftrace_regs_get_return_value(fregs); > else > trace.retval = fgraph_ret_regs_return_value(ret_regs); > #endif >
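For illustration only, a caller of the renamed accessor could look like this (a hypothetical exit handler, not part of the series):

	static void my_exit_handler(struct ftrace_regs *fregs)
	{
		/* Read the traced function's return value through the accessor API. */
		unsigned long retval = ftrace_regs_get_return_value(fregs);

		pr_debug("traced function returned 0x%lx\n", retval);
	}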
Re: [PATCH v5 01/34] tracing: Add a comment about ftrace_regs definition
On Mon, Dec 18, 2023 at 10:11:44PM +0900, Masami Hiramatsu (Google) wrote: > From: Masami Hiramatsu (Google) > > To clarify what will be expected on ftrace_regs, add a comment to the > architecture independent definition of the ftrace_regs. > > Signed-off-by: Masami Hiramatsu (Google) Acked-by: Mark Rutland Mark. > --- > Changes in v3: > - Add instruction pointer > Changes in v2: > - newly added. > --- > include/linux/ftrace.h | 26 ++ > 1 file changed, 26 insertions(+) > > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h > index e8921871ef9a..8b48fc621ea0 100644 > --- a/include/linux/ftrace.h > +++ b/include/linux/ftrace.h > @@ -118,6 +118,32 @@ extern int ftrace_enabled; > > #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS > > +/** > + * ftrace_regs - ftrace partial/optimal register set > + * > + * ftrace_regs represents a group of registers which is used at the > + * function entry and exit. There are three types of registers. > + * > + * - Registers for passing the parameters to callee, including the stack > + * pointer. (e.g. rcx, rdx, rdi, rsi, r8, r9 and rsp on x86_64) > + * - Registers for passing the return values to caller. > + * (e.g. rax and rdx on x86_64) > + * - Registers for hooking the function call and return including the > + * frame pointer (the frame pointer is architecture/config dependent) > + * (e.g. rip, rbp and rsp for x86_64) > + * > + * Also, architecture dependent fields can be used for internal process. > + * (e.g. orig_ax on x86_64) > + * > + * On the function entry, those registers will be restored except for > + * the stack pointer, so that user can change the function parameters > + * and instruction pointer (e.g. live patching.) > + * On the function exit, only registers which is used for return values > + * are restored. > + * > + * NOTE: user *must not* access regs directly, only do it via APIs, because > + * the member can be changed according to the architecture. > + */ > struct ftrace_regs { > struct pt_regs regs; > }; >
Re: [PATCH v5 11/34] function_graph: Have the instances use their own ftrace_ops for filtering
On Mon, Dec 18, 2023 at 10:13:46PM +0900, Masami Hiramatsu (Google) wrote:
> From: Steven Rostedt (VMware)
>
> Allow for instances to have their own ftrace_ops part of the fgraph_ops that makes the function_graph tracer filter on the set_ftrace_filter file of the instance and not the top instance.
>
> This also changes how the function_graph handles multiple instances on the shadow stack. Previously we used ARRAY type entries to record which one is enabled, and this makes it a bitmap of the fgraph_array's indexes. Previously, function_graph_enter() expected a callback from prepare_ftrace_return(), which is called back only once if it is enabled. But this introduces different ftrace_ops for each fgraph instance, and those are called from ftrace_graph_func() one by one. Thus we cannot loop on the fgraph_array(), and need to reuse the ret_stack pushed by the previous instance. Finding the ret_stack is easy because we can check ret_stack->func. But that is not enough for the self-recursive tail-call case. Thus fgraph uses the bitmap entry to check whether it is already set (this means that entry is for a previous tail call).
>
> Signed-off-by: Steven Rostedt (VMware)
> Signed-off-by: Masami Hiramatsu (Google)

As a heads-up, while testing the topic/fprobe-on-fgraph branch on arm64, I get a warning which bisects down to this commit:

| Testing tracer function_graph:
| [ cut here ]
| WARNING: CPU: 2 PID: 0 at arch/arm64/kernel/stacktrace.c:84 arch_stack_walk+0x3c0/0x3d8
| Modules linked in:
| CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.7.0-rc2-00026-gea1e68a341c2 #12
| Hardware name: linux,dummy-virt (DT)
| pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x3c0/0x3d8
| lr : arch_stack_walk+0x260/0x3d8
| sp : 80008318be00
| x29: 80008318be00 x28: 03c0ae80 x27:
| x26: x25: 03c0ae80 x24:
| x23: 8000800234c8 x22: 80008002dc30 x21: 800080035d10
| x20: 80008318bee8 x19: 800080023460 x18: 800083453c68
| x17: x16: 800083188000 x15: 8ccc5058
| x14: 0004 x13: 800082b8c4f0 x12:
| x11: 800081fba9b0 x10: 80008318bff0 x9 : 800080010798
| x8 : 80008002dc30 x7 : 03c0ae80 x6 :
| x5 : x4 : 8000832a3c18 x3 : 80008318bff0
| x2 : 80008002dc30 x1 : 80008002dc30 x0 : 80008002dc30
| Call trace:
| arch_stack_walk+0x3c0/0x3d8
| return_address+0x40/0x80
| trace_hardirqs_on+0x8c/0x198
| __do_softirq+0xe8/0x440
| ---[ end trace ]---

That's a warning in arm64's unwind_recover_return_address() function, which fires when ftrace_graph_ret_addr() finds return_to_handler:

	if (state->task->ret_stack &&
	    (state->pc == (unsigned long)return_to_handler)) {
		unsigned long orig_pc;

		orig_pc = ftrace_graph_ret_addr(state->task, NULL, state->pc,
						(void *)state->fp);
		if (WARN_ON_ONCE(state->pc == orig_pc))
			return -EINVAL;
		state->pc = orig_pc;
	}

The rationale there is that since tail calls are (currently) disabled on arm64, the only reason for ftrace_graph_ret_addr() to return return_to_handler is when it fails to find the original return address.

Does this change make it legitimate for ftrace_graph_ret_addr() to return return_to_handler in other cases, or is that a bug? Either way, we'll need *some* way to recover the original return address...

Mark.

> ---
> Changes in v4:
> - Simplify get_ret_stack() sanity check and use WARN_ON_ONCE() for obviously wrong value.
> - Do not check ret == return_to_handler but always read the previous ret_stack in ftrace_push_return_trace() to check it is reusable.
> - Set the bit 0 of the bitmap entry always in function_graph_enter() > because it uses bit 0 to check re-usability. > - Fix to ensure the ret_stack entry is bitmap type when checking the > bitmap. > Changes in v3: > - Pass current fgraph_ops to the new entry handler >(function_graph_enter_ops) if fgraph use ftrace. > - Add fgraph_ops::idx in this patch. > - Replace the array type with the bitmap type so that it can record > which fgraph is called. > - Fix some helper function to use passed task_struct instead of current. > - Reduce the ret-index size to 1024 words. > - Make the ret-index directly points the ret_stack. > - Fix ftrace_graph_ret_addr() to handle tail-call case correctly. > Changes in v2: > - Use ftrace_graph_func and FTRACE_OPS_GRAPH_STUB instead of > ftrace_stub and FTRACE_OPS_FL_STUB for new ftrace based fgraph. > --- > arch/arm64/kernel/ftrace.c | 19 ++ > arch/x86/kernel/ftrace.c | 19 ++ > include/linux/ftrace.h |7 + > kernel/tra
Re: [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill()
On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote: > The page frag in vhost_net_page_frag_refill() uses the > 'struct page_frag' from skb_page_frag_refill(), but it's > implementation is similar to page_frag_alloc_align() now. > > This patch removes vhost_net_page_frag_refill() by using > 'struct page_frag_cache' instead of 'struct page_frag', > and allocating frag using page_frag_alloc_align(). > > The added benefit is that not only unifying the page frag > implementation a little, but also having about 0.5% performance > boost testing by using the vhost_net_test introduced in the > last patch. > > Signed-off-by: Yunsheng Lin > Acked-by: Jason Wang > --- > drivers/vhost/net.c | 93 ++--- > 1 file changed, 29 insertions(+), 64 deletions(-) > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > index e574e21cc0ca..805e11d598e4 100644 > --- a/drivers/vhost/net.c > +++ b/drivers/vhost/net.c > @@ -141,10 +141,8 @@ struct vhost_net { > unsigned tx_zcopy_err; > /* Flush in progress. Protected by tx vq lock. */ > bool tx_flush; > - /* Private page frag */ > - struct page_frag page_frag; > - /* Refcount bias of page frag */ > - int refcnt_bias; > + /* Private page frag cache */ > + struct page_frag_cache pf_cache; > }; > > static unsigned vhost_net_zcopy_mask __read_mostly; > @@ -655,41 +653,6 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, > size_t total_len) > !vhost_vq_avail_empty(vq->dev, vq); > } > > -static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int > sz, > -struct page_frag *pfrag, gfp_t gfp) > -{ > - if (pfrag->page) { > - if (pfrag->offset + sz <= pfrag->size) > - return true; > - __page_frag_cache_drain(pfrag->page, net->refcnt_bias); > - } > - > - pfrag->offset = 0; > - net->refcnt_bias = 0; > - if (SKB_FRAG_PAGE_ORDER) { > - /* Avoid direct reclaim but allow kswapd to wake */ > - pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | > - __GFP_COMP | __GFP_NOWARN | > - __GFP_NORETRY | __GFP_NOMEMALLOC, > - SKB_FRAG_PAGE_ORDER); > - if (likely(pfrag->page)) { > - pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; > - goto done; > - } > - } > - pfrag->page = alloc_page(gfp); > - if (likely(pfrag->page)) { > - pfrag->size = PAGE_SIZE; > - goto done; > - } > - return false; > - > -done: > - net->refcnt_bias = USHRT_MAX; > - page_ref_add(pfrag->page, USHRT_MAX - 1); > - return true; > -} > - > #define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD) > > static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, > @@ -699,7 +662,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue > *nvq, > struct vhost_net *net = container_of(vq->dev, struct vhost_net, >dev); > struct socket *sock = vhost_vq_get_backend(vq); > - struct page_frag *alloc_frag = &net->page_frag; > struct virtio_net_hdr *gso; > struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp]; > struct tun_xdp_hdr *hdr; > @@ -710,6 +672,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue > *nvq, > int sock_hlen = nvq->sock_hlen; > void *buf; > int copied; > + int ret; > > if (unlikely(len < nvq->sock_hlen)) > return -EFAULT; > @@ -719,18 +682,17 @@ static int vhost_net_build_xdp(struct > vhost_net_virtqueue *nvq, > return -ENOSPC; > > buflen += SKB_DATA_ALIGN(len + pad); > - alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES); > - if (unlikely(!vhost_net_page_frag_refill(net, buflen, > - alloc_frag, GFP_KERNEL))) > + buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL, > + SMP_CACHE_BYTES); If your changes from patch 1 are just to make it fit into 
this layout, might I suggest just splitting up page_frag_alloc_align into an inline that accepts the arguments you have here, and adding __page_frag_alloc_align which is passed the mask the original function expected. By doing that you should be able to maintain the same level of performance and still get most of the code cleanup. A sketch of this split follows below. > + if (unlikely(!buf)) > return -ENOMEM; > > - buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; > - copied = copy_page_from_iter(alloc_frag->page, > - alloc_frag->offset + > - offsetof(struct tun_xdp_hdr, gso), > - sock_hlen, from); > - if (copied != soc
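A sketch of the split suggested above, under the assumption that __page_frag_alloc_align() keeps the mask-based fast path while an inline wrapper converts a power-of-two alignment into the mask:

	void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz,
				      gfp_t gfp_mask, unsigned int align_mask);

	static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
						  unsigned int fragsz, gfp_t gfp_mask,
						  unsigned int align)
	{
		WARN_ON_ONCE(!is_power_of_2(align));
		/* A power-of-two alignment folds into the mask the fast path expects. */
		return __page_frag_alloc_align(nc, fragsz, gfp_mask, -align);
	}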
[PATCH v2 5/5] media: venus: Convert to dev_pm_domain_attach|detach_list() for vcodec
Let's avoid some of the boilerplate code to manage the vcodec PM domains, by converting into using dev_pm_domain_attach|detach_list(). Cc: Mauro Carvalho Chehab Cc: Stanimir Varbanov Cc: Vikash Garodia Cc: "Bryan O'Donoghue" Cc: Bjorn Andersson Cc: Konrad Dybcio Cc: Tested-by: Bryan O'Donoghue Reviewed-by: Bryan O'Donoghue Signed-off-by: Ulf Hansson --- Changes in v2: - Added tags Bryan's tags. --- drivers/media/platform/qcom/venus/core.c | 12 +++-- drivers/media/platform/qcom/venus/core.h | 7 ++- .../media/platform/qcom/venus/pm_helpers.c| 48 +++ 3 files changed, 26 insertions(+), 41 deletions(-) diff --git a/drivers/media/platform/qcom/venus/core.c b/drivers/media/platform/qcom/venus/core.c index 9cffe975581b..bd9b474280e4 100644 --- a/drivers/media/platform/qcom/venus/core.c +++ b/drivers/media/platform/qcom/venus/core.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -114,7 +115,8 @@ static void venus_sys_error_handler(struct work_struct *work) pm_runtime_put_sync(core->dev); for (i = 0; i < max_attempts; i++) { - if (!core->pmdomains[0] || !pm_runtime_active(core->pmdomains[0])) + if (!core->pmdomains || + !pm_runtime_active(core->pmdomains->pd_devs[0])) break; usleep_range(1000, 1500); } @@ -705,7 +707,7 @@ static const struct venus_resources sdm845_res_v2 = { .vcodec0_clks = { "vcodec0_core", "vcodec0_bus" }, .vcodec1_clks = { "vcodec1_core", "vcodec1_bus" }, .vcodec_clks_num = 2, - .vcodec_pmdomains = { "venus", "vcodec0", "vcodec1" }, + .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0", "vcodec1" }, .vcodec_pmdomains_num = 3, .opp_pmdomain = (const char *[]) { "cx", NULL }, .vcodec_num = 2, @@ -754,7 +756,7 @@ static const struct venus_resources sc7180_res = { .clks_num = 3, .vcodec0_clks = { "vcodec0_core", "vcodec0_bus" }, .vcodec_clks_num = 2, - .vcodec_pmdomains = { "venus", "vcodec0" }, + .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0" }, .vcodec_pmdomains_num = 2, .opp_pmdomain = (const char *[]) { "cx", NULL }, .vcodec_num = 1, @@ -811,7 +813,7 @@ static const struct venus_resources sm8250_res = { .resets_num = 2, .vcodec0_clks = { "vcodec0_core" }, .vcodec_clks_num = 1, - .vcodec_pmdomains = { "venus", "vcodec0" }, + .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0" }, .vcodec_pmdomains_num = 2, .opp_pmdomain = (const char *[]) { "mx", NULL }, .vcodec_num = 1, @@ -870,7 +872,7 @@ static const struct venus_resources sc7280_res = { .clks_num = 3, .vcodec0_clks = {"vcodec_core", "vcodec_bus"}, .vcodec_clks_num = 2, - .vcodec_pmdomains = { "venus", "vcodec0" }, + .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0" }, .vcodec_pmdomains_num = 2, .opp_pmdomain = (const char *[]) { "cx", NULL }, .vcodec_num = 1, diff --git a/drivers/media/platform/qcom/venus/core.h b/drivers/media/platform/qcom/venus/core.h index 4a633261ece4..7ef341bf21cc 100644 --- a/drivers/media/platform/qcom/venus/core.h +++ b/drivers/media/platform/qcom/venus/core.h @@ -25,7 +25,6 @@ #define VIDC_CLKS_NUM_MAX 4 #define VIDC_VCODEC_CLKS_NUM_MAX 2 -#define VIDC_PMDOMAINS_NUM_MAX 3 #define VIDC_RESETS_NUM_MAX2 extern int venus_fw_debug; @@ -72,7 +71,7 @@ struct venus_resources { const char * const vcodec0_clks[VIDC_VCODEC_CLKS_NUM_MAX]; const char * const vcodec1_clks[VIDC_VCODEC_CLKS_NUM_MAX]; unsigned int vcodec_clks_num; - const char * const vcodec_pmdomains[VIDC_PMDOMAINS_NUM_MAX]; + const char **vcodec_pmdomains; unsigned int vcodec_pmdomains_num; const char **opp_pmdomain; unsigned int vcodec_num; @@ -134,7 +133,7 @@ struct venus_format 
{ * @video_path: an interconnect handle to video to/from memory path * @cpucfg_path: an interconnect handle to cpu configuration path * @has_opp_table: does OPP table exist - * @pmdomains: an array of pmdomains struct device pointers + * @pmdomains: a pointer to a list of pmdomains * @opp_dl_venus: an device-link for device OPP * @opp_pmdomain: an OPP power-domain * @resets: an array of reset signals @@ -187,7 +186,7 @@ struct venus_core { struct icc_path *video_path; struct icc_path *cpucfg_path; bool has_opp_table; - struct device *pmdomains[VIDC_PMDOMAINS_NUM_MAX]; + struct dev_pm_domain_list *pmdomains; struct device_link *opp_dl_venus; struct device *opp_pmdomain; struct reset_control *resets[VIDC_RESETS_NUM_MAX]; diff --git a/drivers/media/platform/qcom/venus/pm_helpers.c b/drivers/media/platform/qcom/venus/pm_helpers.c i
[PATCH v2 4/5] remoteproc: qcom_q6v5_adsp: Convert to dev_pm_domain_attach|detach_list()
Let's avoid some of the boilerplate code to manage the various PM domain cases, by converting into using dev_pm_domain_attach|detach_list(). As a part of the conversion, we are moving over to use device_links, which simplifies the runtime PM support too. Moreover, while attaching let's trust that an already attached single PM domain is the correct one. Cc: Mathieu Poirier Cc: Bjorn Andersson Cc: Konrad Dybcio Cc: Signed-off-by: Ulf Hansson --- Changes in v2: - None. Kind regards Ulf Hansson --- drivers/remoteproc/qcom_q6v5_adsp.c | 160 +--- 1 file changed, 73 insertions(+), 87 deletions(-) diff --git a/drivers/remoteproc/qcom_q6v5_adsp.c b/drivers/remoteproc/qcom_q6v5_adsp.c index 6c67514cc493..93f9a1537ec6 100644 --- a/drivers/remoteproc/qcom_q6v5_adsp.c +++ b/drivers/remoteproc/qcom_q6v5_adsp.c @@ -55,8 +55,6 @@ #define QDSP6SS_CORE_CBCR 0x20 #define QDSP6SS_SLEEP_CBCR 0x3c -#define QCOM_Q6V5_RPROC_PROXY_PD_MAX 3 - #define LPASS_BOOT_CORE_START BIT(0) #define LPASS_BOOT_CMD_START BIT(0) #define LPASS_EFUSE_Q6SS_EVB_SEL 0x0 @@ -74,7 +72,8 @@ struct adsp_pil_data { const char **clk_ids; int num_clks; - const char **proxy_pd_names; + const char **pd_names; + unsigned int num_pds; const char *load_state; }; @@ -110,8 +109,7 @@ struct qcom_adsp { size_t mem_size; bool has_iommu; - struct device *proxy_pds[QCOM_Q6V5_RPROC_PROXY_PD_MAX]; - size_t proxy_pd_count; + struct dev_pm_domain_list *pd_list; struct qcom_rproc_glink glink_subdev; struct qcom_rproc_ssr ssr_subdev; @@ -120,98 +118,92 @@ struct qcom_adsp { int (*shutdown)(struct qcom_adsp *adsp); }; -static int qcom_rproc_pds_attach(struct device *dev, struct qcom_adsp *adsp, -const char **pd_names) +static int qcom_rproc_pds_attach(struct qcom_adsp *adsp, const char **pd_names, +unsigned int num_pds) { - struct device **devs = adsp->proxy_pds; - size_t num_pds = 0; + struct device *dev = adsp->dev; + struct dev_pm_domain_attach_data pd_data = { + .pd_names = pd_names, + .num_pd_names = num_pds, + }; int ret; - int i; - - if (!pd_names) - return 0; /* Handle single power domain */ - if (dev->pm_domain) { - devs[0] = dev; - pm_runtime_enable(dev); - return 1; - } + if (dev->pm_domain) + goto out; - while (pd_names[num_pds]) - num_pds++; + if (!pd_names) + return 0; - if (num_pds > ARRAY_SIZE(adsp->proxy_pds)) - return -E2BIG; + ret = dev_pm_domain_attach_list(dev, &pd_data, &adsp->pd_list); + if (ret < 0) + return ret; - for (i = 0; i < num_pds; i++) { - devs[i] = dev_pm_domain_attach_by_name(dev, pd_names[i]); - if (IS_ERR_OR_NULL(devs[i])) { - ret = PTR_ERR(devs[i]) ? 
: -ENODATA; - goto unroll_attach; - } - } +out: + pm_runtime_enable(dev); + return 0; +} - return num_pds; +static void qcom_rproc_pds_detach(struct qcom_adsp *adsp) +{ + struct device *dev = adsp->dev; + struct dev_pm_domain_list *pds = adsp->pd_list; -unroll_attach: - for (i--; i >= 0; i--) - dev_pm_domain_detach(devs[i], false); + dev_pm_domain_detach_list(pds); - return ret; + if (dev->pm_domain || pds) + pm_runtime_disable(adsp->dev); } -static void qcom_rproc_pds_detach(struct qcom_adsp *adsp, struct device **pds, - size_t pd_count) +static int qcom_rproc_pds_enable(struct qcom_adsp *adsp) { struct device *dev = adsp->dev; - int i; + struct dev_pm_domain_list *pds = adsp->pd_list; + int ret, i = 0; - /* Handle single power domain */ - if (dev->pm_domain && pd_count) { - pm_runtime_disable(dev); - return; - } + if (!dev->pm_domain && !pds) + return 0; - for (i = 0; i < pd_count; i++) - dev_pm_domain_detach(pds[i], false); -} + if (dev->pm_domain) + dev_pm_genpd_set_performance_state(dev, INT_MAX); -static int qcom_rproc_pds_enable(struct qcom_adsp *adsp, struct device **pds, -size_t pd_count) -{ - int ret; - int i; - - for (i = 0; i < pd_count; i++) { - dev_pm_genpd_set_performance_state(pds[i], INT_MAX); - ret = pm_runtime_resume_and_get(pds[i]); - if (ret < 0) { - dev_pm_genpd_set_performance_state(pds[i], 0); - goto unroll_pd_votes; - } + while (pds && i < pds->num_pds) { +
[PATCH v2 3/5] remoteproc: imx_rproc: Convert to dev_pm_domain_attach|detach_list()
Let's avoid the boilerplate code to manage the multiple PM domain case, by converting into using dev_pm_domain_attach|detach_list(). Cc: Mathieu Poirier Cc: Bjorn Andersson Cc: Shawn Guo Cc: Sascha Hauer Cc: Iuliana Prodan Cc: Daniel Baluta Cc: Signed-off-by: Ulf Hansson --- Changes in v2: - None. Iuliana/Daniel I am ccing you to request help with test/review of this change. Note that, you will need patch 1/5 in the series too, to be able to test this. Kind regards Ulf Hansson --- drivers/remoteproc/imx_rproc.c | 73 +- 1 file changed, 9 insertions(+), 64 deletions(-) diff --git a/drivers/remoteproc/imx_rproc.c b/drivers/remoteproc/imx_rproc.c index 8bb293b9f327..3161f14442bc 100644 --- a/drivers/remoteproc/imx_rproc.c +++ b/drivers/remoteproc/imx_rproc.c @@ -92,7 +92,6 @@ struct imx_rproc_mem { static int imx_rproc_xtr_mbox_init(struct rproc *rproc); static void imx_rproc_free_mbox(struct rproc *rproc); -static int imx_rproc_detach_pd(struct rproc *rproc); struct imx_rproc { struct device *dev; @@ -113,10 +112,8 @@ struct imx_rproc { u32 rproc_pt; /* partition id */ u32 rsrc_id;/* resource id */ u32 entry; /* cpu start address */ - int num_pd; u32 core_index; - struct device **pd_dev; - struct device_link **pd_dev_link; + struct dev_pm_domain_list *pd_list; }; static const struct imx_rproc_att imx_rproc_att_imx93[] = { @@ -853,7 +850,7 @@ static void imx_rproc_put_scu(struct rproc *rproc) return; if (imx_sc_rm_is_resource_owned(priv->ipc_handle, priv->rsrc_id)) { - imx_rproc_detach_pd(rproc); + dev_pm_domain_detach_list(priv->pd_list); return; } @@ -880,72 +877,20 @@ static int imx_rproc_partition_notify(struct notifier_block *nb, static int imx_rproc_attach_pd(struct imx_rproc *priv) { struct device *dev = priv->dev; - int ret, i; - - /* -* If there is only one power-domain entry, the platform driver framework -* will handle it, no need handle it in this driver. -*/ - priv->num_pd = of_count_phandle_with_args(dev->of_node, "power-domains", - "#power-domain-cells"); - if (priv->num_pd <= 1) - return 0; - - priv->pd_dev = devm_kmalloc_array(dev, priv->num_pd, sizeof(*priv->pd_dev), GFP_KERNEL); - if (!priv->pd_dev) - return -ENOMEM; - - priv->pd_dev_link = devm_kmalloc_array(dev, priv->num_pd, sizeof(*priv->pd_dev_link), - GFP_KERNEL); - - if (!priv->pd_dev_link) - return -ENOMEM; - - for (i = 0; i < priv->num_pd; i++) { - priv->pd_dev[i] = dev_pm_domain_attach_by_id(dev, i); - if (IS_ERR(priv->pd_dev[i])) { - ret = PTR_ERR(priv->pd_dev[i]); - goto detach_pd; - } - - priv->pd_dev_link[i] = device_link_add(dev, priv->pd_dev[i], DL_FLAG_STATELESS | - DL_FLAG_PM_RUNTIME | DL_FLAG_RPM_ACTIVE); - if (!priv->pd_dev_link[i]) { - dev_pm_domain_detach(priv->pd_dev[i], false); - ret = -EINVAL; - goto detach_pd; - } - } - - return 0; - -detach_pd: - while (--i >= 0) { - device_link_del(priv->pd_dev_link[i]); - dev_pm_domain_detach(priv->pd_dev[i], false); - } - - return ret; -} - -static int imx_rproc_detach_pd(struct rproc *rproc) -{ - struct imx_rproc *priv = rproc->priv; - int i; + int ret; + struct dev_pm_domain_attach_data pd_data = { + .pd_flags = PD_FLAG_DEV_LINK_ON, + }; /* * If there is only one power-domain entry, the platform driver framework * will handle it, no need handle it in this driver. */ - if (priv->num_pd <= 1) + if (dev->pm_domain) return 0; - for (i = 0; i < priv->num_pd; i++) { - device_link_del(priv->pd_dev_link[i]); - dev_pm_domain_detach(priv->pd_dev[i], false); - } - - return 0; + ret = dev_pm_domain_attach_list(dev, &pd_data, &priv->pd_list); + return ret < 0 ? 
ret : 0; } static int imx_rproc_detect_mode(struct imx_rproc *priv) -- 2.34.1
[PATCH v2 2/5] remoteproc: imx_dsp_rproc: Convert to dev_pm_domain_attach|detach_list()
Let's avoid the boilerplate code to manage the multiple PM domain case, by converting into using dev_pm_domain_attach|detach_list(). Cc: Mathieu Poirier Cc: Bjorn Andersson Cc: Shawn Guo Cc: Sascha Hauer Cc: Iuliana Prodan Cc: Daniel Baluta Cc: Signed-off-by: Ulf Hansson --- Changes in v2: - None. Iuliana/Daniel I am ccing you to request help with test/review of this change. Note that, you will need patch 1/5 in the series too, to be able to test this. Kind regards Ulf Hansson --- drivers/remoteproc/imx_dsp_rproc.c | 82 -- 1 file changed, 9 insertions(+), 73 deletions(-) diff --git a/drivers/remoteproc/imx_dsp_rproc.c b/drivers/remoteproc/imx_dsp_rproc.c index 8fcda9b74545..0409b7c47d5c 100644 --- a/drivers/remoteproc/imx_dsp_rproc.c +++ b/drivers/remoteproc/imx_dsp_rproc.c @@ -103,12 +103,10 @@ enum imx_dsp_rp_mbox_messages { * @tx_ch: mailbox tx channel handle * @rx_ch: mailbox rx channel handle * @rxdb_ch: mailbox rx doorbell channel handle - * @pd_dev: power domain device - * @pd_dev_link: power domain device link + * @pd_list: power domain list * @ipc_handle: System Control Unit ipc handle * @rproc_work: work for processing virtio interrupts * @pm_comp: completion primitive to sync for suspend response - * @num_domains: power domain number * @flags: control flags */ struct imx_dsp_rproc { @@ -121,12 +119,10 @@ struct imx_dsp_rproc { struct mbox_chan*tx_ch; struct mbox_chan*rx_ch; struct mbox_chan*rxdb_ch; - struct device **pd_dev; - struct device_link **pd_dev_link; + struct dev_pm_domain_list *pd_list; struct imx_sc_ipc *ipc_handle; struct work_struct rproc_work; struct completion pm_comp; - int num_domains; u32 flags; }; @@ -954,74 +950,14 @@ static const struct rproc_ops imx_dsp_rproc_ops = { static int imx_dsp_attach_pm_domains(struct imx_dsp_rproc *priv) { struct device *dev = priv->rproc->dev.parent; - int ret, i; - - priv->num_domains = of_count_phandle_with_args(dev->of_node, - "power-domains", - "#power-domain-cells"); - - /* If only one domain, then no need to link the device */ - if (priv->num_domains <= 1) - return 0; - - priv->pd_dev = devm_kmalloc_array(dev, priv->num_domains, - sizeof(*priv->pd_dev), - GFP_KERNEL); - if (!priv->pd_dev) - return -ENOMEM; - - priv->pd_dev_link = devm_kmalloc_array(dev, priv->num_domains, - sizeof(*priv->pd_dev_link), - GFP_KERNEL); - if (!priv->pd_dev_link) - return -ENOMEM; - - for (i = 0; i < priv->num_domains; i++) { - priv->pd_dev[i] = dev_pm_domain_attach_by_id(dev, i); - if (IS_ERR(priv->pd_dev[i])) { - ret = PTR_ERR(priv->pd_dev[i]); - goto detach_pm; - } - - /* -* device_link_add will check priv->pd_dev[i], if it is -* NULL, then will break. -*/ - priv->pd_dev_link[i] = device_link_add(dev, - priv->pd_dev[i], - DL_FLAG_STATELESS | - DL_FLAG_PM_RUNTIME); - if (!priv->pd_dev_link[i]) { - dev_pm_domain_detach(priv->pd_dev[i], false); - ret = -EINVAL; - goto detach_pm; - } - } - - return 0; - -detach_pm: - while (--i >= 0) { - device_link_del(priv->pd_dev_link[i]); - dev_pm_domain_detach(priv->pd_dev[i], false); - } - - return ret; -} - -static int imx_dsp_detach_pm_domains(struct imx_dsp_rproc *priv) -{ - int i; + int ret; - if (priv->num_domains <= 1) + /* A single PM domain is already attached. */ + if (dev->pm_domain) return 0; - for (i = 0; i < priv->num_domains; i++) { - device_link_del(priv->pd_dev_link[i]); - dev_pm_domain_detach(priv->pd_dev[i], false); - } - - return 0; + ret = dev_pm_domain_attach_list(dev, NULL, &priv->pd_list); + return ret < 0 ? 
ret : 0; } /** @@ -1153,7 +1089,7 @@ static int imx_dsp_rproc_probe(struct platform_device *pdev) return 0; err_detach_
[PATCH v2 1/5] PM: domains: Add helper functions to attach/detach multiple PM domains
Attaching/detaching of a device to multiple PM domains has started to become a common operation for many drivers, typically during ->probe() and ->remove(). In most cases, this has led to lots of boilerplate code in the drivers.

To fix up the situation, let's introduce a pair of helper functions, dev_pm_domain_attach|detach_list(), that drivers can use instead of open-coding. Note that it seems reasonable to limit the support for these helpers to DT based platforms, as it's the only valid use case for now.

Signed-off-by: Ulf Hansson
---
Changes in v2:
- Fix NULL pointer bug pointed out by Nikunj.
---
drivers/base/power/common.c | 134
include/linux/pm_domain.h | 38 ++
2 files changed, 172 insertions(+)

diff --git a/drivers/base/power/common.c b/drivers/base/power/common.c index 44ec20918a4d..327d168dd37a 100644 --- a/drivers/base/power/common.c +++ b/drivers/base/power/common.c @@ -167,6 +167,115 @@ struct device *dev_pm_domain_attach_by_name(struct device *dev, } EXPORT_SYMBOL_GPL(dev_pm_domain_attach_by_name); +/** + * dev_pm_domain_attach_list - Associate a device with its PM domains. + * @dev: The device used to lookup the PM domains for. + * @data: The data used for attaching to the PM domains. + * @list: An out-parameter with an allocated list of attached PM domains. + * + * This function helps to attach a device to its multiple PM domains. The + * caller, which is typically a driver's probe function, may provide a list of + * names for the PM domains that we should try to attach the device to, but it + * may also provide an empty list, in case the attach should be done for all of + * the available PM domains. + * + * Callers must ensure proper synchronization of this function with power + * management callbacks. + * + * Returns the number of attached PM domains or a negative error code in case of + * a failure. Note that, to detach the list of PM domains, the driver shall call + * dev_pm_domain_detach_list(), typically during the remove phase. + */ +int dev_pm_domain_attach_list(struct device *dev, + const struct dev_pm_domain_attach_data *data, + struct dev_pm_domain_list **list) +{ + struct device_node *np = dev->of_node; + struct dev_pm_domain_list *pds; + struct device *pd_dev = NULL; + int ret, i, num_pds = 0; + bool by_id = true; + u32 pd_flags = data ? data->pd_flags : 0; + u32 link_flags = pd_flags & PD_FLAG_NO_DEV_LINK ? 0 : + DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME; + + if (dev->pm_domain) + return -EEXIST; + + /* For now this is limited to OF based platforms. */ + if (!np) + return 0; + + if (data && data->pd_names) { + num_pds = data->num_pd_names; + by_id = false; + } else { + num_pds = of_count_phandle_with_args(np, "power-domains", +"#power-domain-cells"); + } + + if (num_pds <= 0) + return 0; + + pds = devm_kzalloc(dev, sizeof(*pds), GFP_KERNEL); + if (!pds) + return -ENOMEM; + + pds->pd_devs = devm_kcalloc(dev, num_pds, sizeof(*pds->pd_devs), + GFP_KERNEL); + if (!pds->pd_devs) + return -ENOMEM; + + pds->pd_links = devm_kcalloc(dev, num_pds, sizeof(*pds->pd_links), +GFP_KERNEL); + if (!pds->pd_links) + return -ENOMEM; + + if (link_flags && pd_flags & PD_FLAG_DEV_LINK_ON) + link_flags |= DL_FLAG_RPM_ACTIVE; + + for (i = 0; i < num_pds; i++) { + if (by_id) + pd_dev = dev_pm_domain_attach_by_id(dev, i); + else + pd_dev = dev_pm_domain_attach_by_name(dev, + data->pd_names[i]); + if (IS_ERR_OR_NULL(pd_dev)) { + ret = pd_dev ?
PTR_ERR(pd_dev) : -ENODEV; + goto err_attach; + } + + if (link_flags) { + struct device_link *link; + + link = device_link_add(dev, pd_dev, link_flags); + if (!link) { + ret = -ENODEV; + goto err_link; + } + + pds->pd_links[i] = link; + } + + pds->pd_devs[i] = pd_dev; + } + + pds->num_pds = num_pds; + *list = pds; + return num_pds; + +err_link: + dev_pm_domain_detach(pd_dev, true); +err_attach: + while (--i >= 0) { + if (pds->pd_links[i]) + device_link_del(pds->pd_links[i]); + dev_pm_domain_detach(pds->pd_devs[i], true); + } + + return ret; +}
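For illustration, a minimal consumer of the new helpers might look as follows. This is a sketch: the driver name, the domain names "core" and "mem", and the use of drvdata are hypothetical, and the dev_pm_domain_detach_list() signature is assumed from the kerneldoc above (its hunk is not quoted here).

#include <linux/platform_device.h>
#include <linux/pm_domain.h>

static int foo_probe(struct platform_device *pdev)
{
        /* Hypothetical domain names; a real driver takes them from its binding */
        static const char * const foo_pd_names[] = { "core", "mem" };
        struct dev_pm_domain_attach_data pd_data = {
                .pd_names = foo_pd_names,
                .num_pd_names = ARRAY_SIZE(foo_pd_names),
                .pd_flags = PD_FLAG_DEV_LINK_ON,
        };
        struct dev_pm_domain_list *pd_list;
        int ret;

        /* Returns the number of attached PM domains, 0 if none, or -errno */
        ret = dev_pm_domain_attach_list(&pdev->dev, &pd_data, &pd_list);
        if (ret < 0)
                return ret;

        platform_set_drvdata(pdev, pd_list);
        return 0;
}

static void foo_remove(struct platform_device *pdev)
{
        struct dev_pm_domain_list *pd_list = platform_get_drvdata(pdev);

        dev_pm_domain_detach_list(pd_list);
}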
[PATCH v2 0/5] PM: domains: Add helpers for multi PM domains to avoid open-coding
Updates in v2: - Ccing Daniel Baluta and Iuliana Prodan on the NXP remoteproc patches to request help with testing. - Fixed NULL pointer bug in patch1, pointed out by Nikunj. - Added some tested/reviewed-by tags. Attaching/detaching of a device to multiple PM domains has started to become a common operation for many drivers, typically during ->probe() and ->remove(). In most cases, this has led to lots of boilerplate code in the drivers. This series adds a pair of helper functions to manage the attach/detach of a device to its multiple PM domains. Moreover, a couple of drivers have been converted to use the new helpers as a proof of concept. Note 1) The changes in the drivers have only been compile tested, while the helpers have been tested along with a couple of local dummy drivers that I have hacked up to model both genpd providers and genpd consumers. Note 2) I was struggling to make up my mind if we should have a separate helper to attach all available power-domains described in DT, rather than providing "NULL" to the dev_pm_domain_attach_list(). I decided not to, but please let me know if you prefer the other option. Note 3) For OPP integration, as a follow up I am striving to make the dev_pm_opp_attach_genpd() redundant. Instead I think we should move towards using dev_pm_opp_set_config()->_opp_set_required_devs(), which would allow us to use the helpers that $subject series is adding. Kind regards Ulf Hansson Ulf Hansson (5): PM: domains: Add helper functions to attach/detach multiple PM domains remoteproc: imx_dsp_rproc: Convert to dev_pm_domain_attach|detach_list() remoteproc: imx_rproc: Convert to dev_pm_domain_attach|detach_list() remoteproc: qcom_q6v5_adsp: Convert to dev_pm_domain_attach|detach_list() media: venus: Convert to dev_pm_domain_attach|detach_list() for vcodec drivers/base/power/common.c | 134 +++ drivers/media/platform/qcom/venus/core.c | 12 +- drivers/media/platform/qcom/venus/core.h | 7 +- .../media/platform/qcom/venus/pm_helpers.c| 48 ++ drivers/remoteproc/imx_dsp_rproc.c| 82 + drivers/remoteproc/imx_rproc.c| 73 +--- drivers/remoteproc/qcom_q6v5_adsp.c | 160 -- include/linux/pm_domain.h | 38 + 8 files changed, 289 insertions(+), 265 deletions(-) -- 2.34.1
Re: [PATCH net-next 5/6] net: introduce page_frag_cache_drain()
On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote: > When draining a page_frag_cache, most users are doing > similar steps, so introduce an API to avoid code > duplication. > > Signed-off-by: Yunsheng Lin > Acked-by: Jason Wang Looks good to me. Reviewed-by: Alexander Duyck
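For context, the consolidated helper most likely takes this shape, reconstructed here as a sketch from the duplicated steps the callers share (check the va, drain the page, clear the cache); the exact body lives in the patch, which is not quoted in this reply:

void page_frag_cache_drain(struct page_frag_cache *nc)
{
        if (!nc->va)
                return;

        __page_frag_cache_drain(virt_to_head_page(nc->va), nc->pagecnt_bias);
        nc->va = NULL;
}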
Re: [PATCH net-next 2/6] page_frag: unify gfp bits for order 3 page allocation
On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote: > Currently there seem to be three page frag implementations > which all try to allocate an order 3 page; if that fails, they > then fall back to allocating an order 0 page, and each of them > allows the order 3 page allocation to fail under certain > conditions by using specific gfp bits. > > The gfp bits for order 3 page allocation are different > between the different implementations, __GFP_NOMEMALLOC is > or'd to forbid access to emergency reserves memory for > __page_frag_cache_refill(), but it is not or'd in other > implementations, __GFP_DIRECT_RECLAIM is masked off to avoid > direct reclaim in skb_page_frag_refill(), but it is not > masked off in __page_frag_cache_refill(). > > This patch unifies the gfp bits used between the different > implementations by or'ing __GFP_NOMEMALLOC and masking off > __GFP_DIRECT_RECLAIM for order 3 page allocation to avoid > possible pressure on mm. > > Signed-off-by: Yunsheng Lin > CC: Alexander Duyck > --- > drivers/vhost/net.c | 2 +- > mm/page_alloc.c | 4 ++-- > net/core/sock.c | 2 +- > 3 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > index f2ed7167c848..e574e21cc0ca 100644 > --- a/drivers/vhost/net.c > +++ b/drivers/vhost/net.c > @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net > *net, unsigned int sz, > /* Avoid direct reclaim but allow kswapd to wake */ > pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | > __GFP_COMP | __GFP_NOWARN | > - __GFP_NORETRY, > + __GFP_NORETRY | __GFP_NOMEMALLOC, > SKB_FRAG_PAGE_ORDER); > if (likely(pfrag->page)) { > pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 9a16305cf985..1f0b36dd81b5 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4693,8 +4693,8 @@ static struct page *__page_frag_cache_refill(struct > page_frag_cache *nc, > gfp_t gfp = gfp_mask; > > #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) > - gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY | > - __GFP_NOMEMALLOC; > + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP | > +__GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC; > page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, > PAGE_FRAG_CACHE_MAX_ORDER); > nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE; > diff --git a/net/core/sock.c b/net/core/sock.c > index 446e945f736b..d643332c3ee5 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -2900,7 +2900,7 @@ bool skb_page_frag_refill(unsigned int sz, struct > page_frag *pfrag, gfp_t gfp) > /* Avoid direct reclaim but allow kswapd to wake */ > pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | > __GFP_COMP | __GFP_NOWARN | > - __GFP_NORETRY, > + __GFP_NORETRY | __GFP_NOMEMALLOC, > SKB_FRAG_PAGE_ORDER); > if (likely(pfrag->page)) { > pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; Looks fine to me. One thing you may want to consider would be to place this all in an inline function that could just consolidate all the code. Reviewed-by: Alexander Duyck
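A sketch of the inline helper Alexander suggests: one shared function applying the unified gfp policy for the high-order attempt, with callers falling back to order 0 on failure. The name and placement are hypothetical, not part of the patch.

static inline struct page *page_frag_alloc_pages(gfp_t gfp, unsigned int order)
{
        /* Avoid direct reclaim but allow kswapd to wake; stay out of the
         * emergency reserves; give up quickly and quietly on failure. */
        return alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
                           __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
                           order);
}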
Re: [RFC PATCH 4/5] regulator: add 88pm88x regulators driver
On Thu, Dec 28, 2023 at 10:39:13AM +0100, Karel Balej wrote: > @@ -68,6 +68,21 @@ static struct mfd_cell pm886_devs[] = { > .num_resources = ARRAY_SIZE(pm88x_onkey_resources), > .resources = pm88x_onkey_resources, > }, > + { > + .name = "88pm88x-regulator", > + .id = PM88X_REGULATOR_ID_LDO2, > + .of_compatible = "marvell,88pm88x-regulator", > + }, Why are we adding an of_compatible here? It's redundant, the MFD split is a feature of Linux internals not of the hardware, and the existing 88pm8xx MFD doesn't use them.
Re: [PATCH v1 2/5] livepatch: Add klp-convert tool
On Mon 2023-11-06 17:25:10, Lukas Hruska wrote: > Livepatches need to access external symbols which can't be handled > by the normal relocation mechanism. It is needed for two types > of symbols: > > --- /dev/null > +++ b/scripts/livepatch/klp-convert.c > @@ -0,0 +1,283 @@ [...] > +/* > + * Formats name of klp rela symbol based on another given section (@oldsec) > + * and object (@obj_name) name, then returns it > + */ > +static char *alloc_klp_rela_name(struct section *oldsec, > + char *target_objname, struct elf *klp_elf) > +{ Nit: Please, use @lp_obj_name instead of @target_objname. It would make it consistent with the caller: klp_rela_name = alloc_klp_rela_name(oldsec, lp_obj_name, klp_elf); and also with #define KLP_RELOC_SYMBOL_POS(LP_OBJ_NAME, SYM_OBJ_NAME, SYM_NAME, SYM_POS) > + char *klp_rela_name; + unsigned int length; + int err; Just for the record. The name "lp_obj_name" came from a discussion between me and Lukas about the KLP_RELOC_SYMBOL_POS() parameter names. The macro has two object name parameters. And LP_OBJ_NAME aka LivePatched_OBJ_NAME looked better than TARGET_OBJ_NAME. The name "TARGET" is too ambiguous to me. Best Regards, Petr
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Fri, 5 Jan 2024 15:26:28 +0100 Christian Brauner wrote: > On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > > From: "Steven Rostedt (Google)" > > > > Instead of walking the dentries on mount/remount to update the gid values of > > all the dentries if a gid option is specified on mount, just update the root > > inode. Add .getattr, .setattr, and .permissions on the tracefs inode > > operations to update the permissions of the files and directories. > > > > For all files and directories in the top level instance: > > > > /sys/kernel/tracing/* > > > > It will use the root inode as the default permissions. The inode that > > represents: /sys/kernel/tracing (or wherever it is mounted). > > > > When an instance is created: > > > > mkdir /sys/kernel/tracing/instance/foo > > > > The directory "foo" and all its files and directories underneath will use > > the default of what foo is when it was created. A remount of tracefs will > > not affect it. > > That kinda sounds like eventfs should actually be a separate filesystem. > But I don't know enough about the relationship between the two concepts. It may someday become one, as eventfs is used by perf where the rest of the tracefs system is not. > > > > > If a user were to modify the permissions of any file or directory in > > tracefs, it will also no longer be modified by a change in ownership of a > > remount. > > Very odd semantics and I would recommend to avoid that. It's just plain > weird imo. > > > > > The events directory, if it is in the top level instance, will use the > > tracefs root inode as the default ownership for itself and all the files and > > directories below it. > > > > For the events directory in an instance ("foo"), it will keep the ownership > > of what it was when it was created, and that will be used as the default > > ownership for the files and directories beneath it. > > > > Link: > > https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/ > > > > Signed-off-by: Steven Rostedt (Google) > > --- > > So tracefs supports remounting with different uid/gid mount options and > then actually wades through _all_ of the inodes and changes their > ownership internally? What's the use-case for this? Containers? No, in fact tracing doesn't work well with containers as tracing is global to the entire machine. It can work with privileged containers though. The reason for this is because tracefs was based off of debugfs where the files and directories are created at boot up and mounted later. The reason to do this was to allow users to mount with gid=GID to allow a given group to have access to tracing. Without this update, tracefs would ignore it like debugfs and proc do today. I think it's time I explain the purpose of tracefs and how it came to be. The tracing system required a way to control tracing and read the traces. It could have just used a new system like perf (although /sys/kernel/debug/tracing predates perf), where it created a single ioctl() like system call to do everything. As the ftrace tracing came from PREEMPT_RT latency tracer and my own logdev tracer, which both have an embedded background, I chose an interface that could work with just an unmodified version of busybox. That is, I wanted it to work with just cat and echo. The main difference with tracefs compared to other file systems is that it is a control interface, where writes happen as much as reads. The data read is controlled. The closest thing I can think of is how cgroups work.
As tracing is a privileged operation, but something that could be changed to allow a group to have access to, I wanted to make it easy for an admin to decide who gets to do what at boot up via the /etc/fstab file. > > Aside from optimizing this and the special semantics for this eventfs > stuff that you really should think twice of doing, here's one idea for > an extension that might alleviate some of the pain: > > If you need flexible dynamic ownership change to e.g., be able to > delegate (all, a directory, a single file of) tracefs to > unprivileged/containers/whatever then you might want to consider > supporting idmapped mounts for tracefs. Because then you can do stuff > like: I'm a novice here and have no idea on how id maps work ;-) > > user1@localhost:~/data/scripts$ sudo mount --bind -o > X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt > user1@localhost:~/data/scripts$ ls -ln /run/ > total 12 > drwxr-xr-x 2 0 0 40 Jan 5 12:12 credentials > drwx-- 2 0 0 40 Jan 5 11:57 cryptsetup > drwxr-xr-x 2 0 0 60 Jan 5 11:57 dbus > drwx-- 6 0 0 280 Jan 5 11:57 incus_agent > prw--- 1 0 00 Jan 5 11:57 initctl > drwxrwxrwt 4 0 0 80 Jan 5 11:57 lock > drwxr-xr-x 3 0 0 60 Jan 5 11:57 log > drwx-- 2 0 0 40 Jan 5 11:57 lvm > -r--r--r-- 1 0 0 33 Jan 5 11:57 machine-id > -rw-r--r-- 1 0 0 101 Jan 5 11:58 motd.dynamic > drwxr-xr-x
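Concretely, the boot-time delegation described above is a one-line /etc/fstab entry; the gid value here is hypothetical, and tracefs only needs the gid= (and optionally uid=/mode=) mount option:

# /etc/fstab: give group 4711 access to the tracing control files at boot
tracefs  /sys/kernel/tracing  tracefs  nosuid,nodev,noexec,gid=4711  0  0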
[PATCH 2/2] arm64: dts: qcom: sm7225-fairphone-fp4: Add PM6150L thermals
Configure the thermals for the PA_THERM1, MSM_THERM, PA_THERM0, RFC_CAM_THERM, CAM_FLASH_THERM and QUIET_THERM thermistors connected to PM6150L. Due to hardware constraints we can only register 4 zones with pm6150l_adc_tm, the other 2 we can register via generic-adc-thermal. The trip points can really only be considered as placeholders, more configuration with cooling etc. can be added later. Signed-off-by: Luca Weiss --- arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts | 191 ++ 1 file changed, 191 insertions(+) diff --git a/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts b/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts index b7ccfe4011bb..6f435a7ed855 100644 --- a/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts +++ b/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts @@ -84,6 +84,20 @@ memory@efe01000 { }; }; + msm_therm_sensor: thermal-sensor-msm { + compatible = "generic-adc-thermal"; + #thermal-sensor-cells = <0>; + io-channels = <&pm6150l_adc ADC5_AMUX_THM2_100K_PU>; + io-channel-names = "sensor-channel"; + }; + + rear_cam_sensor: thermal-sensor-rear-cam { + compatible = "generic-adc-thermal"; + #thermal-sensor-cells = <0>; + io-channels = <&pm6150l_adc ADC5_GPIO2_100K_PU>; + io-channel-names = "sensor-channel"; + }; + thermal-zones { chg-skin-thermal { polling-delay-passive = <0>; @@ -113,6 +127,90 @@ active-config0 { }; }; + pa0-thermal { + polling-delay-passive = <0>; + polling-delay = <0>; + thermal-sensors = <&pm6150l_adc_tm 1>; + + trips { + active-config0 { + temperature = <125000>; + hysteresis = <1000>; + type = "passive"; + }; + }; + }; + + pa1-thermal { + polling-delay-passive = <0>; + polling-delay = <0>; + thermal-sensors = <&pm6150l_adc_tm 0>; + + trips { + active-config0 { + temperature = <125000>; + hysteresis = <1000>; + type = "passive"; + }; + }; + }; + + quiet-thermal { + polling-delay-passive = <0>; + polling-delay = <0>; + thermal-sensors = <&pm6150l_adc_tm 3>; + + trips { + active-config0 { + temperature = <125000>; + hysteresis = <1000>; + type = "passive"; + }; + }; + }; + + rear-cam-thermal { + polling-delay-passive = <1000>; + polling-delay = <5000>; + thermal-sensors = <&rear_cam_sensor>; + + trips { + active-config0 { + temperature = <125000>; + hysteresis = <1000>; + type = "passive"; + }; + }; + }; + + rfc-flash-thermal { + polling-delay-passive = <0>; + polling-delay = <0>; + thermal-sensors = <&pm6150l_adc_tm 2>; + + trips { + active-config0 { + temperature = <125000>; + hysteresis = <1000>; + type = "passive"; + }; + }; + }; + + sdm-skin-thermal { + polling-delay-passive = <1000>; + polling-delay = <5000>; + thermal-sensors = <&msm_therm_sensor>; + + trips { + active-config0 { + temperature = <125000>; + hysteresis = <1000>; + type = "passive"; + }; + }; + }; + xo-thermal { polling-delay-passive = <0>;
[PATCH 1/2] arm64: dts: qcom: sm7225-fairphone-fp4: Add PMK8003 thermals
Configure the thermals for the XO_THERM thermistor connected to the PMK8003 (which is called PMK8350 in software). The ADC configuration for PMK8350_ADC7_AMUX_THM1_100K_PU has already been added in the past. The trip points can really only be considered as placeholders, more configuration with cooling etc. can be added later. Signed-off-by: Luca Weiss --- arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts | 25 +++ 1 file changed, 25 insertions(+) diff --git a/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts b/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts index ade619805519..b7ccfe4011bb 100644 --- a/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts +++ b/arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts @@ -112,6 +112,20 @@ active-config0 { }; }; }; + + xo-thermal { + polling-delay-passive = <0>; + polling-delay = <0>; + thermal-sensors = <&pmk8350_adc_tm 0>; + + trips { + active-config0 { + temperature = <125000>; + hysteresis = <1000>; + type = "passive"; + }; + }; + }; }; }; @@ -490,6 +504,17 @@ conn-therm@1 { }; }; +&pmk8350_adc_tm { + status = "okay"; + + xo-therm@0 { + reg = <0>; + io-channels = <&pmk8350_vadc PMK8350_ADC7_AMUX_THM1_100K_PU>; + qcom,ratiometric; + qcom,hw-settle-time-us = <200>; + }; +}; + &pmk8350_rtc { status = "okay"; }; -- 2.43.0
[PATCH 0/2] More thermal configuration for Fairphone 4
Add the thermal configuration for the thermistors connected to PMK8003 and PM6150L. With that all the external thermistors on the phone should be present in the dts. Signed-off-by: Luca Weiss --- Luca Weiss (2): arm64: dts: qcom: sm7225-fairphone-fp4: Add PMK8003 thermals arm64: dts: qcom: sm7225-fairphone-fp4: Add PM6150L thermals arch/arm64/boot/dts/qcom/sm7225-fairphone-fp4.dts | 216 ++ 1 file changed, 216 insertions(+) --- base-commit: 50c107ee601a3d57e4fb13a9ecd9cf6c8df187f2 change-id: 20240105-fp4-thermals-f3c8e0417b70 Best regards, -- Luca Weiss
Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function
On Thu, Jan 04, 2024 at 01:11:15PM -0600, Haitao Huang wrote: > Hi Dave, > > On Wed, 03 Jan 2024 10:37:35 -0600, Dave Hansen > wrote: > > > On 12/18/23 13:24, Haitao Huang wrote:> @Dave and @Michal, Your > > thoughts? Or could you confirm we should not > > > do reclaim per cgroup at all? > > What's the benefit of doing reclaim per cgroup? Is that worth the extra > > complexity? > > > > Without reclaiming per cgroup, we have to always set the limit to the > enclave's peak usage. This may not be efficient utilization as in many cases > each enclave can perform fine with the EPC limit set less than peak. Basically > each group cannot give up some pages for the greater good without dying :-) +1. This is exactly my thinking too. The per cgroup reclaiming is important for the containers use case we are working on. I also think it makes the limit more meaningful: the per-container pool of EPC pages to use (which is independent of the enclave size). > > Also with enclaves enabled with EDMM, the peak usage is not static, so it is hard > to determine upfront. Hence it might be an operation/deployment > inconvenience. > > In case of over-committing (sum of limits > total capacity), one cgroup at > peak usage may require swapping pages out in a different cgroup if the system is > overloaded at that time. > > > The key question here is whether we want the SGX VM to be complex and > > more like the real VM or simple when a cgroup hits its limit. Right? > > > > Although it's fair to say the majority of the complexity of this series is in > support for reclaiming per cgroup, I think it's manageable and much less > than the real VM after we removed the enclave killing parts: the only extra > effort is to track pages in a separate list and reclaim them separately, as > opposed to tracking them on a global list and reclaiming together. The main reclaiming > loop code is still pretty much the same as before. > > > > If stopping at patch 5 and having less code is even remotely an option, > > why not do _that_? > > > I hope I described the limitations clearly enough above. > If those are OK with users and also make it acceptable for merge quickly, You explained the gaps very well already. I don't think the simple version without per-cgroup reclaiming is enough for the container case. Mikko
Re: [PATCH] tracefs/eventfs: Use root and instance inodes as default ownership
On Wed, Jan 03, 2024 at 08:32:46PM -0500, Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > Instead of walking the dentries on mount/remount to update the gid values of > all the dentries if a gid option is specified on mount, just update the root > inode. Add .getattr, .setattr, and .permissions on the tracefs inode > operations to update the permissions of the files and directories. > > For all files and directories in the top level instance: > > /sys/kernel/tracing/* > > It will use the root inode as the default permissions. The inode that > represents: /sys/kernel/tracing (or wherever it is mounted). > > When an instance is created: > > mkdir /sys/kernel/tracing/instance/foo > > The directory "foo" and all its files and directories underneath will use > the default of what foo is when it was created. A remount of tracefs will > not affect it. That kinda sounds like eventfs should actually be a separate filesystem. But I don't know enough about the relationship between the two concepts. > > If a user were to modify the permissions of any file or directory in > tracefs, it will also no longer be modified by a change in ownership of a > remount. Very odd semantics and I would recommend to avoid that. It's just plain weird imo. > > The events directory, if it is in the top level instance, will use the > tracefs root inode as the default ownership for itself and all the files and > directories below it. > > For the events directory in an instance ("foo"), it will keep the ownership > of what it was when it was created, and that will be used as the default > ownership for the files and directories beneath it. > > Link: > https://lore.kernel.org/linux-trace-kernel/CAHk-=wjvdgkjdxbbvln2wbznqp4ush46e3gqj9m7ug6dpx2...@mail.gmail.com/ > > Signed-off-by: Steven Rostedt (Google) > --- So tracefs supports remounting with different uid/gid mount options and then actually wades through _all_ of the inodes and changes their ownership internally? What's the use-case for this? Containers? Aside from optimizing this and the special semantics for this eventfs stuff that you really should think twice of doing, here's one idea for an extension that might alleviate some of the pain: If you need flexible dynamic ownership change to e.g., be able to delegate (all, a directory, a single file of) tracefs to unprivileged/containers/whatever then you might want to consider supporting idmapped mounts for tracefs. 
Because then you can do stuff like: user1@localhost:~/data/scripts$ sudo mount --bind -o X-mount.idmap='g:0:1000:1 u:0:1234:1' /run/ /mnt user1@localhost:~/data/scripts$ ls -ln /run/ total 12 drwxr-xr-x 2 0 0 40 Jan 5 12:12 credentials drwx-- 2 0 0 40 Jan 5 11:57 cryptsetup drwxr-xr-x 2 0 0 60 Jan 5 11:57 dbus drwx-- 6 0 0 280 Jan 5 11:57 incus_agent prw--- 1 0 00 Jan 5 11:57 initctl drwxrwxrwt 4 0 0 80 Jan 5 11:57 lock drwxr-xr-x 3 0 0 60 Jan 5 11:57 log drwx-- 2 0 0 40 Jan 5 11:57 lvm -r--r--r-- 1 0 0 33 Jan 5 11:57 machine-id -rw-r--r-- 1 0 0 101 Jan 5 11:58 motd.dynamic drwxr-xr-x 2 0 0 40 Jan 5 11:57 mount drwx-- 2 0 0 40 Jan 5 11:57 multipath drwxr-xr-x 2 0 0 40 Jan 5 11:57 sendsigs.omit.d lrwxrwxrwx 1 0 08 Jan 5 11:57 shm -> /dev/shm drwx--x--x 2 0 0 40 Jan 5 11:57 sudo drwxr-xr-x 24 0 0 660 Jan 5 14:30 systemd drwxr-xr-x 6 0 0 140 Jan 5 14:30 udev drwxr-xr-x 4 0 0 80 Jan 5 11:58 user -rw-rw-r-- 1 0 43 2304 Jan 5 15:15 utmp user1@localhost:~/data/scripts$ ls -ln /mnt/ total 12 drwxr-xr-x 2 1234 1000 40 Jan 5 12:12 credentials drwx-- 2 1234 1000 40 Jan 5 11:57 cryptsetup drwxr-xr-x 2 1234 1000 60 Jan 5 11:57 dbus drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 incus_agent prw--- 1 1234 10000 Jan 5 11:57 initctl drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 lock drwxr-xr-x 3 1234 1000 60 Jan 5 11:57 log drwx-- 2 1234 1000 40 Jan 5 11:57 lvm -r--r--r-- 1 1234 1000 33 Jan 5 11:57 machine-id -rw-r--r-- 1 1234 1000 101 Jan 5 11:58 motd.dynamic drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 mount drwx-- 2 1234 1000 40 Jan 5 11:57 multipath drwxr-xr-x 2 1234 1000 40 Jan 5 11:57 sendsigs.omit.d lrwxrwxrwx 1 1234 10008 Jan 5 11:57 shm -> /dev/shm drwx--x--x 2 1234 1000 40 Jan 5 11:57 sudo drwxr-xr-x 24 1234 1000 660 Jan 5 14:30 systemd drwxr-xr-x 6 1234 1000 140 Jan 5 14:30 udev drwxr-xr-x 4 1234 1000 80 Jan 5 11:58 user -rw-rw-r-- 1 1234 65534 2304 Jan 5 15:15 utmp Where you can see that ownership of this tmpfs instance in this example is changed. I'm not trying to advocate here but this will probably ultimately be nicer for your users because it means that a container manager or whatever can be handed a part of tracefs (or all of it) and the ownership and access rights for that thing is correct. And you can get rid of that gid based access completely. You can change uids, gids, or both. You can spe
Re: [PATCH v1 5/5] documentation: Update on livepatch elf format
On Mon 2023-11-06 17:25:13, Lukas Hruska wrote: > Add a section to Documentation/livepatch/module-elf-format.rst > describing how klp-convert works for fixing relocations. > > Signed-off-by: Lukas Hruska Looks good to me: Reviewed-by: Petr Mladek Best Regards, Petr
Re: [PATCH v1 4/5] livepatch: Add sample livepatch module
On Mon 2023-11-06 17:25:12, Lukas Hruska wrote: > From: Josh Poimboeuf > > Add a new livepatch sample in samples/livepatch/ to make use of symbols > that must be post-processed to enable load-time relocation resolution. > As the new sample is to be used as an example, it is annotated with > the KLP_RELOC_SYMBOL macro. > > The livepatch sample updates the function cmdline_proc_show to print the > string referenced by the symbol saved_command_line appended with the > string "livepatch=1". > > Signed-off-by: Josh Poimboeuf > Signed-off-by: Lukas Hruska > --- > samples/livepatch/Makefile| 1 + > .../livepatch/livepatch-annotated-sample.c| 84 +++ The name is ambiguous. I would use something like livepatch-extern-symbol.c Also it would be great to prepare a selftest. In this case, I would suggest to livepatch a symbol from another test module so that it does not modify the running system and the result is predictable. Otherwise it looks good. With a better module name: Reviewed-by: Petr Mladek Best Regards, Petr
Re: [PATCH v1 3/5] kbuild/modpost: integrate klp-convert
On Mon 2023-11-06 17:25:11, Lukas Hruska wrote: > From: Josh Poimboeuf > > Update the modpost program so that it does not warn about unresolved > symbols matching the expected format which will be then resolved by > klp-convert. This is only one part. The main change is that klp-convert is called for livepatch modules after the final linking. I would write something like: Call klp-convert for the livepatch modules after the final linking. Also update the modpost tool so that it does not warn about unresolved symbols matching the expected format which will be then resolved by klp-convert. > Signed-off-by: Josh Poimboeuf > Signed-off-by: Lukas Hruska Otherwise the code looks good. With the updated commit message: Reviewed-by: Petr Mladek Best Regards, Petr
Re: [PATCH v1 2/5] livepatch: Add klp-convert tool
On Mon 2023-11-06 17:25:10, Lukas Hruska wrote: > Livepatches need to access external symbols which can't be handled > by the normal relocation mechanism. It is needed for two types > of symbols: > > + Symbols which can be local for the original livepatched function. > The alternative implementation in the livepatch sees them > as external symbols. > > + Symbols in modules which are exported via EXPORT_SYMBOL*(). They > must be handled special way otherwise the livepatch module would > depend on the livepatched one. Loading such livepatch would cause > loading the other module as well. > > The address of these symbols can be found via kallsyms. Or they can Please, remove the extra space at the end of the line. > be relocated using livepatch specific relocation sections as specified > in Documentation/livepatch/module-elf-format.txt. > > Currently, there is no trivial way to embed the required information as > requested in the final livepatch elf object. klp-convert solves this > problem by using annotations in the elf object to convert the relocation > accordingly to the specification, enabling it to be handled by the > livepatch loader. > > Given the above, create scripts/livepatch to hold tools developed for > livepatches and add source files for klp-convert there. > > Allow to annotate such external symbols in the livepatch by a macro > KLP_RELOC_SYMBOL(). It will create symbol with all needed > metadata. For example: > > extern char *saved_command_line \ > KLP_RELOC_SYMBOL(vmlinux, vmlinux, saved_command_line, 0); > > would create symbol > > $>readelf -r -W : > Relocation section '.rela.text' at offset 0x32e60 contains 10 entries: > Offset Info Type Symbol's Value > Symbol's Name + Addend > [...] > 0068 003c0002 R_X86_64_PC32 > .klp.sym.rela.vmlinux.vmlinux.saved_command_line,0 - 4 > [...] > > > Also add scripts/livepatch/klp-convert. The tool transforms symbols > created by KLP_RELOC_SYMBOL() to object specific rela sections > and rela entries which would later be proceed when the livepatch > or the livepatched object is loaded. > > For example, klp-convert would replace the above symbols with: s/above symbols/above symbol/ > $> readelf -r -W > Relocation section '.klp.rela.vmlinux.text' at offset 0x5cb60 contains 1 > entry: > Offset Info Type Symbol's Value > Symbol's Name + Addend > 0068 003c0002 R_X86_64_PC32 > .klp.sym.vmlinux.saved_command_line,0 - 4 > > klp-convert relies on libelf and on a list implementation. Add files > scripts/livepatch/elf.c and scripts/livepatch/elf.h, which are a libelf > interfacing layer and scripts/livepatch/list.h, which is a list > implementation. > > Update Makefiles to correctly support the compilation of the new tool, > update MAINTAINERS file and add a .gitignore file. > > --- > MAINTAINERS | 1 + > include/linux/livepatch.h | 19 + > scripts/Makefile| 1 + > scripts/livepatch/.gitignore| 1 + > scripts/livepatch/Makefile | 5 + > scripts/livepatch/elf.c | 817 > scripts/livepatch/elf.h | 73 +++ I see a similar code in tools/objtool/elf.c tools/objtool/include/objtool/elf.h Both variants have been written by Josh. I wonder if we could share one implementation. Josh? 
> scripts/livepatch/klp-convert.c | 283 +++ > scripts/livepatch/klp-convert.h | 42 ++ > scripts/livepatch/list.h| 391 +++ And probably also the list.h > 10 files changed, 1633 insertions(+) > create mode 100644 scripts/livepatch/.gitignore > create mode 100644 scripts/livepatch/Makefile > create mode 100644 scripts/livepatch/elf.c > create mode 100644 scripts/livepatch/elf.h > create mode 100644 scripts/livepatch/klp-convert.c > create mode 100644 scripts/livepatch/klp-convert.h > create mode 100644 scripts/livepatch/list.h > > --- /dev/null > +++ b/scripts/livepatch/klp-convert.c > @@ -0,0 +1,283 @@ [...] > +/* Converts rela symbol names */ > +static bool convert_symbol(struct symbol *s) > +{ > + char lp_obj_name[MODULE_NAME_LEN]; > + char sym_obj_name[MODULE_NAME_LEN]; > + char sym_name[KSYM_NAME_LEN]; > + char *klp_sym_name; > + unsigned long sym_pos; > + int poslen; > + unsigned int length; > + > + static_assert(MODULE_NAME_LEN <= 56, "Update limit in the below > sscanf()"); IMHO, there should be "< 56" instead of "<= 56". The sscanf is limited by %55. Also we should check KSYM_NAME_LEN. Similar to the check in klp_resolve_symbols() static_assert(MODULE_NAME_LEN < 56 || KSYM_NAME_LEN != 512, "Update limit in the below sscanf()"); > + > + if (sscanf(s->name, KLP_SYM_RELA_PREFIX "%55[^.].%55[^.].%511[^,],%lu", +
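For orientation, a simplified sketch of the mechanism behind KLP_RELOC_SYMBOL(), inferred from the readelf output quoted earlier in the thread; the exact macro text in the patch may differ. The asm() label renames the declared symbol so that klp-convert can later recognize and convert it:

#define KLP_SYM_RELA_PREFIX ".klp.sym.rela."

#define KLP_RELOC_SYMBOL_POS(LP_OBJ_NAME, SYM_OBJ_NAME, SYM_NAME, SYM_POS) \
        asm("\"" KLP_SYM_RELA_PREFIX #LP_OBJ_NAME "." #SYM_OBJ_NAME \
            "." #SYM_NAME "," #SYM_POS "\"")

#define KLP_RELOC_SYMBOL(LP_OBJ_NAME, SYM_OBJ_NAME, SYM_NAME) \
        KLP_RELOC_SYMBOL_POS(LP_OBJ_NAME, SYM_OBJ_NAME, SYM_NAME, 0)

/* Usage, as in the commit message: */
extern char *saved_command_line
        KLP_RELOC_SYMBOL(vmlinux, vmlinux, saved_command_line);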
[PATCH v9] bus: mhi: host: Add tracing support
This change adds ftrace support for the following functions, which helps in debugging issues when there is a channel state or MHI state change and also when we receive data and control events: 1. mhi_intvec_mhi_states 2. mhi_process_data_event_ring 3. mhi_process_ctrl_ev_ring 4. mhi_gen_tre 5. mhi_update_channel_state 6. mhi_tryset_pm_state 7. mhi_pm_st_worker Change the implementation of the arrays which hold the enum-to-string mappings to make them consistent between the trace header file and the other files. Wherever trace events are added, debug messages are removed. Signed-off-by: Krishna chaitanya chundru Reviewed-by: "Steven Rostedt (Google)" --- Changes in v9: - Change the implementation of some arrays so that the string-to-enum mapping is the same in both the trace header and the other files, as suggested by Steve. - Link to v8: https://lore.kernel.org/r/20231207-ftrace_support-v8-1-7f62d4558...@quicinc.com Changes in v8: - Pass the structure and dereference the variables in TP_fast_assign as suggested by steve - Link to v7: https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com Changes in v7: - change log format as pointed by mani. - Link to v6: https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com Changes in v6: - use 'rp' directly as suggested by jeffrey. - Link to v5: https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com Changes in v5: - Use DECLARE_EVENT_CLASS for multiple events as suggested by steve. - Instead of converting to u64 to print the address, use %px to print the address to avoid warnings on some platforms. - Link to v4: https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com Changes in v4: - Fix compilation issues in previous patch which happened due to rebasing. - In the defconfig, the FTRACE config is not enabled; due to that, the compilation issue was not seen in my workspace. - Link to v3: https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com Changes in v3: - move trace header file from include/trace/events to drivers/bus/mhi/host/ so that we can include driver header files. - Use macros directly in the trace events as suggested by Jeffrey Hugo. - Reorder the structure in the events as suggested by steve to avoid holes in the buffer. - removed the mhi_to_physical function as this can give security issues. - removed macros to define strings as we can get those from driver headers. - Link to v2: https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com Changes in v2: - Passing the raw state into the trace event and using __print_symbolic() as suggested by bjorn. - Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by bjorn. - Fixed the kernel test robot issues.
- Link to v1: https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com --- drivers/bus/mhi/common.h| 38 +++--- drivers/bus/mhi/host/init.c | 63 + drivers/bus/mhi/host/internal.h | 40 ++ drivers/bus/mhi/host/main.c | 19 ++- drivers/bus/mhi/host/pm.c | 7 +- drivers/bus/mhi/host/trace.h| 275 6 files changed, 378 insertions(+), 64 deletions(-) diff --git a/drivers/bus/mhi/common.h b/drivers/bus/mhi/common.h index f794b9c8049e..dda340aaed95 100644 --- a/drivers/bus/mhi/common.h +++ b/drivers/bus/mhi/common.h @@ -297,30 +297,30 @@ struct mhi_ring_element { __le32 dword[2]; }; +#define MHI_STATE_LIST \ + mhi_state(RESET,"RESET")\ + mhi_state(READY,"READY")\ + mhi_state(M0, "M0") \ + mhi_state(M1, "M1") \ + mhi_state(M2, "M2") \ + mhi_state(M3, "M3") \ + mhi_state(M3_FAST, "M3_FAST") \ + mhi_state(BHI, "BHI") \ + mhi_state_end(SYS_ERR, "SYS ERROR") + +#undef mhi_state +#undef mhi_state_end + +#define mhi_state(a, b)case MHI_STATE_##a: return b; +#define mhi_state_end(a, b)case MHI_STATE_##a: return b; + static inline const char *mhi_state_str(enum mhi_state state) { switch (state) { - case MHI_STATE_RESET: - return "RESET"; - case MHI_STATE_READY: - return "READY"; - case MHI_STATE_M0: - return "M0"; - case MHI_STATE_M1: - return "M1"; - case MHI_STATE_M2: - return "M2"; - case MHI_STATE_M3: - return "M3"; - case MHI_STATE_M3_FAST: - return "M3 FAST"; - case MHI_STATE_BHI: - return "BHI"; - case MHI_STATE_SYS_ERR: - return "SYS ERROR"; + MHI_STATE_LIST default: return "Unknown state"; } -}; +} #endif /* _MHI_COMMON_H */ diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c in
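The point of the list-macro rework above: the same MHI_STATE_LIST can be expanded a second time in trace.h to feed __print_symbolic(), so the switch-based string lookup and the trace-event symbol table can never drift apart. A sketch of that second expansion (the exact form in trace.h may differ):

#undef mhi_state
#undef mhi_state_end

#define mhi_state(a, b)         { MHI_STATE_##a, b },
#define mhi_state_end(a, b)     { MHI_STATE_##a, b }

#define show_mhi_state(state)   __print_symbolic(state, MHI_STATE_LIST)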
Re: [PATCH v8] bus: mhi: host: Add tracing support
On 1/5/2024 10:47 AM, Baochen Qiang wrote: On 1/4/2024 12:47 PM, Krishna Chaitanya Chundru wrote: Hi Steven, Can you please review this? Thanks & Regards, Krishna Chaitanya. On 12/7/2023 10:00 AM, Krishna chaitanya chundru wrote: This change adds ftrace support for the following functions, which helps in debugging issues when there is a channel state or MHI state change and also when we receive data and control events: 1. mhi_intvec_mhi_states 2. mhi_process_data_event_ring 3. mhi_process_ctrl_ev_ring 4. mhi_gen_tre 5. mhi_update_channel_state 6. mhi_tryset_pm_state 7. mhi_pm_st_worker Wherever trace events are added, debug messages are removed. Is there a reason why debug messages have to be removed? From the view of the MHI controller, we often need MHI logs to debug, so is it possible to preserve those logs? The debug messages are being replaced by the trace events, and it would be preferred to have one mechanism or the other, not both. You will still see these logs in traces. - Krishna Chaitanya. Signed-off-by: Krishna chaitanya chundru --- Changes in v8: - Pass the structure and dereference the variables in TP_fast_assign as suggested by steve - Link to v7: https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com Changes in v7: - change log format as pointed by mani. - Link to v6: https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com Changes in v6: - use 'rp' directly as suggested by jeffrey. - Link to v5: https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com Changes in v5: - Use DECLARE_EVENT_CLASS for multiple events as suggested by steve. - Instead of converting to u64 to print the address, use %px to print the address to avoid warnings on some platforms. - Link to v4: https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com Changes in v4: - Fix compilation issues in previous patch which happened due to rebasing. - In the defconfig, the FTRACE config is not enabled; due to that, the compilation issue was not seen in my workspace. - Link to v3: https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com Changes in v3: - move trace header file from include/trace/events to drivers/bus/mhi/host/ so that we can include driver header files. - Use macros directly in the trace events as suggested by Jeffrey Hugo. - Reorder the structure in the events as suggested by steve to avoid holes in the buffer. - removed the mhi_to_physical function as this can give security issues. - removed macros to define strings as we can get those from driver headers. - Link to v2: https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com Changes in v2: - Passing the raw state into the trace event and using __print_symbolic() as suggested by bjorn. - Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by bjorn. - Fixed the kernel test robot issues.
- Link to v1: https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com --- drivers/bus/mhi/host/init.c | 3 + drivers/bus/mhi/host/main.c | 19 ++-- drivers/bus/mhi/host/pm.c | 7 +- drivers/bus/mhi/host/trace.h | 205 +++ 4 files changed, 221 insertions(+), 13 deletions(-) diff --git a/drivers/bus/mhi/host/init.c b/drivers/bus/mhi/host/init.c index f78aefd2d7a3..6acb85f4c5f8 100644 --- a/drivers/bus/mhi/host/init.c +++ b/drivers/bus/mhi/host/init.c @@ -20,6 +20,9 @@ #include #include "internal.h" +#define CREATE_TRACE_POINTS +#include "trace.h" + static DEFINE_IDA(mhi_controller_ida); const char * const mhi_ee_str[MHI_EE_MAX] = { diff --git a/drivers/bus/mhi/host/main.c b/drivers/bus/mhi/host/main.c index dcf627b36e82..189f4786403e 100644 --- a/drivers/bus/mhi/host/main.c +++ b/drivers/bus/mhi/host/main.c @@ -15,6 +15,7 @@ #include #include #include "internal.h" +#include "trace.h" int __must_check mhi_read_reg(struct mhi_controller *mhi_cntrl, void __iomem *base, u32 offset, u32 *out) @@ -491,11 +492,8 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv) state = mhi_get_mhi_state(mhi_cntrl); ee = mhi_get_exec_env(mhi_cntrl); - dev_dbg(dev, "local ee: %s state: %s device ee: %s state: %s\n", - TO_MHI_EXEC_STR(mhi_cntrl->ee), - mhi_state_str(mhi_cntrl->dev_state), - TO_MHI_EXEC_STR(ee), mhi_state_str(state)); + trace_mhi_intvec_states(mhi_cntrl, ee, state); if (state == MHI_STATE_SYS_ERR) { dev_dbg(dev, "System error detected\n"); pm_state = mhi_tryset_pm_state(mhi_cntrl, @@ -832,6 +830,8 @@ int mhi_process_ctrl_ev_ring(struct mhi_controller *mhi_cntrl, while (dev_rp != local_rp) { enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp); + trace_mhi_ctrl_event(mhi_cntrl, local_rp); + switch (type) { case MHI_PKT_TYPE_BW_REQ_EVENT: { @@ -997,6 +
[PATCH v2] virtiofs: use GFP_NOFS when enqueuing request through kworker
From: Hou Tao When invoking virtio_fs_enqueue_req() through kworker, both the allocation of the sg array and the bounce buffer still use GFP_ATOMIC. Considering the size of both the sg array and the bounce buffer may be greater than PAGE_SIZE, use GFP_NOFS instead of GFP_ATOMIC to lower the possibility of memory allocation failure. Signed-off-by: Hou Tao --- Change log: v2: * pass gfp_t instead of bool to virtio_fs_enqueue_req() (Suggested by Matthew) v1: https://lore.kernel.org/linux-fsdevel/20240104015805.2103766-1-hou...@huaweicloud.com fs/fuse/virtio_fs.c | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 3aac31d451985..8cf518624ce9e 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -87,7 +87,8 @@ struct virtio_fs_req_work { }; static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq, -struct fuse_req *req, bool in_flight); +struct fuse_req *req, bool in_flight, +gfp_t gfp); static const struct constant_table dax_param_enums[] = { {"always", FUSE_DAX_ALWAYS }, @@ -383,7 +384,7 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work) list_del_init(&req->list); spin_unlock(&fsvq->lock); - ret = virtio_fs_enqueue_req(fsvq, req, true); + ret = virtio_fs_enqueue_req(fsvq, req, true, GFP_NOFS); if (ret < 0) { if (ret == -ENOMEM || ret == -ENOSPC) { spin_lock(&fsvq->lock); @@ -488,7 +489,7 @@ static void virtio_fs_hiprio_dispatch_work(struct work_struct *work) } /* Allocate and copy args into req->argbuf */ -static int copy_args_to_argbuf(struct fuse_req *req) +static int copy_args_to_argbuf(struct fuse_req *req, gfp_t gfp) { struct fuse_args *args = req->args; unsigned int offset = 0; @@ -502,7 +503,7 @@ static int copy_args_to_argbuf(struct fuse_req *req) len = fuse_len_args(num_in, (struct fuse_arg *) args->in_args) + fuse_len_args(num_out, args->out_args); - req->argbuf = kmalloc(len, GFP_ATOMIC); + req->argbuf = kmalloc(len, gfp); if (!req->argbuf) return -ENOMEM; @@ -1119,7 +1120,8 @@ static unsigned int sg_init_fuse_args(struct scatterlist *sg, /* Add a request to a virtqueue and kick the device */ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq, -struct fuse_req *req, bool in_flight) +struct fuse_req *req, bool in_flight, +gfp_t gfp) { /* requests need at least 4 elements */ struct scatterlist *stack_sgs[6]; @@ -1140,8 +1142,8 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq, /* Does the sglist fit on the stack? */ total_sgs = sg_count_fuse_req(req); if (total_sgs > ARRAY_SIZE(stack_sgs)) { - sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), GFP_ATOMIC); - sg = kmalloc_array(total_sgs, sizeof(sg[0]), GFP_ATOMIC); + sgs = kmalloc_array(total_sgs, sizeof(sgs[0]), gfp); + sg = kmalloc_array(total_sgs, sizeof(sg[0]), gfp); if (!sgs || !sg) { ret = -ENOMEM; goto out; @@ -1149,7 +1151,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq, } /* Use a bounce buffer since stack args cannot be mapped */ - ret = copy_args_to_argbuf(req); + ret = copy_args_to_argbuf(req, gfp); if (ret < 0) goto out; @@ -1245,7 +1247,7 @@ __releases(fiq->lock) fuse_len_args(req->args->out_numargs, req->args->out_args)); fsvq = &fs->vqs[queue_id]; - ret = virtio_fs_enqueue_req(fsvq, req, false); + ret = virtio_fs_enqueue_req(fsvq, req, false, GFP_ATOMIC); if (ret < 0) { if (ret == -ENOMEM || ret == -ENOSPC) { /* -- 2.29.2
Re: [PATCH v1 1/5] livepatch: Create and include UAPI headers
On Mon 2023-11-06 17:25:09, Lukas Hruska wrote: > From: Josh Poimboeuf > > Define klp prefixes in include/uapi/linux/livepatch.h, and use them for > replacing hard-coded values in kernel/livepatch/core.c. > > Signed-off-by: Josh Poimboeuf > Signed-off-by: Lukas Hruska Reviewed-by: Petr Mladek Best Regards, Petr
Re: [PATCH v6 2/3] vduse: Temporarily fail if control queue features requested
On 1/5/24 10:59, Eugenio Perez Martin wrote: On Fri, Jan 5, 2024 at 9:12 AM Maxime Coquelin wrote: On 1/5/24 03:45, Jason Wang wrote: On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin wrote: Virtio-net driver control queue implementation is not safe when used with VDUSE. If the VDUSE application does not reply to control queue messages, it currently ends up hanging the kernel thread sending this command. Some work is on-going to make the control queue implementation robust with VDUSE. Until it is completed, let's fail features check if any control-queue related feature is requested. Signed-off-by: Maxime Coquelin --- drivers/vdpa/vdpa_user/vduse_dev.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index 0486ff672408..94f54ea2eb06 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include "iova_domain.h" @@ -46,6 +47,15 @@ #define IRQ_UNBOUND -1 +#define VDUSE_NET_INVALID_FEATURES_MASK \ + (BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |\ +BIT_ULL(VIRTIO_NET_F_CTRL_RX) | \ +BIT_ULL(VIRTIO_NET_F_CTRL_VLAN) | \ +BIT_ULL(VIRTIO_NET_F_GUEST_ANNOUNCE) | \ +BIT_ULL(VIRTIO_NET_F_MQ) | \ +BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) | \ +BIT_ULL(VIRTIO_NET_F_RSS)) We need to make this as well: VIRTIO_NET_F_CTRL_GUEST_OFFLOADS I missed it, and see others have been added in the Virtio spec repository (BTW, I see this specific one is missing in the dependency list [0], I will submit a patch). I wonder if it is not just simpler to just check for VIRTIO_NET_F_CTRL_VQ is requested. As we fail instead of masking out, the VDUSE driver won't be the one violating the spec so it should be good? It will avoid having to update the mask if new features depending on it are added (or forgetting to update it). WDYT? I think it is safer to work with a whitelist, instead of a blacklist. As any new feature might require code changes in QEMU. Is that possible? Well, that's how it was done in the previous revision. :) I changed to a blacklist for consistency with block device's WCE feature check after Jason's comment. I'm not sure moving back to a whitelist brings much advantage when compared to the effort of keeping it up to date. Just blacklisting VIRTIO_NET_F_CTRL_VQ is enough in my opinion. Thanks, Maxime Thanks, Maxime [0]: https://github.com/oasis-tcs/virtio-spec/blob/5fc35a7efb903fc352da81a6d2be5c01810b68d3/device-types/net/description.tex#L129 Other than this, Acked-by: Jason Wang Thanks + struct vduse_virtqueue { u16 index; u16 num_max; @@ -1680,6 +1690,9 @@ static bool features_is_valid(struct vduse_dev_config *config) if ((config->device_id == VIRTIO_ID_BLOCK) && (config->features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))) return false; + else if ((config->device_id == VIRTIO_ID_NET) && + (config->features & VDUSE_NET_INVALID_FEATURES_MASK)) + return false; return true; } -- 2.43.0
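The simpler alternative Maxime lands on would reduce the net-specific part of features_is_valid() to a single bit test; a sketch of that variant, not the code in this patch:

/* Reject only VIRTIO_NET_F_CTRL_VQ and rely on the spec's feature
 * dependencies to rule out everything that requires the control queue. */
if ((config->device_id == VIRTIO_ID_NET) &&
    (config->features & BIT_ULL(VIRTIO_NET_F_CTRL_VQ)))
        return false;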
Re: [PATCH v6 3/3] vduse: enable Virtio-net device type
On Thu, Jan 4, 2024 at 4:39 PM Maxime Coquelin wrote: > > This patch adds Virtio-net device type to the supported > devices types. > > Initialization fails if the device does not support > VIRTIO_F_VERSION_1 feature, in order to guarantee the > configuration space is read-only. It also fails with > -EPERM if the CAP_NET_ADMIN is missing. > > Signed-off-by: Maxime Coquelin Reviewed-by: Eugenio Pérez > --- > drivers/vdpa/vdpa_user/vduse_dev.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > b/drivers/vdpa/vdpa_user/vduse_dev.c > index 94f54ea2eb06..4057b34ff995 100644 > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > @@ -151,6 +151,7 @@ static struct workqueue_struct *vduse_irq_bound_wq; > > static u32 allowed_device_id[] = { > VIRTIO_ID_BLOCK, > + VIRTIO_ID_NET, > }; > > static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa) > @@ -1694,6 +1695,10 @@ static bool features_is_valid(struct vduse_dev_config > *config) > (config->features & VDUSE_NET_INVALID_FEATURES_MASK)) > return false; > > + if ((config->device_id == VIRTIO_ID_NET) && > + !(config->features & (1ULL << VIRTIO_F_VERSION_1))) > + return false; > + > return true; > } > > @@ -1801,6 +1806,10 @@ static int vduse_create_dev(struct vduse_dev_config > *config, > int ret; > struct vduse_dev *dev; > > + ret = -EPERM; > + if ((config->device_id == VIRTIO_ID_NET) && !capable(CAP_NET_ADMIN)) > + goto err; > + > ret = -EEXIST; > if (vduse_find_dev(config->name)) > goto err; > @@ -2044,6 +2053,7 @@ static const struct vdpa_mgmtdev_ops > vdpa_dev_mgmtdev_ops = { > > static struct virtio_device_id id_table[] = { > { VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID }, > + { VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID }, > { 0 }, > }; > > -- > 2.43.0 > >
Re: [PATCH v6 2/3] vduse: Temporarily fail if control queue features requested
On Fri, Jan 5, 2024 at 9:12 AM Maxime Coquelin wrote: > > > > On 1/5/24 03:45, Jason Wang wrote: > > On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin > > wrote: > >> > >> Virtio-net driver control queue implementation is not safe > >> when used with VDUSE. If the VDUSE application does not > >> reply to control queue messages, it currently ends up > >> hanging the kernel thread sending this command. > >> > >> Some work is on-going to make the control queue > >> implementation robust with VDUSE. Until it is completed, > >> let's fail features check if any control-queue related > >> feature is requested. > >> > >> Signed-off-by: Maxime Coquelin > >> --- > >> drivers/vdpa/vdpa_user/vduse_dev.c | 13 + > >> 1 file changed, 13 insertions(+) > >> > >> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > >> b/drivers/vdpa/vdpa_user/vduse_dev.c > >> index 0486ff672408..94f54ea2eb06 100644 > >> --- a/drivers/vdpa/vdpa_user/vduse_dev.c > >> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > >> @@ -28,6 +28,7 @@ > >> #include > >> #include > >> #include > >> +#include > >> #include > >> > >> #include "iova_domain.h" > >> @@ -46,6 +47,15 @@ > >> > >> #define IRQ_UNBOUND -1 > >> > >> +#define VDUSE_NET_INVALID_FEATURES_MASK \ > >> + (BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |\ > >> +BIT_ULL(VIRTIO_NET_F_CTRL_RX) | \ > >> +BIT_ULL(VIRTIO_NET_F_CTRL_VLAN) | \ > >> +BIT_ULL(VIRTIO_NET_F_GUEST_ANNOUNCE) | \ > >> +BIT_ULL(VIRTIO_NET_F_MQ) | \ > >> +BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) | \ > >> +BIT_ULL(VIRTIO_NET_F_RSS)) > > > > We need to make this as well: > > > > VIRTIO_NET_F_CTRL_GUEST_OFFLOADS > > I missed it, and see others have been added in the Virtio spec > repository (BTW, I see this specific one is missing in the dependency > list [0], I will submit a patch). > > I wonder if it is not just simpler to just check for > VIRTIO_NET_F_CTRL_VQ is requested. As we fail instead of masking out, > the VDUSE driver won't be the one violating the spec so it should be > good? > > It will avoid having to update the mask if new features depending on it > are added (or forgetting to update it). > > WDYT? > I think it is safer to work with a whitelist, instead of a blacklist. As any new feature might require code changes in QEMU. Is that possible? > Thanks, > Maxime > > [0]: > https://github.com/oasis-tcs/virtio-spec/blob/5fc35a7efb903fc352da81a6d2be5c01810b68d3/device-types/net/description.tex#L129 > > Other than this, > > > > Acked-by: Jason Wang > > > > Thanks > > > >> + > >> struct vduse_virtqueue { > >> u16 index; > >> u16 num_max; > >> @@ -1680,6 +1690,9 @@ static bool features_is_valid(struct > >> vduse_dev_config *config) > >> if ((config->device_id == VIRTIO_ID_BLOCK) && > >> (config->features & (1ULL << > >> VIRTIO_BLK_F_CONFIG_WCE))) > >> return false; > >> + else if ((config->device_id == VIRTIO_ID_NET) && > >> + (config->features & > >> VDUSE_NET_INVALID_FEATURES_MASK)) > >> + return false; > >> > >> return true; > >> } > >> -- > >> 2.43.0 > >> > > > >
[PATCH v10 2/2] tracing: Allow user-space mapping of the ring-buffer
Currently, user-space extracts data from the ring-buffer via splice, which is handy for storage or network sharing. However, due to splice limitations, it is impossible to do real-time analysis without a copy. A solution for that problem is to let the user-space map the ring-buffer directly. The mapping is exposed via the per-CPU file trace_pipe_raw. The first element of the mapping is the meta-page. It is followed by each subbuffer constituting the ring-buffer, ordered by their unique page ID: * Meta-page -- include/uapi/linux/trace_mmap.h for a description * Subbuf ID 0 * Subbuf ID 1 ... It is therefore easy to translate a subbuf ID into an offset in the mapping: reader_id = meta->reader->id; reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; When new data is available, the mapper must call a newly introduced ioctl: TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to point to the next reader containing unread data. Signed-off-by: Vincent Donnefort diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h index f950648b0ba9..8c49489c5867 100644 --- a/include/uapi/linux/trace_mmap.h +++ b/include/uapi/linux/trace_mmap.h @@ -26,4 +26,6 @@ struct trace_buffer_meta { __u32 meta_struct_len;/* Len of this struct */ }; +#define TRACE_MMAP_IOCTL_GET_READER _IO('T', 0x1) + #endif /* _UAPI_TRACE_MMAP_H_ */ diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 46dbe22121e9..cfeaf2cd204e 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -8583,15 +8583,31 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos, return ret; } -/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct ftrace_buffer_info *info = file->private_data; struct trace_iterator *iter = &info->iter; + int err; - if (cmd) - return -ENOIOCTLCMD; + if (cmd == TRACE_MMAP_IOCTL_GET_READER) { + if (!(file->f_flags & O_NONBLOCK)) { + err = ring_buffer_wait(iter->array_buffer->buffer, + iter->cpu_file, + iter->tr->buffer_percent); + if (err) + return err; + } + return ring_buffer_map_get_reader(iter->array_buffer->buffer, + iter->cpu_file); + } else if (cmd) { + return -ENOTTY; + } + + /* + * An ioctl call with cmd 0 to the ring buffer file will wake up all + * waiters + */ mutex_lock(&trace_types_lock); iter->wait_index++; @@ -8604,6 +8620,62 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned return 0; } +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf) +{ + struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + vm_fault_t ret = VM_FAULT_SIGBUS; + struct page *page; + + page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file, +vmf->pgoff); + if (!page) + return ret; + + get_page(page); + vmf->page = page; + vmf->page->mapping = vmf->vma->vm_file->f_mapping; + vmf->page->index = vmf->pgoff; + + return 0; +} + +static void tracing_buffers_mmap_close(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + + ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file); +} + +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = &info->iter; + + WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file)); +} + 
+static const struct vm_operations_struct tracing_buffers_vmops = { + .open = tracing_buffers_mmap_open, + .close = tracing_buffers_mmap_close, + .fault = tracing_buffers_mmap_fault, +}; + +static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = filp->private_data; + struct trace_iterator *iter = &info->iter; + + if (vma->vm_flags & VM_WRITE) + return -EPERM; + + vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE); + vma->vm_ops = &tracing_buffers_vmops; + + return ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file); +} + static const struct file_operations tracing_buffers_fops = { .open = tracing_buffers_open, .rea
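For completeness, here is a user-space sketch of the flow the commit message describes: map the meta-page, size the full mapping, then use TRACE_MMAP_IOCTL_GET_READER and the reader ID to locate unread data. Error handling is omitted and the include path assumes the uapi header is installed; this is an illustration, not code from the series.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/trace_mmap.h>

int main(void)
{
        int fd = open("/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw",
                      O_RDONLY | O_NONBLOCK);
        struct trace_buffer_meta *meta;
        size_t len;
        void *map;

        /* Map the meta-page alone first, to learn the full mapping size */
        meta = mmap(NULL, getpagesize(), PROT_READ, MAP_SHARED, fd, 0);
        len = meta->meta_page_size + (size_t)meta->nr_subbufs * meta->subbuf_size;
        munmap(meta, getpagesize());

        /* Map the meta-page followed by all the subbufs in one go */
        map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        meta = map;

        /* Move the reader to the next subbuf containing unread data */
        ioctl(fd, TRACE_MMAP_IOCTL_GET_READER);

        /* Translate the reader subbuf ID into an offset, as described above */
        printf("reader subbuf at offset %zu, %u bytes read\n",
               (size_t)meta->meta_page_size +
               (size_t)meta->reader.id * meta->subbuf_size,
               meta->reader.read);

        munmap(map, len);
        close(fd);
        return 0;
}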
[PATCH v10 1/2] ring-buffer: Introducing ring-buffer mapping functions
In preparation for allowing user-space to map a ring-buffer, add a set
of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, as they are currently
  only identified through their in-kernel VA.

  A meta-page, where ring-buffer statistics and a description of the
  current reader are stored.

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered by their unique ID, assigned during the first
mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Signed-off-by: Vincent Donnefort

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include <linux/seq_file.h>
 #include <linux/poll.h>
 
+#include <uapi/linux/trace_mmap.h>
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);
 #define trace_rb_cpu_prepare	NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+				   unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index 000000000000..f950648b0ba9
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_TRACE_MMAP_H_
+#define _UAPI_TRACE_MMAP_H_
+
+#include <linux/types.h>
+
+struct trace_buffer_meta {
+	unsigned long	entries;
+	unsigned long	overrun;
+	unsigned long	read;
+
+	unsigned long	subbufs_touched;
+	unsigned long	subbufs_lost;
+	unsigned long	subbufs_read;
+
+	struct {
+		unsigned long	lost_events;	/* Events lost at the time of the reader swap */
+		__u32		id;		/* Reader subbuf ID from 0 to nr_subbufs - 1 */
+		__u32		read;		/* Number of bytes read on the reader subbuf */
+	} reader;
+
+	__u32	subbuf_size;	/* Size of each subbuf including the header */
+	__u32	nr_subbufs;	/* Number of subbufs in the ring-buffer */
+
+	__u32	meta_page_size;		/* Size of the meta-page */
+	__u32	meta_struct_len;	/* Len of this struct */
+};
+
+#endif /* _UAPI_TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 173d2595ce2d..63426e525781 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -338,6 +338,7 @@ struct buffer_page {
 	local_t		entries;	/* entries on this page */
 	unsigned long	real_end;	/* real end of data */
 	unsigned	order;		/* order of the page */
+	u32		id;		/* ID for external mapping */
 	struct buffer_data_page *page;	/* Actual data page */
 };
 
@@ -388,6 +389,7 @@ struct rb_irq_work {
 	bool	waiters_pending;
 	bool	full_waiters_pending;
 	bool	wakeup_full;
+	bool	is_cpu_buffer;
 };
 
@@ -484,6 +486,12 @@ struct ring_buffer_per_cpu {
 	u64	read_stamp;
 	/* pages removed since last reset */
 	unsigned long	pages_removed;
+
+	int	mapped;
+	struct mutex	mapping_lock;
+	unsigned long	*subbuf_ids;	/* ID to addr */
+	struct trace_buffer_meta	*meta_page;
+
 	/* ring buffer pages to update, > 0 to add, < 0 to remove */
 	long	nr_pages_to_update;
 	struct list_head	new_pages; /* new pages to add */
@@ -739,6 +747,33 @@ static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int full)
 	return (dirty * 100) > (full * nr_pages);
 }
 
+static void rb_update_meta_page(struct rb_irq_work *rbwork)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!rbwork->is_cpu_buffer)
+		return;
+	/*
+	 * If the waiter is a cpu_buffer, this might be due to a
+	 * userspace mapping. Let's update the meta-page.
+	 */
+	cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu,
+
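[Editor's note] To make the meta-page fields above concrete, the following user-space helpers show how a mapper would typically use them. This is an illustrative sketch only, assuming the trace_buffer_meta layout from this patch; the helper names are made up here.

#include <stddef.h>
#include <linux/trace_mmap.h>

/* Subbufs follow the meta-page in the mapping, ordered by their ID. */
static const char *subbuf_addr(const struct trace_buffer_meta *meta,
			       const char *map, __u32 id)
{
	return map + meta->meta_page_size + (size_t)id * meta->subbuf_size;
}

/*
 * Start of the not-yet-consumed data in the current reader subbuf:
 * per the struct comments, meta->reader.read bytes of it have already
 * been read.
 */
static const char *reader_unread(const struct trace_buffer_meta *meta,
				 const char *map)
{
	return subbuf_addr(meta, map, meta->reader.id) + meta->reader.read;
}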
[PATCH v10 0/2] Introducing trace buffer mapping by user-space
The tracing ring-buffers can be stored on disk or sent to network
without any copy via splice. However, the latter doesn't allow
real-time processing of the traces. A solution is to give user-space
direct access to the ring-buffer pages via a mapping. An application
can now become a consumer of the ring-buffer, in a similar fashion to
what trace_pipe offers.

Support for this new feature in libtracefs can be found here:

  https://lore.kernel.org/all/20231228201100.78aae...@rorschach.local.home

Vincent

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
    in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
      - pages_lost / pages_read (allow to compute how full is the
        ring-buffer)
      - read (allow to compute how many entries can be read)
      - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
    rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
    ** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.

---

Vincent Donnefort (2):
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Allow user-space mapping of the ring-buffer

 include/linux/ring_buffer.h     |   7 +
 include/uapi/linux/trace_mmap.h |  31 +++
 kernel/trace/ring_buffer.c      | 384 +++++++++++++++++++++++++++++++-
 kernel/trace/trace.c            |  79 ++++++-
 4 files changed, 497 insertions(+), 4 deletions(-)
 create mode 100644 include/uapi/linux/trace_mmap.h

base-commit: 3cb3091138ca0921c4569bcf7ffa062519639b6a
-- 
2.43.0.472.g3155946c3a-goog
Re: [PATCH v6 01/12] cgroup/misc: Add per resource callbacks for CSS events
On Mon, Oct 30, 2023 at 11:20:02AM -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi
>
> The misc cgroup controller (subsystem) currently does not perform
> resource type specific action for Cgroups Subsystem State (CSS)
> events: the 'css_alloc' event when a cgroup is created and the
> 'css_free' event when a cgroup is destroyed.
>
> Define callbacks for those events and allow resource providers to
> register the callbacks per resource type as needed. This will be
> utilized later by the EPC misc cgroup support implemented in the SGX
> driver.

I notice now that the callbacks are per resource and per cgroup. I
think that is an overkill that complicates misc_cg allocation code with
the parent's callbacks while freeing is done by a cgroup's own
callbacks. It's possibly too liberal to keep objects' lifecycle
understandable in the long term too.

For some uniformity, I'd suggest the struct blkcg_policy array in
block/blk-cgroup.c as the precedent. (Perhaps indexed with the static
enum misc_res_type instead of the dynamic index assignment that
blkcg_policy_register() does.)

I hope you don't really need per-cgroup callbacks :-)

Michal
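[Editor's note] One way to read this suggestion in code is the following hypothetical sketch of the blkcg_policy-style alternative. It is not the posted patch; struct misc_res_ops, misc_cg_set_ops() and misc_cg_alloc_resources() are invented names for illustration.

/* A single static per-resource ops table, indexed by misc_res_type. */
struct misc_res_ops {
	int (*alloc)(struct misc_cg *cg);
	void (*free)(struct misc_cg *cg);
};

static const struct misc_res_ops *misc_res_ops_table[MISC_CG_RES_TYPES];

int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
{
	if (type >= MISC_CG_RES_TYPES)
		return -EINVAL;

	misc_res_ops_table[type] = ops;
	return 0;
}

/* css_alloc would then walk the table once, with no per-cgroup pointers: */
static int misc_cg_alloc_resources(struct misc_cg *cg)
{
	enum misc_res_type i;
	int ret;

	for (i = 0; i < MISC_CG_RES_TYPES; i++) {
		if (misc_res_ops_table[i] && misc_res_ops_table[i]->alloc) {
			ret = misc_res_ops_table[i]->alloc(cg);
			if (ret)
				return ret;
		}
	}

	return 0;
}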
Re: [PATCH v6 1/3] vduse: validate block features only with block devices
On Thu, Jan 4, 2024 at 4:38 PM Maxime Coquelin wrote:
>
> This patch is preliminary work to enable network device
> type support to VDUSE.
>
> As VIRTIO_BLK_F_CONFIG_WCE shares the same value as
> VIRTIO_NET_F_HOST_TSO4, we need to restrict its check
> to Virtio-blk device type.
>
> Acked-by: Jason Wang
> Reviewed-by: Xie Yongji
> Signed-off-by: Maxime Coquelin

Reviewed-by: Eugenio Pérez

> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 0ddd4b8abecb..0486ff672408 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1671,13 +1671,14 @@ static bool device_is_allowed(u32 device_id)
>  	return false;
>  }
>
> -static bool features_is_valid(u64 features)
> +static bool features_is_valid(struct vduse_dev_config *config)
>  {
> -	if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> +	if (!(config->features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
>  		return false;
>
>  	/* Now we only support read-only configuration space */
> -	if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
> +	if ((config->device_id == VIRTIO_ID_BLOCK) &&
> +	    (config->features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE)))
>  		return false;
>
>  	return true;
> @@ -1704,7 +1705,7 @@ static bool vduse_validate_config(struct vduse_dev_config *config)
>  	if (!device_is_allowed(config->device_id))
>  		return false;
>
> -	if (!features_is_valid(config->features))
> +	if (!features_is_valid(config))
>  		return false;
>
>  	return true;
> --
> 2.43.0
>
>
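[Editor's note] The collision the commit message refers to is easy to check in the virtio UAPI headers, where both feature bits carry the value 11:

/* include/uapi/linux/virtio_blk.h */
#define VIRTIO_BLK_F_CONFIG_WCE	11	/* Writeback mode available in config */

/* include/uapi/linux/virtio_net.h */
#define VIRTIO_NET_F_HOST_TSO4	11	/* Host can handle TSOv4 in. */

So (features & (1ULL << 11)) is a legitimate request for a net device and must only be rejected when config->device_id == VIRTIO_ID_BLOCK, which is exactly what the patch does.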
Re: [PATCH v6 2/3] vduse: Temporarily fail if control queue features requested
On 1/5/24 03:45, Jason Wang wrote:
> On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin wrote:
>>
>> Virtio-net driver control queue implementation is not safe when used
>> with VDUSE. If the VDUSE application does not reply to control queue
>> messages, it currently ends up hanging the kernel thread sending this
>> command.
>>
>> Some work is on-going to make the control queue implementation robust
>> with VDUSE. Until it is completed, let's fail the features check if
>> any control-queue related feature is requested.
>>
>> Signed-off-by: Maxime Coquelin
>> ---
>>  drivers/vdpa/vdpa_user/vduse_dev.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>> index 0486ff672408..94f54ea2eb06 100644
>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>> @@ -28,6 +28,7 @@
>>  #include <uapi/linux/virtio_config.h>
>>  #include <uapi/linux/virtio_ids.h>
>>  #include <uapi/linux/virtio_blk.h>
>> +#include <uapi/linux/virtio_net.h>
>>  #include <linux/mod_devicetable.h>
>>
>>  #include "iova_domain.h"
>> @@ -46,6 +47,15 @@
>>
>>  #define IRQ_UNBOUND -1
>>
>> +#define VDUSE_NET_INVALID_FEATURES_MASK \
>> +	(BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |	\
>> +	 BIT_ULL(VIRTIO_NET_F_CTRL_RX) |	\
>> +	 BIT_ULL(VIRTIO_NET_F_CTRL_VLAN) |	\
>> +	 BIT_ULL(VIRTIO_NET_F_GUEST_ANNOUNCE) | \
>> +	 BIT_ULL(VIRTIO_NET_F_MQ) |		\
>> +	 BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) |	\
>> +	 BIT_ULL(VIRTIO_NET_F_RSS))
>
> We need to make this as well: VIRTIO_NET_F_CTRL_GUEST_OFFLOADS

I missed it, and see others have been added in the Virtio spec
repository (BTW, I see this specific one is missing in the dependency
list [0], I will submit a patch).

I wonder if it is not just simpler to check whether
VIRTIO_NET_F_CTRL_VQ is requested. As we fail instead of masking out,
the VDUSE driver won't be the one violating the spec, so it should be
good?

It will avoid having to update the mask if new features depending on it
are added (or forgetting to update it).

WDYT?

Thanks,
Maxime

[0]: https://github.com/oasis-tcs/virtio-spec/blob/5fc35a7efb903fc352da81a6d2be5c01810b68d3/device-types/net/description.tex#L129

> Other than this,
>
> Acked-by: Jason Wang
>
> Thanks

>> +
>>  struct vduse_virtqueue {
>>  	u16 index;
>>  	u16 num_max;
>> @@ -1680,6 +1690,9 @@ static bool features_is_valid(struct vduse_dev_config *config)
>>  	if ((config->device_id == VIRTIO_ID_BLOCK) &&
>>  	    (config->features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE)))
>>  		return false;
>> +	else if ((config->device_id == VIRTIO_ID_NET) &&
>> +		 (config->features & VDUSE_NET_INVALID_FEATURES_MASK))
>> +		return false;
>>
>>  	return true;
>>  }
>> --
>> 2.43.0
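[Editor's note] The simplification Maxime floats above would reduce the net-specific check to a single bit. A sketch of how it could look (an assumption, not the code posted in this series; it relies on the spec making every other control-queue feature depend on VIRTIO_NET_F_CTRL_VQ):

	else if ((config->device_id == VIRTIO_ID_NET) &&
		 (config->features & BIT_ULL(VIRTIO_NET_F_CTRL_VQ)))
		/*
		 * All other control-queue features depend on CTRL_VQ, so
		 * failing on this single bit also covers features added
		 * to the spec later.
		 */
		return false;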