Re: [PATCHv2 2/4] coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()
On 10/23/20 2:16 PM, Peter Zijlstra wrote: On Fri, Oct 23, 2020 at 01:56:47PM +0100, Suzuki Poulose wrote: On 10/23/20 11:54 AM, Peter Zijlstra wrote: I think I'm more confused now :-/ Where do we use ->owner after event creation? The moment you create your eventN you create the link to sink0. That link either succeeds (same 'cookie') or fails. The event->sink link is established at creation. At event::add(), we check the sink is free (i.e, it is inactive) or is used by an event of the same session (this is where the owner field *was* required. But this is not needed anymore, as we cache the "owner" read pid in the handle->rb->aux_priv for each event and this is compared against the pid from the handle currently driving the hardware) *groan*.. that's going to be a mess with sinks that are shared between CPUs :/ I'm also not seeing why exactly we need ->owner in the first place. Suppose we make the sink0 device return -EBUSY on open() when it is active. Then a perf session can open the sink0 device, create perf events and attach them to the sink0 device using perf_event_attr::config2. The events will attach to sink0 and increment its usage count, such that any further open() will fail. Thats where we are diverging. The sink device doesn't have any fops. It is all managed by the coresight driver transparent to the perf tool. All the perf tool does is, specifying which sink to use (btw, we now have automatic sink selection support which gets rid of this, and uses the best possible sink e.g, in case of per-CPU sinks). per-CPU sinks sounds a lot better. I'm really not convinced it makes sense to do what you do with shared sinks though. You'll loose random parts of the execution trace because of what the other CPUs do. The ETM trace protocol has in built TraceID to distinguish the packets and thus we could decode the trace streams from the shared buffer. [ But, we don't have buffer overflow interrupts (I am keeping the lid closed on that can, for the sake of keeping sanity ;-) ), and thus any shared session could easily loose data unless we tune the AUX buffer size to a really large buffer ]. Full exclusive sink access is far more deterministic. Once the events are created, the perf tool close()s the sink0 device, which is now will in-use by the events. No other events can be attached to it. Or are you doing the event->sink mapping every time you do: pmu::add()? That sounds insane. Sink is already mapped at event create. But yes, the refcount on the sink is managed at start/stop. Thats when we need to make sure that the event being scheduled belongs to the same owner as the one already driving the sink. pmu::add() I might hope, because pmu::start() is not allowed to fail. Right. If we can't get the sink, we simply truncate the buffer. That way another session could use the same sink if it is free. i.e perf record -e cs_etm/@sink0/u --per-thread app1 and perf record -e cs_etm/@sink0/u --per-thread app2 both can work as long as the sink is not used by the other session. Like said above, if sink is shared between CPUs, that's going to be a trainwreck :/ Why do you want that? That ship has sailed. That is how the current generation of systems are, unfortunately. But as I said, this is changing and there are guidelines in place to avoid these kind of topologies. With the future technologies, this will be completely gone. And once you have per-CPU sinks like mentioned above, the whole problem goes away. True, until then, this is the best we could do. Suzuki
Re: [PATCHv2 2/4] coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()
On 10/23/20 11:54 AM, Peter Zijlstra wrote: On Fri, Oct 23, 2020 at 11:34:32AM +0100, Suzuki Poulose wrote: On 10/23/20 10:41 AM, Peter Zijlstra wrote: On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote: On 10/23/20 8:39 AM, Peter Zijlstra wrote: So then I don't understand the !->owner issue, that only happens when the task dies, which cannot be concurrent with event creation. Are you Part of the patch from Sai, fixes this by avoiding the dereferencing after event creation (by caching it). But the kernel events needs fixing. I'm fundamentally failing here. Creating a link to the sink is strictly event-creation time. Why would you ever need it again later? Later you already have the sink setup. Sorry for the lack of clarity here, and you are not alone unless you have drowned in the CoreSight topologies ;-) Typically current generation of systems have the following topology : CPU0 etm0 \ \ / \ CPU1/\ etm1\ \ /--- sink0 CPU2 / etm2 \/ \ / / CPU3/ etm3 i.e, Multiple ETMs share a sink. [for the sake of simplicity, I have used one sink. Even though there could be potential sinks (of different types), none of them are private to the ETMs. So, in a nutshell, a sink can be reached by multiple ETMs. ] Now, for a session : perf record -e cs_etm/sinkid=sink0/u workload We create an event per CPU (say eventN, which are scheduled based on the threads that could execute on the CPU. At this point we have finalized the sink0, and have allocated necessary buffer for the sink0. Now, when the threads are scheduled on the CPUs, we start the appropriate events for the CPUs. e.g, CPU0 sched -> workload:0 - > etm0->event0_start -> Turns all the components upto sink0, starting the trace collection in the buffer. Now, if another CPU, CPU1 starts tracing event1 for workload:1 thread, it will eventually try to turn ON the sink0.Since sink0 is already active tracing event0, we could allow this to go through and collect the trace in the *same hardware buffer* (which can be demuxed from the single AUX record using the TraceID in the packets). Please note that we do double buffering and hardware buffer is copied only when the sink0 is stopped (see below). But, if the event scheduled on CPU1 doesn't belong to the above session, but belongs to different perf session (say, perf record -e cs_etm/sinkid=sink0/u benchmark), we can't allow this to succeed and mix the trace data in to that of workload and thus fail the operation. In a nutshell, since the sinks are shared, we start the sink on the first event and keeps sharing the sink buffer with any event that belongs to the same session (using refcounts). The sink is only released for other sessions, when there are no more events in the session tracing on any of the ETMs. I know this is fundamentally a topology issue, but that is not something we can fix. But the situation is changing and we are starting to see systems with per-CPU sinks. Hope this helps. I think I'm more confused now :-/ Where do we use ->owner after event creation? The moment you create your eventN you create the link to sink0. That link either succeeds (same 'cookie') or fails. The event->sink link is established at creation. At event::add(), we check the sink is free (i.e, it is inactive) or is used by an event of the same session (this is where the owner field *was* required. But this is not needed anymore, as we cache the "owner" read pid in the handle->rb->aux_priv for each event and this is compared against the pid from the handle currently driving the hardware) If it fails, event creation fails, the end. On success, we have the sink pointer in our event and we never ever need to look at ->owner ever again. Correct, we don't need it after the event creation "one part of the patch" as explained above. I'm also not seeing why exactly we need ->owner in the first place. Suppose we make the sink0 device return -EBUSY on open() when it is active. Then a perf session can open the sink0 device, create perf events and attach them to the sink0 device using perf_event_attr::config2. The events will attach to sink0 and increment its usage count, such that any further open() will fail. Thats where we are diverging. The sink device doesn't have any fops. It is all managed by the coresight driver transparent to the perf tool. All the perf tool does is, specifying which sink to use (btw, we now have automatic sink selection support which gets rid of this, and uses the best possible sink e.g, in case of per-CPU sinks). Once the events are created, the perf tool close()s the sink0 device, which is now will in-use by the events. No other events can be attached to
Re: [PATCHv2 2/4] coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()
On 10/23/20 10:23 AM, Peter Zijlstra wrote: On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote: On 10/23/20 8:39 AM, Peter Zijlstra wrote: So then I don't understand the !->owner issue, that only happens when the task dies, which cannot be concurrent with event creation. Are you Part of the patch from Sai, fixes this by avoiding the dereferencing after event creation (by caching it). But the kernel events needs fixing. One follow up question on the !->owner issue. Given the ->owner is dying, does it prevent events from being scheduled ? Or is there a delay between that and eventually stopping the events. In this case, we hit the issue when : A A or B ? event_start() ... event->owner = NULL READ_ONCE(event->owner); Is this expected ? Yeah, teardown is a bit of an effort. Also, you can pass an fd over a unix socket to another process, so this isn't something you can rely on in any case. The perf tool doesn't do it, but the kernel infra should be able to deal with someone doing a perf-deamon of sorts, where you can request a perf event and recieve a fd from it. Imagine the fun ;-) As for the kernel events.. why do you care about the actual task_struct * in there? I see you're using it to grab the task-pid, but how is that useful? Correct, kernel events are something that the driver didn't account for. May be we could handle this case with a "special pid" and simply disallow sharing (which is fine I believe, given there are not grouping for the kernel created events). Why do you need a pid in the first place? Can't you use the "task_struct *" as a value? We could. But, without a refcount on the task pointer, that could be tricky, even though we don't dereference it. In the same situation, if the tsk owner dies and is freed and is reallocated to a new perf session task but with different PID, we could be mixing things up again ? Special pid here could be -1.
Re: [PATCH v3] coresight: etm4x: Modify core-commit of cpu to avoid the overflow of HiSilicon ETM
On 10/23/20 11:23 AM, Qi Liu wrote: The ETM device can't keep up with the core pipeline when cpu core is at full speed. This may cause overflow within core and its ETM. This is a common phenomenon on ETM devices. On HiSilicon Hip08 platform, a specific feature is added to set core pipeline. So commit rate can be reduced manually to avoid ETM overflow. Signed-off-by: Qi Liu --- Change since v1: - add CONFIG_ETM4X_IMPDEF_FEATURE and CONFIG_ETM4X_IMPDEF_HISILICON to keep specific feature off platforms which don't use it. Change since v2: - remove some unused variable. drivers/hwtracing/coresight/Kconfig| 18 drivers/hwtracing/coresight/coresight-etm4x-core.c | 50 ++ 2 files changed, 68 insertions(+) diff --git a/drivers/hwtracing/coresight/Kconfig b/drivers/hwtracing/coresight/Kconfig index c119824..9665d70 100644 --- a/drivers/hwtracing/coresight/Kconfig +++ b/drivers/hwtracing/coresight/Kconfig @@ -110,6 +110,24 @@ config CORESIGHT_SOURCE_ETM4X To compile this driver as a module, choose M here: the module will be called coresight-etm4x. +config ETM4X_IMPDEF_FEATURE + bool "Control overflow impdef support in CoreSight ETM 4.x driver " + depends on CORESIGHT_SOURCE_ETM4X + help + This control provides overflow implement define for CoreSight + ETM 4.x tracer module which could not reduce commit race + automatically, and could avoid overflow within ETM tracer module + and its cpu core. + +config ETM4X_IMPDEF_HISILICON + bool "Control overflow impdef support in HiSilicon ETM 4.x driver " + depends on ETM4X_IMPDEF_FEATURE + help + This control provides overflow implement define for HiSilicon + ETM 4.x tracer module of Hip08 platform. Overflow within ETM + tracer module and its cpu core can be avoided by reducing core + commit manually. + config CORESIGHT_STM tristate "CoreSight System Trace Macrocell driver" depends on (ARM && !(CPU_32v3 || CPU_32v4 || CPU_32v4T)) || ARM64 diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c index abd706b..35f4333 100644 --- a/drivers/hwtracing/coresight/coresight-etm4x-core.c +++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c @@ -103,12 +103,61 @@ struct etm4_enable_arg { int rc; }; +#ifdef CONFIG_ETM4X_IMPDEF_FEATURE + +#ifdef CONFIG_ETM4X_IMPDEF_HISILICON + +#define HISI_HIP08_CORE_COMMIT_CLEAR 0x3000 +#define HISI_HIP08_CORE_COMMIT_SHIFT 12 +static void etm4_hisi_config_core_commit(int flag) nit: s/int flag/bool enable ? +{ + u64 val; + + asm volatile("mrs %0,s3_1_c15_c2_5" : "=r"(val)); + val &= ~HISI_HIP08_CORE_COMMIT_CLEAR; + val |= flag << HISI_HIP08_CORE_COMMIT_SHIFT; + asm volatile("msr s3_1_c15_c2_5,%0" : : "r"(val)); Sorry for missing this out. We don't encourage the above encodings as it will break on legacy toolchain. So, please could you follow what we do normally, by using sys_reg() macros to define the encoding and use read/write_sysreg_s() instead. See arch/arm64/include/asm/sysreg.h +} +#else +static void etm4_hisi_config_core_commit(int flag) +{ +} +#endif /* CONFIG_ETM4X_IMPDEF_HISILICON */ + +static void etm4_enable_arch_specific(void) +{ + /* +* If ETM device is HiSilicon ETM device, reduce the +* core-commit to avoid ETM overflow. +*/ + etm4_hisi_config_core_commit(1); +} + +static void etm4_disable_arch_specific(void) +{ + /* +* If ETM device is HiSilicon ETM device, resume the +* core-commit after ETM trace is complete. +*/ + etm4_hisi_config_core_commit(0); +} +#else +static void etm4_enable_arch_specific(void) +{ +} + +static void etm4_disable_arch_specific(void) +{ +} +#endif /* CONFIG_ETM4X_IMPDEF_FEATURE */ + static int etm4_enable_hw(struct etmv4_drvdata *drvdata) { int i, rc; struct etmv4_config *config = &drvdata->config; struct device *etm_dev = &drvdata->csdev->dev; + etm4_enable_arch_specific(); CS_UNLOCK(drvdata->base); Please could we move the enable_arch_specific() after the CS_UNLOCK ? That way any potential code could do something with the ETM4x registers. etm4_os_unlock(drvdata); @@ -475,6 +524,7 @@ static void etm4_disable_hw(void *info) struct device *etm_dev = &drvdata->csdev->dev; int i; + etm4_disable_arch_specific(); CS_UNLOCK(drvdata->base); Same here Suzuki
Re: [PATCHv2 2/4] coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()
On 10/23/20 10:41 AM, Peter Zijlstra wrote: On Fri, Oct 23, 2020 at 09:49:53AM +0100, Suzuki Poulose wrote: On 10/23/20 8:39 AM, Peter Zijlstra wrote: So then I don't understand the !->owner issue, that only happens when the task dies, which cannot be concurrent with event creation. Are you Part of the patch from Sai, fixes this by avoiding the dereferencing after event creation (by caching it). But the kernel events needs fixing. I'm fundamentally failing here. Creating a link to the sink is strictly event-creation time. Why would you ever need it again later? Later you already have the sink setup. Sorry for the lack of clarity here, and you are not alone unless you have drowned in the CoreSight topologies ;-) Typically current generation of systems have the following topology : CPU0 etm0 \ \ / \ CPU1/\ etm1\ \ /--- sink0 CPU2 / etm2 \/ \ / / CPU3/ etm3 i.e, Multiple ETMs share a sink. [for the sake of simplicity, I have used one sink. Even though there could be potential sinks (of different types), none of them are private to the ETMs. So, in a nutshell, a sink can be reached by multiple ETMs. ] Now, for a session : perf record -e cs_etm/sinkid=sink0/u workload We create an event per CPU (say eventN, which are scheduled based on the threads that could execute on the CPU. At this point we have finalized the sink0, and have allocated necessary buffer for the sink0. Now, when the threads are scheduled on the CPUs, we start the appropriate events for the CPUs. e.g, CPU0 sched -> workload:0 - > etm0->event0_start -> Turns all the components upto sink0, starting the trace collection in the buffer. Now, if another CPU, CPU1 starts tracing event1 for workload:1 thread, it will eventually try to turn ON the sink0.Since sink0 is already active tracing event0, we could allow this to go through and collect the trace in the *same hardware buffer* (which can be demuxed from the single AUX record using the TraceID in the packets). Please note that we do double buffering and hardware buffer is copied only when the sink0 is stopped (see below). But, if the event scheduled on CPU1 doesn't belong to the above session, but belongs to different perf session (say, perf record -e cs_etm/sinkid=sink0/u benchmark), we can't allow this to succeed and mix the trace data in to that of workload and thus fail the operation. In a nutshell, since the sinks are shared, we start the sink on the first event and keeps sharing the sink buffer with any event that belongs to the same session (using refcounts). The sink is only released for other sessions, when there are no more events in the session tracing on any of the ETMs. I know this is fundamentally a topology issue, but that is not something we can fix. But the situation is changing and we are starting to see systems with per-CPU sinks. Hope this helps. Suzuki
Re: [PATCHv2 2/4] coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()
Hi Peter On 10/23/20 8:39 AM, Peter Zijlstra wrote: On Thu, Oct 22, 2020 at 03:20:33PM -0600, Mathieu Poirier wrote: Suzuki's depiction of the usecase is accurate. Using the pid of the process that created the events comes out of a discussion you and I had in the common area by the Intel booth at ELC in Edinburgh in the fall of 2018. At the time I exposed the problem of having multiple events sharing the same HW resources and you advised to proceed this way. Bah, I was afraid of that. I desperately tried to find correspondence on it, but alas, verbal crap doesn't end up in the Sent folder :-/ That being said it is plausible that I did not expressed myself clearly enough for you to understand the full extend of the problem. If that is the case we are more than willing to revisit that solution. Do you see a better option than what has currently been implemented? Moo... that really could've done with a comment I suppose. So then I don't understand the !->owner issue, that only happens when the task dies, which cannot be concurrent with event creation. Are you Part of the patch from Sai, fixes this by avoiding the dereferencing after event creation (by caching it). But the kernel events needs fixing. One follow up question on the !->owner issue. Given the ->owner is dying, does it prevent events from being scheduled ? Or is there a delay between that and eventually stopping the events. In this case, we hit the issue when : A A or B ? event_start() ... event->owner = NULL READ_ONCE(event->owner); Is this expected ? somehow accessing ->owner later? As for the kernel events.. why do you care about the actual task_struct * in there? I see you're using it to grab the task-pid, but how is that useful? Correct, kernel events are something that the driver didn't account for. May be we could handle this case with a "special pid" and simply disallow sharing (which is fine I believe, given there are not grouping for the kernel created events). Suzuki
Re: [PATCHv2 2/4] coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()
On 10/22/20 4:06 PM, Peter Zijlstra wrote: On Thu, Oct 22, 2020 at 02:30:21PM +0100, Suzuki Poulose wrote: On 10/22/20 12:32 PM, Peter Zijlstra wrote: On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote: Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace. @@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf; + struct task_struct *task = READ_ONCE(event->owner); + + if (!task || is_kernel_event(event)) + return NULL; This is *wrong*... why do you care about who owns the events? This is due to the special case of the CoreSight configuration, where a "sink" (where the trace data is captured) is shared by multiple Trace units. So, we could share the "sink" for multiple trace units if they are tracing the events that belong to the same "perf" session. (The userspace tool could decode the trace data based on the TraceID in the trace packets). Is there a better way to do this ? I thought we added sink identification through perf_event_attr::config2 ? Correct. attr:config2 identifies the "sink" for the collection. But, that doesn't solve the problem we have here. If two separate perf sessions use the "same sink", we don't want to mix the trace data into the same sink for events from different sessions. Thus, we need a way to check if a new event starting the tracing on an ETM belongs to the same session as the one already pumping the trace into the sink. We use event->owner pid for this check and thats where we encountered a NULL event->owner. Looking at the code further, we identified that kernel events could also trigger this issue. Suzuki
Re: [PATCHv2 2/4] coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()
On 10/22/20 12:32 PM, Peter Zijlstra wrote: On Thu, Oct 22, 2020 at 04:27:52PM +0530, Sai Prakash Ranjan wrote: Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace. @@ -391,6 +392,10 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf; + struct task_struct *task = READ_ONCE(event->owner); + + if (!task || is_kernel_event(event)) + return NULL; This is *wrong*... why do you care about who owns the events? This is due to the special case of the CoreSight configuration, where a "sink" (where the trace data is captured) is shared by multiple Trace units. So, we could share the "sink" for multiple trace units if they are tracing the events that belong to the same "perf" session. (The userspace tool could decode the trace data based on the TraceID in the trace packets). Is there a better way to do this ? Suzuki
Re: [PATCH 1/2] coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()
On 10/22/20 12:07 PM, Sai Prakash Ranjan wrote: On 2020-10-22 14:57, Suzuki Poulose wrote: On 10/22/20 9:02 AM, Sai Prakash Ranjan wrote: On 2020-10-21 15:38, Suzuki Poulose wrote: On 10/21/20 8:29 AM, Sai Prakash Ranjan wrote: On 2020-10-20 21:40, Sai Prakash Ranjan wrote: On 2020-10-14 21:29, Sai Prakash Ranjan wrote: On 2020-10-14 18:46, Suzuki K Poulose wrote: On 10/14/2020 10:36 AM, Sai Prakash Ranjan wrote: On 2020-10-13 22:05, Suzuki K Poulose wrote: On 10/07/2020 02:00 PM, Sai Prakash Ranjan wrote: There was a report of NULL pointer dereference in ETF enable path for perf CS mode with PID monitoring. It is almost 100% reproducible when the process to monitor is something very active such as chrome and with ETF as the sink and not ETR. Currently in a bid to find the pid, the owner is dereferenced via task_pid_nr() call in tmc_enable_etf_sink_perf() and with owner being NULL, we get a NULL pointer dereference. Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace. Owner(task) is NULL even in the case of ETR in tmc_enable_etr_sink_perf(), but since we cache the PID in alloc_buffer() callback and it is done as part of etm_setup_aux() when allocating buffer for ETR sink, we never dereference this NULL pointer and we are safe. So lets do the The patch is necessary to fix some of the issues. But I feel it is not complete. Why is it safe earlier and not later ? I believe we are simply reducing the chances of hitting the issue, by doing this earlier than later. I would say we better fix all instances to make sure that the event->owner is valid. (e.g, I can see that the for kernel events event->owner == -1 ?) struct task_struct *tsk = READ_ONCE(event->owner); if (!tsk || is_kernel_event(event)) /* skip ? */ Looking at it some more, is_kernel_event() is not exposed outside events core and probably for good reason. Why do we need to check for this and not just tsk? Because the event->owner could be : = NULL = -1UL // kernel event = valid. Yes I understood that part, but here we were trying to fix the NULL pointer dereference right and hence the question as to why we need to check for kernel events? I am no expert in perf but I don't see anywhere in the kernel checking for is_kernel_event(), so I am a bit skeptical if exporting that is actually right or not. I have stress tested with the original patch many times now, i.e., without a check for event->owner and is_kernel_event() and didn't observe any crash. Plus on ETR where this was already done, no crashes were reported till date and with ETF, the issue was quickly reproducible, so I am fairly confident that this doesn't just delay the original issue but actually fixes it. I will run an overnight test again to confirm this. I ran the overnight test which collected aroung 4G data(see below), with the following small change to see if the two cases (event->owner=NULL and is_kernel_event()) are triggered with suggested changes and it didn't trigger at all. Do we still need those additional checks? Yes. Please see perf_event_create_kernel_event(), which is an exported function allowing any kernel code (including modules) to use the PMU (just like the userspace perf tool would do). Just because your use case doesn't trigger this (because you don't run something that can trigger this) doesn't mean this can't be triggered. Thanks for that pointer, I will add them in the next version. And instead of redefining TASK_TOMBSTONE in the driver, you may simply use IS_ERR_OR_NULL(tsk) to cover both NULL case and kernel event. Ugh sorry, sent out v2 exporting is_kernel_event() before seeing this comment, I will resend. Saw that. I would say, wait until someone complains about that. If people are Ok with exporting it, it is fine. I guess it will be useful. You could fall back to this approach if there is resistance. Cheers Suzuki
Re: [PATCH 1/2] coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()
On 10/22/20 9:02 AM, Sai Prakash Ranjan wrote: On 2020-10-21 15:38, Suzuki Poulose wrote: On 10/21/20 8:29 AM, Sai Prakash Ranjan wrote: On 2020-10-20 21:40, Sai Prakash Ranjan wrote: On 2020-10-14 21:29, Sai Prakash Ranjan wrote: On 2020-10-14 18:46, Suzuki K Poulose wrote: On 10/14/2020 10:36 AM, Sai Prakash Ranjan wrote: On 2020-10-13 22:05, Suzuki K Poulose wrote: On 10/07/2020 02:00 PM, Sai Prakash Ranjan wrote: There was a report of NULL pointer dereference in ETF enable path for perf CS mode with PID monitoring. It is almost 100% reproducible when the process to monitor is something very active such as chrome and with ETF as the sink and not ETR. Currently in a bid to find the pid, the owner is dereferenced via task_pid_nr() call in tmc_enable_etf_sink_perf() and with owner being NULL, we get a NULL pointer dereference. Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace. Owner(task) is NULL even in the case of ETR in tmc_enable_etr_sink_perf(), but since we cache the PID in alloc_buffer() callback and it is done as part of etm_setup_aux() when allocating buffer for ETR sink, we never dereference this NULL pointer and we are safe. So lets do the The patch is necessary to fix some of the issues. But I feel it is not complete. Why is it safe earlier and not later ? I believe we are simply reducing the chances of hitting the issue, by doing this earlier than later. I would say we better fix all instances to make sure that the event->owner is valid. (e.g, I can see that the for kernel events event->owner == -1 ?) struct task_struct *tsk = READ_ONCE(event->owner); if (!tsk || is_kernel_event(event)) /* skip ? */ Looking at it some more, is_kernel_event() is not exposed outside events core and probably for good reason. Why do we need to check for this and not just tsk? Because the event->owner could be : = NULL = -1UL // kernel event = valid. Yes I understood that part, but here we were trying to fix the NULL pointer dereference right and hence the question as to why we need to check for kernel events? I am no expert in perf but I don't see anywhere in the kernel checking for is_kernel_event(), so I am a bit skeptical if exporting that is actually right or not. I have stress tested with the original patch many times now, i.e., without a check for event->owner and is_kernel_event() and didn't observe any crash. Plus on ETR where this was already done, no crashes were reported till date and with ETF, the issue was quickly reproducible, so I am fairly confident that this doesn't just delay the original issue but actually fixes it. I will run an overnight test again to confirm this. I ran the overnight test which collected aroung 4G data(see below), with the following small change to see if the two cases (event->owner=NULL and is_kernel_event()) are triggered with suggested changes and it didn't trigger at all. Do we still need those additional checks? Yes. Please see perf_event_create_kernel_event(), which is an exported function allowing any kernel code (including modules) to use the PMU (just like the userspace perf tool would do). Just because your use case doesn't trigger this (because you don't run something that can trigger this) doesn't mean this can't be triggered. Thanks for that pointer, I will add them in the next version. And instead of redefining TASK_TOMBSTONE in the driver, you may simply use IS_ERR_OR_NULL(tsk) to cover both NULL case and kernel event. Cheers Suzuki
Re: [PATCH 1/2] coresight: tmc-etf: Fix NULL ptr dereference in tmc_enable_etf_sink_perf()
On 10/21/20 8:29 AM, Sai Prakash Ranjan wrote: On 2020-10-20 21:40, Sai Prakash Ranjan wrote: On 2020-10-14 21:29, Sai Prakash Ranjan wrote: On 2020-10-14 18:46, Suzuki K Poulose wrote: On 10/14/2020 10:36 AM, Sai Prakash Ranjan wrote: On 2020-10-13 22:05, Suzuki K Poulose wrote: On 10/07/2020 02:00 PM, Sai Prakash Ranjan wrote: There was a report of NULL pointer dereference in ETF enable path for perf CS mode with PID monitoring. It is almost 100% reproducible when the process to monitor is something very active such as chrome and with ETF as the sink and not ETR. Currently in a bid to find the pid, the owner is dereferenced via task_pid_nr() call in tmc_enable_etf_sink_perf() and with owner being NULL, we get a NULL pointer dereference. Looking at the ETR and other places in the kernel, ETF and the ETB are the only places trying to dereference the task(owner) in tmc_enable_etf_sink_perf() which is also called from the sched_in path as in the call trace. Owner(task) is NULL even in the case of ETR in tmc_enable_etr_sink_perf(), but since we cache the PID in alloc_buffer() callback and it is done as part of etm_setup_aux() when allocating buffer for ETR sink, we never dereference this NULL pointer and we are safe. So lets do the The patch is necessary to fix some of the issues. But I feel it is not complete. Why is it safe earlier and not later ? I believe we are simply reducing the chances of hitting the issue, by doing this earlier than later. I would say we better fix all instances to make sure that the event->owner is valid. (e.g, I can see that the for kernel events event->owner == -1 ?) struct task_struct *tsk = READ_ONCE(event->owner); if (!tsk || is_kernel_event(event)) /* skip ? */ Looking at it some more, is_kernel_event() is not exposed outside events core and probably for good reason. Why do we need to check for this and not just tsk? Because the event->owner could be : = NULL = -1UL // kernel event = valid. Yes I understood that part, but here we were trying to fix the NULL pointer dereference right and hence the question as to why we need to check for kernel events? I am no expert in perf but I don't see anywhere in the kernel checking for is_kernel_event(), so I am a bit skeptical if exporting that is actually right or not. I have stress tested with the original patch many times now, i.e., without a check for event->owner and is_kernel_event() and didn't observe any crash. Plus on ETR where this was already done, no crashes were reported till date and with ETF, the issue was quickly reproducible, so I am fairly confident that this doesn't just delay the original issue but actually fixes it. I will run an overnight test again to confirm this. I ran the overnight test which collected aroung 4G data(see below), with the following small change to see if the two cases (event->owner=NULL and is_kernel_event()) are triggered with suggested changes and it didn't trigger at all. Do we still need those additional checks? Yes. Please see perf_event_create_kernel_event(), which is an exported function allowing any kernel code (including modules) to use the PMU (just like the userspace perf tool would do). Just because your use case doesn't trigger this (because you don't run something that can trigger this) doesn't mean this can't be triggered. Cheers Suzuki [ perf record: Captured and wrote 4677.989 MB perf.data ] diff --git a/drivers/hwtracing/coresight/coresight-tmc-etf.c b/drivers/hwtracing/coresight/coresight-tmc-etf.c index 989d965f3d90..123c446ce585 100644 --- a/drivers/hwtracing/coresight/coresight-tmc-etf.c +++ b/drivers/hwtracing/coresight/coresight-tmc-etf.c @@ -13,6 +13,13 @@ #include "coresight-tmc.h" #include "coresight-etm-perf.h" +#define TASK_TOMBSTONE ((void *)-1L) + +static bool is_kernel_event2(struct perf_event *event) +{ + return READ_ONCE(event->owner) == TASK_TOMBSTONE; +} + static int tmc_set_etf_buffer(struct coresight_device *csdev, struct perf_output_handle *handle); @@ -392,6 +399,15 @@ static void *tmc_alloc_etf_buffer(struct coresight_device *csdev, { int node; struct cs_buffers *buf; + struct task_struct *task = READ_ONCE(event->owner); + + if (!task) { + pr_info("**sai in task=NULL**\n"); + return NULL; + } + + if (is_kernel_event2(event)) + pr_info("**sai in is_kernel_event**\n"); node = (event->cpu == -1) ? NUMA_NO_NODE : cpu_to_node(event->cpu); Thanks, Sai
Re: [PATCH] coresight: etm4x: Add config to exclude kernel mode tracing
On 10/16/20 2:14 PM, Leo Yan wrote: On Fri, Oct 16, 2020 at 12:38:47PM +0100, Suzuki Kuruppassery Poulose wrote: [...] What happens to the sysfs mode of tracing? For that we would still need a config right to exclude kernel mode tracing completely. IIUC, sysfs mode and perf mode both can apply the same approach, the guest OS runs a thread context for the host, so when a guest OS is switched in or out, the hypervisor can save/restore the context for the guest OS; thus every guest OS will have its dedicated context and trace data ideally. I don't think Guest Context is something we can support as mentioned above, at least for systems without sysreg access for ETMs (and virtualizing ETRs is a different story !) Thanks for sharing thoughts, Suzuki. I missed the device virtulisation. Here should virtualize all devices (includes CoreSight ETM/funnel/ETR/ETF)? Or only need to virtualize ETRs? I wouldn't even think of virtualizing the components without sysreg access. So let us not worry about it :-) Cheers Suzuki
Re: [PATCH] coresight: etm4x: Skip setting LPOVERRIDE bit for qcom,skip-power-up
On 10/16/20 12:47 PM, Sai Prakash Ranjan wrote: Hi Suzuki, On 2020-10-16 16:51, Suzuki Poulose wrote: Hi Sai, On 10/16/20 11:10 AM, Sai Prakash Ranjan wrote: There is a bug on the systems supporting to skip power up (qcom,skip-power-up) where setting LPOVERRIDE bit(low-power state override behaviour) will result in CPU hangs/lockups even on the implementations which supports it. So skip setting the LPOVERRIDE bit for such platforms. Fixes: 02510a5aa78d ("coresight: etm4x: Add support to skip trace unit power up") Signed-off-by: Sai Prakash Ranjan The fix is fine by me. Btw, is there a hardware Erratum assigned for this ? It would be good to have the Erratum documented somewhere, preferrably ( Documentation/arm64/silicon-errata.rst ) No, afaik we don't have any erratum assigned to this bug. Ok. Please double check, if there are any. It was already present in downstream kernel and since we support these targets with the previous HW bug (qcom,skip-power-up) now in upstream, we would need this fix in upstream kernel as well. I understand the need for the fix and we must fix it. I was looking to document this in the central place for errata's handled in the kernel. And I missed asking this question when the original patch was posted. So, thought of asking the question now anyways. Better late than never ;-) Reviewed-by: Suzuki K Poulose
Re: [PATCH] coresight: etm4x: Add config to exclude kernel mode tracing
On 10/16/20 10:24 AM, Leo Yan wrote: Hi Sai, On Fri, Oct 16, 2020 at 02:10:47PM +0530, Sai Prakash Ranjan wrote: Hi Leo, On 2020-10-16 12:54, Leo Yan wrote: On Thu, Oct 15, 2020 at 11:40:05PM -0700, Denis Nikitin wrote: Hi Mathieu, I think one of the use cases could be VMs. Is there isolation between EL1 guest kernels which we can control from perf in a system wide mode? Sorry for suddenly jumping in. For KVM, I think we need to implement mechanism for saving/restoring CoreSight context for every guest OS, the CPU PMUs has implemented related features [1]. Thanks, Leo [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kvm/pmu.c Its not as easy as the CPU PMU for virtualizing the ETMs (with memory mapped access only), i.e supporting ETMs from VMs. We could definitely stop/resume on guest entry/exit, to support attr.exclude_guest. What happens to the sysfs mode of tracing? For that we would still need a config right to exclude kernel mode tracing completely. IIUC, sysfs mode and perf mode both can apply the same approach, the guest OS runs a thread context for the host, so when a guest OS is switched in or out, the hypervisor can save/restore the context for the guest OS; thus every guest OS will have its dedicated context and trace data ideally. I don't think Guest Context is something we can support as mentioned above, at least for systems without sysreg access for ETMs (and virtualizing ETRs is a different story !) Cheers Suzuki
Re: [PATCH] coresight: etm4x: Skip setting LPOVERRIDE bit for qcom,skip-power-up
Hi Sai, On 10/16/20 11:10 AM, Sai Prakash Ranjan wrote: There is a bug on the systems supporting to skip power up (qcom,skip-power-up) where setting LPOVERRIDE bit(low-power state override behaviour) will result in CPU hangs/lockups even on the implementations which supports it. So skip setting the LPOVERRIDE bit for such platforms. Fixes: 02510a5aa78d ("coresight: etm4x: Add support to skip trace unit power up") Signed-off-by: Sai Prakash Ranjan The fix is fine by me. Btw, is there a hardware Erratum assigned for this ? It would be good to have the Erratum documented somewhere, preferrably ( Documentation/arm64/silicon-errata.rst ) --- drivers/hwtracing/coresight/coresight-etm4x-core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c index abd706b216ac..6096d7abf80d 100644 --- a/drivers/hwtracing/coresight/coresight-etm4x-core.c +++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c @@ -779,7 +779,7 @@ static void etm4_init_arch_data(void *info) * LPOVERRIDE, bit[23] implementation supports * low-power state override */ - if (BMVAL(etmidr5, 23, 23)) + if (BMVAL(etmidr5, 23, 23) && (!drvdata->skip_power_up)) drvdata->lpoverride = true; else drvdata->lpoverride = false; base-commit: 3477326277451000bc667dfcc4fd0774c039184c
Re: [PATCH] coresight: etm4x: Add config to exclude kernel mode tracing
On 10/16/20 7:40 AM, Denis Nikitin wrote: Hi Mathieu, I think one of the use cases could be VMs. Is there isolation between EL1 guest kernels which we can control from perf in a system wide mode? The proposed solution doesn't solve this for VMs anyway. It only excludes EL1 *OR* EL2, depending on the host kernel's running EL. We cannot support Virtual ETM access for VMs with memory mapped accesses. Unforutnately, trace filtering is the solution for preventing tracing for EL1 guest/kernel (available from v8.4 Self Hosted extensions). Other option is to add support for "exclude_guest" support for CoreSight for perf. But again this can't be controlled by sysfs. And it can't be enforced for perf, if not specified. Again it all goes back to the root permission hammer lock which Mathieu pointed out. With the v8.4 Self hosted trace extensions, Guest and Host both could control individually if they can be traced (both EL0 and EL1/2). Suzuki Thanks, Denis On Thu, Oct 15, 2020 at 9:03 AM Mathieu Poirier mailto:mathieu.poir...@linaro.org>> wrote: On Thu, Oct 15, 2020 at 06:15:22PM +0530, Sai Prakash Ranjan wrote: > On production systems with ETMs enabled, it is preferred to > exclude kernel mode(NS EL1) tracing for security concerns and > support only userspace(NS EL0) tracing. So provide an option > via kconfig to exclude kernel mode tracing if it is required. > This config is disabled by default and would not affect the > current configuration which has both kernel and userspace > tracing enabled by default. > One requires root access (or be part of a special trace group) to be able to use the cs_etm PMU. With this kind of elevated access restricting tracing at EL1 provides little in terms of security. Thanks, Mathieu > Signed-off-by: Sai Prakash Ranjan mailto:saiprakash.ran...@codeaurora.org>> > --- > drivers/hwtracing/coresight/Kconfig | 9 + > drivers/hwtracing/coresight/coresight-etm4x-core.c | 6 +- > 2 files changed, 14 insertions(+), 1 deletion(-) > > diff --git a/drivers/hwtracing/coresight/Kconfig b/drivers/hwtracing/coresight/Kconfig > index c1198245461d..52435de8824c 100644 > --- a/drivers/hwtracing/coresight/Kconfig > +++ b/drivers/hwtracing/coresight/Kconfig > @@ -110,6 +110,15 @@ config CORESIGHT_SOURCE_ETM4X > To compile this driver as a module, choose M here: the > module will be called coresight-etm4x. > > +config CORESIGHT_ETM4X_EXCL_KERN > + bool "Coresight ETM 4.x exclude kernel mode tracing" > + depends on CORESIGHT_SOURCE_ETM4X > + help > + This will exclude kernel mode(NS EL1) tracing if enabled. This option > + will be useful to provide more flexible options on production systems > + where only userspace(NS EL0) tracing might be preferred for security > + reasons. > + > config CORESIGHT_STM > tristate "CoreSight System Trace Macrocell driver" > depends on (ARM && !(CPU_32v3 || CPU_32v4 || CPU_32v4T)) || ARM64 > diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c > index abd706b216ac..7e5669e5cd1f 100644 > --- a/drivers/hwtracing/coresight/coresight-etm4x-core.c > +++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c > @@ -832,6 +832,9 @@ static u64 etm4_get_ns_access_type(struct etmv4_config *config) > { > u64 access_type = 0; > > + if (IS_ENABLED(CONFIG_CORESIGHT_ETM4X_EXCL_KERN)) > + config->mode |= ETM_MODE_EXCL_KERN; > + > /* > * EXLEVEL_NS, bits[15:12] > * The Exception levels are: > @@ -849,7 +852,8 @@ static u64 etm4_get_ns_access_type(struct etmv4_config *config) > access_type = ETM_EXLEVEL_NS_HYP; > } > > - if (config->mode & ETM_MODE_EXCL_USER) > + if (config->mode & ETM_MODE_EXCL_USER && > + !IS_ENABLED(CONFIG_CORESIGHT_ETM4X_EXCL_KERN)) > access_type |= ETM_EXLEVEL_NS_APP; > > return access_type; > > base-commit: 3477326277451000bc667dfcc4fd0774c039184c > -- > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member > of Code Aurora Forum, hosted by The Linux Foundation >