On Fri, Apr 16, 2021 at 8:59 PM Peter Zijlstra <pet...@infradead.org> wrote:
>
> On Fri, Apr 16, 2021 at 08:22:38PM +0900, Namhyung Kim wrote:
> > On Fri, Apr 16, 2021 at 7:28 PM Peter Zijlstra <pet...@infradead.org> wrote:
> > >
> > > On Fri, Apr 16, 2021 at 11:29:30AM +0200, Peter Zijlstra wrote:
> > >
> > > > > So I think we've had proposals for being able to close fds in the past;
> > > > > while preserving groups etc. We've always pushed back on that because of
> > > > > the resource limit issue. By having each counter be a filedesc we get a
> > > > > natural limit on the amount of resources you can consume. And in that
> > > > > respect, having to use 400k fds is things working as designed.
> > > > >
> > > > > Anyway, there might be a way around this..
> > >
> > > So how about we flip the whole thing sideways, instead of doing one
> > > event for multiple cgroups, do an event for multiple-cpus.
> > >
> > > Basically, allow:
> > >
> > >   perf_event_open(.pid=fd, cpu=-1, .flag=PID_CGROUP);
> > >
> > > Which would have the kernel create nr_cpus events [the corollary is that
> > > we'd probably also allow: (.pid=-1, cpu=-1) ].
> >
> > Do you mean it'd have separate perf_events per cpu internally?
> > From a cpu's perspective, there's nothing changed, right?
> > Then it will have the same performance problem as of now.
>
> Yes, but we'll not end up in ioctl() hell. The interface is sooo much
> better. The performance thing just means we need to think harder.
>
> I thought cgroup scheduling got a lot better with the work Ian did a
> while back? What's the actual bottleneck now?
Yep, that's true, but it still comes with the high cost of multiplexing at
every context (cgroup) switch. It's inefficient to reprogram the PMU with
exactly the same config just for a different cgroup, and accessing the
MSRs is not a cheap operation.

> > > Output could be done by adding FORMAT_PERCPU, which takes the current
> > > read() format and writes a copy for each CPU event. (p)read(v)() could
> > > be used to explode or partial read that.
> >
> > Yeah, I think it's good for read. But what about mmap?
> > I don't think we can use file offset since it's taken for auxtrace.
> > Maybe we can simply disallow that..
>
> Are you actually using mmap() to read? I had a proposal for a
> FORMAT_GROUP-like thing for mmap(), but I never implemented that (didn't
> get the enthusiastic response I thought it would). But yeah, there's
> nowhere near enough space in there for PERCPU.

There's a recent patch to do it with rdpmc, which needs to mmap first:

https://lore.kernel.org/lkml/20210414155412.3697605-1-r...@kernel.org/

> Not sure how to do that, these counters must not be sampling counters
> because we can't be sharing a buffer from multiple CPUs, so data/aux
> just isn't a concern. But it's weird to have them magically behave
> differently.

Yeah, it's weird, and we should limit the sampling use case.

Thanks,
Namhyung
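
To make the shape of the proposal concrete, below is a rough userspace
sketch of how the interface discussed above might look. Everything
proposal-specific is an assumption: PERF_FORMAT_PERCPU does not exist in
any released kernel (the bit value below is a placeholder), current
kernels reject cpu == -1 together with PERF_FLAG_PID_CGROUP, the read()
layout is one guess at what "a copy for each CPU event" could mean, and
the cgroup path is just an example.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Hypothetical read_format bit from the proposal; the value is a placeholder. */
#define PERF_FORMAT_PERCPU	(1U << 4)

int main(void)
{
	struct perf_event_attr attr = {
		.type		= PERF_TYPE_HARDWARE,
		.size		= sizeof(attr),
		.config		= PERF_COUNT_HW_CPU_CYCLES,
		.read_format	= PERF_FORMAT_PERCPU,	/* proposed, not real ABI */
	};
	long nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
	uint64_t *vals;
	int cgrp_fd, ev_fd;

	/* As with PERF_FLAG_PID_CGROUP today, .pid carries a cgroup fd. */
	cgrp_fd = open("/sys/fs/cgroup/mygroup", O_RDONLY);	/* example path */
	if (cgrp_fd < 0)
		return 1;

	/*
	 * The proposed part: cpu == -1 with PERF_FLAG_PID_CGROUP would have
	 * the kernel create nr_cpus events behind this one fd.  Current
	 * kernels require cpu >= 0 here and would fail with EINVAL.
	 */
	ev_fd = syscall(__NR_perf_event_open, &attr, cgrp_fd, /*cpu=*/-1,
			/*group_fd=*/-1, PERF_FLAG_PID_CGROUP);
	if (ev_fd < 0)
		return 1;

	/*
	 * Assumed output layout: one u64 counter value per CPU, exploded by
	 * a single read(); pread() could fetch a partial range instead.
	 */
	vals = calloc(nr_cpus, sizeof(*vals));
	if (!vals || read(ev_fd, vals, nr_cpus * sizeof(*vals)) < 0)
		return 1;

	for (long cpu = 0; cpu < nr_cpus; cpu++)
		printf("cpu%ld: %" PRIu64 "\n", cpu, vals[cpu]);

	return 0;
}

The point of the sketch is the fd economy: one fd per cgroup instead of
one per cgroup per CPU, with the per-CPU values exploded at read() time.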