Re: [PATCH RFC v3 for-6.8/block 07/17] bcachefs: remove dead function bdev_sectors()
On Thu, Dec 21, 2023 at 04:57:02PM +0800, Yu Kuai wrote:
> From: Yu Kuai
>
> bdev_sectors() is not used hence remove it.
>
> Signed-off-by: Yu Kuai

Acked-by: Kent Overstreet

> ---
>  fs/bcachefs/util.h | 5 -
>  1 file changed, 5 deletions(-)
>
> diff --git a/fs/bcachefs/util.h b/fs/bcachefs/util.h
> index 2984b57b2958..22a0acc1704f 100644
> --- a/fs/bcachefs/util.h
> +++ b/fs/bcachefs/util.h
> @@ -516,11 +516,6 @@ static inline unsigned fract_exp_two(unsigned x, unsigned fract_bits)
>  void bch2_bio_map(struct bio *bio, void *base, size_t);
>  int bch2_bio_alloc_pages(struct bio *, size_t, gfp_t);
>
> -static inline sector_t bdev_sectors(struct block_device *bdev)
> -{
> -	return bdev->bd_inode->i_size >> 9;
> -}
> -
>  #define closure_bio_submit(bio, cl)				\
>  do {								\
>  	closure_get(cl);					\
> --
> 2.39.2
>
Re: [PATCH RFC v3 for-6.8/block 09/17] btrfs: use bdev apis
On Sat, Dec 23, 2023 at 05:31:55PM +, Matthew Wilcox wrote:
> On Thu, Dec 21, 2023 at 04:57:04PM +0800, Yu Kuai wrote:
> > @@ -3674,16 +3670,17 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev,
> >  	 * Drop the page of the primary superblock, so later read will
> >  	 * always read from the device.
> >  	 */
> > -	invalidate_inode_pages2_range(mapping,
> > -			bytenr >> PAGE_SHIFT,
> > +	invalidate_bdev_range(bdev, bytenr >> PAGE_SHIFT,
> >  			(bytenr + BTRFS_SUPER_INFO_SIZE) >> PAGE_SHIFT);
> >  	}
> >
> > -	page = read_cache_page_gfp(mapping, bytenr >> PAGE_SHIFT, GFP_NOFS);
> > -	if (IS_ERR(page))
> > -		return ERR_CAST(page);
> > +	nofs_flag = memalloc_nofs_save();
> > +	folio = bdev_read_folio(bdev, bytenr);
> > +	memalloc_nofs_restore(nofs_flag);
>
> This is the wrong way to use memalloc_nofs_save/restore.  They should be
> used at the point that the filesystem takes/releases whatever lock is
> also used during reclaim.  I don't know btrfs well enough to suggest
> what lock is missing these annotations.

Yes, but considering this is a cross-filesystem cleanup I wouldn't want to
address that in this patchset.

The easier, more incremental approach for the conversion would be to first
convert every GFP_NOFS usage to memalloc_nofs_save() like this patch does, as
small local changes, and then let the btrfs people combine them and move them
to the appropriate location in a separate patchset.
Re: [PATCH block/for-next v2 07/16] bcachefs: use new helper to get inode from block_device
On Mon, Nov 27, 2023 at 04:09:47PM +0900, Damien Le Moal wrote:
> On 11/27/23 15:21, Yu Kuai wrote:
> > From: Yu Kuai
> >
> > Which is more efficiency, and also prepare to remove the field
> > 'bd_inode' from block_device.
> >
> > Signed-off-by: Yu Kuai
> > ---
> >  fs/bcachefs/util.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/bcachefs/util.h b/fs/bcachefs/util.h
> > index 2984b57b2958..fe7ccb3a3517 100644
> > --- a/fs/bcachefs/util.h
> > +++ b/fs/bcachefs/util.h
> > @@ -518,7 +518,7 @@ int bch2_bio_alloc_pages(struct bio *, size_t, gfp_t);
> >
> >  static inline sector_t bdev_sectors(struct block_device *bdev)
> >  {
> > -	return bdev->bd_inode->i_size >> 9;
> > +	return bdev_inode(bdev)->i_size >> 9;
>
> shouldn't this use i_size_read() ?
>
> I missed the history with this but why not use bdev_nr_sectors() and delete
> this helper ?

Actually, this helper seems to be dead code.
Re: [PATCH block/for-next v2 07/16] bcachefs: use new helper to get inode from block_device
On Mon, Nov 27, 2023 at 02:21:07PM +0800, Yu Kuai wrote:
> From: Yu Kuai
>
> Which is more efficiency, and also prepare to remove the field
> 'bd_inode' from block_device.
>
> Signed-off-by: Yu Kuai

Acked-by: Kent Overstreet

> ---
>  fs/bcachefs/util.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/bcachefs/util.h b/fs/bcachefs/util.h
> index 2984b57b2958..fe7ccb3a3517 100644
> --- a/fs/bcachefs/util.h
> +++ b/fs/bcachefs/util.h
> @@ -518,7 +518,7 @@ int bch2_bio_alloc_pages(struct bio *, size_t, gfp_t);
>
>  static inline sector_t bdev_sectors(struct block_device *bdev)
>  {
> -	return bdev->bd_inode->i_size >> 9;
> +	return bdev_inode(bdev)->i_size >> 9;
>  }
>
>  #define closure_bio_submit(bio, cl)				\
> --
> 2.39.2
>
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 08, 2022 at 09:12:45AM +0200, Michal Hocko wrote:
> Then you have probably missed a huge part of my emails. Please
> re-read. If those arguments are not clear, feel free to ask for
> clarification. Reducing the whole my reasoning and objections to the
> sentence above and calling that vapid and lazy is not only unfair but
> also disrespectful.

What, where you complained about slab's page allocations showing up in the
profile instead of slab, and I pointed out to you that actually each and every
slab call is instrumented, and you're just seeing some double counting (that
we will no doubt fix)?

Or when you complained about allocation sites where it should actually be the
caller that should be instrumented, and I pointed out that it'd be quite easy
to simply change that code to use _kmalloc() and slab_tag_add() directly, if
it becomes an issue?

Of course, if we got that far, we'd have this code to thank for telling us
where to look!

Did I miss anything?
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Wed, Sep 07, 2022 at 11:49:37PM -0700, Suren Baghdasaryan wrote:
> I would really appreciate if everyone could please stick to the
> technical side of the conversation. That way we can get some
> constructive feedback. Everything else is not helpful and at best is a
> distraction.
> Maintenance burden is a price we pay and I think it's the prerogative
> of the maintainers to take that into account. Our job is to prove that
> the price is worth paying.

Well said.

I'd also like to add - slab.h does look pretty overgrown and messy. We've
grown a _lot_ of special purpose memory allocation interfaces, and I think it
probably is time to try and wrangle that back.

The API complexity isn't just an issue for this patch - it's an issue for
anything that has to wrap and plumb through memory allocation interfaces.
It's a pain point for the Rust people, and also comes up in e.g. the mempool
API.

I think we should keep going with the memalloc_no*_save()/restore() approach,
and extend it to other things:

 - memalloc_nowait_save()
 - memalloc_highpri_save()

(these two get you GFP_ATOMIC).

Also, I don't think these all need to be separate functions, we could have

  memalloc_gfp_apply()
  memalloc_gfp_restore()

which simply takes a gfp flags argument and applies it to the current
PF_MEMALLOC flags.

We've had long standing bugs where vmalloc() can't correctly take gfp flags
because some of the allocations it does for page tables don't have it
correctly plumbed through; switching to the memalloc_*_(save|restore)
approach is something people have been wanting in order to fix this - for
years. Actually following through and completing this would let us kill the
gfp flags arguments to our various memory allocators entirely.

I think we can do the same thing with the numa node parameter - kill
kmalloc_node() et al., move it to task_struct with a set of save/restore
functions.

There's probably other things we can do to simplify slab.h if we look more.
I've been hoping to start pushing patches for some of this stuff - it's going to be some time before I can get to it though, can only handle so many projects in flight at a time :)
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Wed, Sep 07, 2022 at 09:45:18AM -0400, Steven Rostedt wrote:
> On Wed, 7 Sep 2022 09:04:28 -0400
> Kent Overstreet wrote:
>
> > On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> > > Hmm, it seems that further discussion doesn't really make much sense
> > > here. I know how to use my time better.
> >
> > Just a thought, but I generally find it more productive to propose ideas
> > than to just be disparaging.
>
> But it's not Michal's job to do so. He's just telling you that the given
> feature is not worth the burden. He's telling you the issues that he has
> with the patch set. It's the submitter's job to address those concerns and
> not the maintainer's to tell you how to make it better.
>
> When Linus tells us that a submission is crap, we don't ask him how to make
> it less crap, we listen to why he called it crap, and then rewrite to be
> not so crappy. If we cannot figure it out, it doesn't get in.

When Linus tells someone a submission is crap, he _always_ has a sound, and
_specific_ technical justification for doing so.

"This code is going to be a considerable maintenance burden" is vapid, and
lazy. It's the kind of feedback made by someone who has looked at the number
of lines of code a patch touches and not much more.
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> Hmm, it seems that further discussion doesn't really make much sense
> here. I know how to use my time better.

Just a thought, but I generally find it more productive to propose ideas than
to just be disparaging.

Cheers,
Kent
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Tue, Sep 06, 2022 at 09:23:31AM +0200, Michal Hocko wrote:
> On Mon 05-09-22 19:46:49, Kent Overstreet wrote:
> > On Mon, Sep 05, 2022 at 10:49:38AM +0200, Michal Hocko wrote:
> > > This is really my main concern about this whole work. Not only it adds a
> > > considerable maintenance burden to the core MM because
> >
> > [citation needed]
>
> I thought this was clear from the email content (the part you haven't
> quoted here). But let me be explicit one more time for you.
>
> I hope we can agree that in order for this kind of tracking to be useful
> you need to cover _callers_ of the allocator or in the ideal world
> the users/owner of the tracked memory (the later is sometimes much
> harder/impossible to track when the memory is handed over from one peer
> to another).
>
> It is not particularly useful IMO to see that a large portion of the
> memory has been allocated by say vmalloc or kvmalloc, right? How
> much does it really tell you that a lot of memory has been allocated
> by kvmalloc or vmalloc? Yet, neither of the two is handled by the
> proposed tracking and it would require additional code to be added and
> _maintained_ to cover them. But that would be still far from complete,
> we have bulk allocator, mempools etc.

Of course - and even a light skimming of the patch set would see it does
indeed address this. We still have to do vmalloc and percpu memory
allocations, but slab is certainly handled and that's the big one.

> As pointed above this just scales poorly and adds to the API space. Not
> to mention that direct use of alloc_tag_add can just confuse layers
> below which rely on the same thing.

It might help you make your case if you'd say something about what you'd like
better. Otherwise, saying "code has to be maintained" is a little bit like
saying water is wet, and we're all engineers here, I think we know that :)
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Mon, Sep 05, 2022 at 06:16:50PM -0400, Steven Rostedt wrote:
> On Mon, 5 Sep 2022 16:42:29 -0400
> Kent Overstreet wrote:
>
> > > Haven't tried that yet but will do. Thanks for the reference code!
> >
> > Is it really worth the effort of benchmarking tracing API overhead here?
> >
> > The main cost of a tracing based approach is going to be the data
> > structure for remembering outstanding allocations so that free events
> > can be matched to the appropriate callsite. Regardless of whether it's
> > done with BPF or by attaching to the tracepoints directly, that's going
> > to be the main overhead.
>
> The point I was making here is that you do not need your own hooking
> mechanism. You can get the information directly by attaching to the
> tracepoint.
>
> > > static void my_callback(void *data, unsigned long call_site,
> > >                         const void *ptr, struct kmem_cache *s,
> > >                         size_t bytes_req, size_t bytes_alloc,
> > >                         gfp_t gfp_flags)
> > > {
> > >         struct my_data_struct *my_data = data;
> > >
> > >         { do whatever }
> > > }
>
> The "do whatever" is anything you want to do.
>
> Or is the data structure you create with this approach going to be too much
> overhead? How hard is it for a hash or binary search lookup?

If you don't think it's hard, go ahead and show us.
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Mon, Sep 05, 2022 at 10:49:38AM +0200, Michal Hocko wrote:
> This is really my main concern about this whole work. Not only it adds a
> considerable maintenance burden to the core MM because

[citation needed]

> it adds on top of our existing allocator layers complexity but it would
> need to spread beyond MM to be useful because it is usually outside of MM
> where leaks happen.

If you want the tracking to happen at a different level of the call stack,
just call _kmalloc() directly and call alloc_tag_add()/sub() yourself.
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Mon, Sep 05, 2022 at 11:08:21AM -0700, Suren Baghdasaryan wrote:
> On Mon, Sep 5, 2022 at 8:06 AM Steven Rostedt wrote:
> >
> > On Sun, 4 Sep 2022 18:32:58 -0700
> > Suren Baghdasaryan wrote:
> >
> > > Page allocations (overheads are compared to get_free_pages() duration):
> > > 6.8%  Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > > 8.8%  lookup_page_ext
> > > 1237% call stack capture
> > > 139%  tracepoint with attached empty BPF program
> >
> > Have you tried tracepoint with custom callback?
> >
> > static void my_callback(void *data, unsigned long call_site,
> >                         const void *ptr, struct kmem_cache *s,
> >                         size_t bytes_req, size_t bytes_alloc,
> >                         gfp_t gfp_flags)
> > {
> >         struct my_data_struct *my_data = data;
> >
> >         { do whatever }
> > }
> >
> > [..]
> > register_trace_kmem_alloc(my_callback, my_data);
> >
> > Now the my_callback function will be called directly every time the
> > kmem_alloc tracepoint is hit.
> >
> > This avoids that perf and BPF overhead.
>
> Haven't tried that yet but will do. Thanks for the reference code!

Is it really worth the effort of benchmarking tracing API overhead here?

The main cost of a tracing based approach is going to be the data structure
for remembering outstanding allocations so that free events can be matched to
the appropriate callsite. Regardless of whether it's done with BPF or by
attaching to the tracepoints directly, that's going to be the main overhead.
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Fri, Sep 02, 2022 at 01:53:53PM -0600, Jens Axboe wrote:
> I've complained about memcg accounting before, the slowness of it is why
> io_uring works around it by caching. Anything we account we try NOT do
> in the fast path because of it, the slowdown is considerable.

I'm with you on that, it definitely raises an eyebrow.

> You care about efficiency now? I thought that was relegated to
> irrelevant 10M IOPS cases.

I always did, it's just not the only thing I care about.
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Fri, Sep 02, 2022 at 06:02:12AM -0600, Jens Axboe wrote:
> On 9/1/22 7:04 PM, Roman Gushchin wrote:
> > On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
> >> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> >>> I'd suggest to run something like iperf on a fast hardware. And maybe
> >>> some io_uring stuff too. These are two places which were historically
> >>> most sensitive to the (kernel) memory accounting speed.
> >>
> >> I'm getting wildly inconsistent results with iperf.
> >>
> >> io_uring-echo-server and rust_echo_bench gets me:
> >> Benchmarking: 127.0.0.1:12345
> >> 50 clients, running 512 bytes, 60 sec.
> >>
> >> Without alloc tagging: 120547 request/sec
> >> With:                  116748 request/sec
> >>
> >> https://github.com/frevib/io_uring-echo-server
> >> https://github.com/haraldh/rust_echo_bench
> >>
> >> How's that look to you? Close enough? :)
> >
> > Yes, this looks good (a bit too good).
> >
> > I'm not that familiar with io_uring, Jens and Pavel should have a better
> > idea what and how to run (I know they've workarounded the kernel memory
> > accounting because of the performance in the past, this is why I suspect
> > it might be an issue here as well).
>
> io_uring isn't alloc+free intensive on a per request basis anymore, it
> would not be a good benchmark if the goal is to check for regressions in
> that area.

Good to know. The benchmark is still a TCP benchmark though, so still useful.

Matthew suggested

  while true; do echo 1 >/tmp/foo; rm /tmp/foo; done

I ran that on tmpfs, and the numbers with and without alloc tagging were
statistically equal - there was a fair amount of variation, it wasn't a super
controlled test, anywhere from 38-41 seconds with 10 iterations (and alloc
tagging was some of the faster runs).

But with memcg off, it ran in 32-33 seconds. We're piggybacking on the same
mechanism memcg uses for stashing per-object pointers, so it looks like
that's the bigger cost.
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote:
> If ftrace, perf, bpf can't do what you want, take a harder look to see if
> you can modify them to do so.

Maybe we can use this exchange to make both of our tools better. I like your
histograms - the quantiles algorithm I've had for years is janky, I've been
meaning to rip that out, I'd love to take a look at your code for that. And
having an on/off switch is a good idea, I'll try to add that at some point.

Maybe you got some ideas from my stuff too.

I'd love to get better tracepoints for measuring latency - what I added to
init_wait() and finish_wait() was really only a starting point. Figuring out
the right places to measure is where I'd like to be investing my time in this
area, and there's no reason we couldn't both be making use of that.

e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but
not all code uses that, init_wait() got me better coverage. But I've already
seen that that misses things, too, there's more work to be done.

random thought: might try adding a warning in schedule() any time it's called
and codetag_time_stats_start() hasn't been called, that'll be a starting
point...
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 06:04:46PM -0700, Roman Gushchin wrote:
> On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
> > On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> > > I'd suggest to run something like iperf on a fast hardware. And maybe
> > > some io_uring stuff too. These are two places which were historically
> > > most sensitive to the (kernel) memory accounting speed.
> >
> > I'm getting wildly inconsistent results with iperf.
> >
> > io_uring-echo-server and rust_echo_bench gets me:
> > Benchmarking: 127.0.0.1:12345
> > 50 clients, running 512 bytes, 60 sec.
> >
> > Without alloc tagging: 120547 request/sec
> > With:                  116748 request/sec
> >
> > https://github.com/frevib/io_uring-echo-server
> > https://github.com/haraldh/rust_echo_bench
> >
> > How's that look to you? Close enough? :)
>
> Yes, this looks good (a bit too good).

Eh, I was hoping for better :)

> I'm not that familiar with io_uring, Jens and Pavel should have a better
> idea what and how to run (I know they've workarounded the kernel memory
> accounting because of the performance in the past, this is why I suspect it
> might be an issue here as well).
>
> This is a recent optimization on the networking side:
> https://lore.kernel.org/linux-mm/20220825000506.239406-1-shake...@google.com/
>
> Maybe you can try to repeat this experiment.

I'd be more interested in a synthetic benchmark, if you know of any.
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> I'd suggest to run something like iperf on a fast hardware. And maybe some
> io_uring stuff too. These are two places which were historically most
> sensitive to the (kernel) memory accounting speed.

I'm getting wildly inconsistent results with iperf.

io_uring-echo-server and rust_echo_bench gets me:

  Benchmarking: 127.0.0.1:12345
  50 clients, running 512 bytes, 60 sec.

  Without alloc tagging: 120547 request/sec
  With:                  116748 request/sec

https://github.com/frevib/io_uring-echo-server
https://github.com/haraldh/rust_echo_bench

How's that look to you? Close enough? :)
Re: [RFC PATCH 28/30] Improved symbolic error names
On Thu, Sep 01, 2022 at 04:19:35PM -0700, Joe Perches wrote:
> On Tue, 2022-08-30 at 14:49 -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet
> >
> > This patch adds per-error-site error codes, with error strings that
> > include their file and line number.
> >
> > To use, change code that returns an error, e.g.
> > 	return -ENOMEM;
> > to
> > 	return -ERR(ENOMEM);
> >
> > Then, errname() will return a string that includes the file and line
> > number of the ERR() call, for example
> > 	printk("Got error %s!\n", errname(err));
> > will result in
> > 	Got error ENOMEM at foo.c:1234
>
> Why? Something wrong with just using %pe ?
>
> 	printk("Got error %pe at %s:%d!\n", ERR_PTR(err), __FILE__, __LINE__);
>
> Likely __FILE__ and __LINE__ aren't particularly useful.

That doesn't do what this patchset does. If it only did that, it wouldn't
make much sense, would it? :)

With this patchset,

	printk("Got error %pe!\n", ptr);

prints out a file and line number, but it's _not_ the file/line number of the
printk statement - it's the file/line number where the error originated! :)
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> On Thu, 1 Sep 2022 17:54:38 -0400
> Kent Overstreet wrote:
>
> > So this looks like it's gotten better since I last looked, but it's
> > still not there yet.
> >
> > Part of the problem is that the tracepoints themselves are in the wrong
> > place: your end event is when a task is woken up, but that means
> > spurious wakeups will
>
> The end event is when a task is scheduled onto the CPU. The start event is
> the first time it is woken up.

Yeah, that's not what I want. You're just tracing latency due to having more
processes runnable than CPUs.

I don't care about that for debugging, though! I specifically want latency at
the wait_event() level, and related - every time a process blocked _on some
condition_, until that condition became true. Not until some random,
potentially spurious wakeup.

> Not the prettiest thing to read. But hey, we got the full stack of where
> these latencies happened!

Most of the time I _don't_ want full stacktraces, though!

That means I have a ton more output to sort through, and the data is far more
expensive to collect. I don't know why it's what people go to first - see
the page_owner stuff - but that doesn't get used much either because the
output is _really hard to sort through_.

Most of the time, just a single file and line number is all you want - and
tracing has always made it hard to get at that.

> Yes, it adds some overhead when the events are triggered due to the
> stacktrace code, but it's extremely useful information.
>
> > So, it looks like tracing has made some progress over the past 10 years,
> > but for debugging latency issues it's still not there yet in general. I
>
> I call BS on that statement. Just because you do not know what has been
> added to the kernel in the last 10 years (like you had no idea about
> seq_buf and that was added in 2014) means to me that you are totally
> clueless on what tracing can and can not do.
> It appears to me that you are too focused on inventing your own wheel that
> does exactly what you want before looking to see how things are today. Just
> because something didn't fit your needs 10 years ago doesn't mean that it
> can't fit your needs today.

...And the ad hominem attacks start.

Steve, I'm not attacking you, and there's room enough in this world for the
both of us to be doing our thing creating new and useful tools.

> I'm already getting complaints from customers/users that are saying there's
> too many tools in the toolbox already. (Do we use ftrace/perf/bpf?). The
> idea is to have the tools using mostly the same infrastructure, and not be
> 100% off on its own, unless there's a clear reason to invent a new wheel
> that several people are asking for, not just one or two.

I would like to see more focus on usability. That means, in a best case
scenario, always-on data collection that I can just look at, and it'll
already be in the format most likely to be useful.

Surely you can appreciate the usefulness of that..?

Tracing started out as a tool for efficiently getting lots of data out of the
kernel, and it's great for that. But I think your focus on the cool thing you
built may be blinding you a bit to alternative approaches...
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > This is very interesting work! Do you have any data about the overhead
> > this introduces, especially in a production environment? I am
> > especially interested in memory allocations tracking and detecting
> > leaks.
>
> +1
>
> I think the question whether it indeed can be always turned on in the
> production or not is the main one. If not, the advantage over
> ftrace/bpf/... is not that obvious. Otherwise it will be indeed a VERY
> useful thing.

Low enough overhead to run in production was my primary design goal.

Stats are kept in a struct that's defined at the callsite. So this adds _no_
pointer chasing to the allocation path, unless we've switched to percpu
counters at that callsite (see the lazy percpu counters patch), where we need
to deref one percpu pointer to save an atomic.

Then we need to stash a pointer to the alloc_tag, so that kfree() can find
it. For slab allocations this uses the same storage area as memcg, so for
allocations that are using that we won't be touching any additional
cachelines. (I wanted the pointer to the alloc_tag to be stored inline with
the allocation, but that would've caused alignment difficulties).

Then there's a pointer deref introduced to the kfree() path, to get back to
the original alloc_tag and subtract the allocation from that callsite. That
one won't be free, and with percpu counters we've got another dependent load
too - hmm, it might be worth benchmarking with just atomics, skipping the
percpu counters.

So the overhead won't be zero, I expect it'll show up in some synthetic
benchmarks, but yes I do definitely expect this to be worth enabling in
production in many scenarios.
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Thu, Sep 01, 2022 at 05:38:44PM -0400, Steven Rostedt wrote:
> On Tue, 30 Aug 2022 14:49:16 -0700
> Suren Baghdasaryan wrote:
>
> > From: Kent Overstreet
> >
> > This adds the ability to easily instrument code for measuring latency.
> > To use, add the following to calls to your code, at the start and end of
> > the event you wish to measure:
> >
> > code_tag_time_stats_start(start_time);
> > code_tag_time_stats_finish(start_time);
>
> So you need to modify the code to see what you want?

Figuring out the _correct_ place to measure is often a significant amount of
the total effort. Having done so once, why not annotate that in the source
code?

> For function length you could just do something like this:
>
>  # cd /sys/kernel/tracing
>  # echo __skb_wait_for_more_packets > set_ftrace_filter
>  # echo 1 > function_profile_enabled
>  # cat trace_stat/function*
>   Function                      Hit    Time         Avg          s^2
>   --------                      ---    ----         ---          ---
>   __skb_wait_for_more_packets   1      0.000 us     0.000 us     0.000 us
>   Function                      Hit    Time         Avg          s^2
>   --------                      ---    ----         ---          ---
>   __skb_wait_for_more_packets   1      74.813 us    74.813 us    0.000 us
>   Function                      Hit    Time         Avg          s^2
>   --------                      ---    ----         ---          ---
>   Function                      Hit    Time         Avg          s^2
>   --------                      ---    ----         ---          ---
>
> The above is for a 4 CPU machine. The s^2 is the square of the standard
> deviation (makes not having to do divisions while it runs).
>
> But if you are looking for latency between two events (which can be kprobes
> too, where you do not need to rebuild your kernel):
>
> From: https://man.archlinux.org/man/sqlhist.1.en
> which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
> if not already installed on your distro.
> # sqlhist -e -n wakeup_lat 'select end.next_comm as comm, start.pid,
>     start.prio, (end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta
>     from sched_waking as start join sched_switch as end
>     on start.pid = end.next_pid where start.prio < 100'
>
> The above creates a synthetic event called "wakeup_lat" that joins two
> events (sched_waking and sched_switch) when the pid field of sched_waking
> matches the next_pid field of sched_switch. When there is a match, it will
> trigger the wakeup_lat event only if the prio of the sched_waking event is
> less than 100 (which in the kernel means any real-time task). The
> wakeup_lat event will record the next_comm (as comm field), the pid of
> woken task and the time delta in microseconds between the two events.

So this looks like it's gotten better since I last looked, but it's still not
there yet.

Part of the problem is that the tracepoints themselves are in the wrong
place: your end event is when a task is woken up, but that means spurious
wakeups will cause one wait_event() call to be reported as multiple smaller
waits, not one long wait - oops, now I can't actually find the thing that's
causing my multi-second delay.

Also, in your example you don't have it broken out by callsite. That would be
the first thing I'd need for any real world debugging.

So, it looks like tracing has made some progress over the past 10 years, but
for debugging latency issues it's still not there yet in general. I will
definitely remember function latency tracing the next time I'm doing
performance work, but I expect that to be far too heavy to enable on a live
server.

This thing is only a couple hundred lines of code though, so perhaps tracing
shouldn't be the only tool in our toolbox :)
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 12:39:11PM -0700, Suren Baghdasaryan wrote:
> kmemleak is known to be slow and it's even documented [1], so I hope I
> can skip that part. For page_owner to provide the comparable
> information we would have to capture the call stacks for all page
> allocations unlike our proposal which allows to do that selectively
> for specific call sites. I'll post the overhead numbers of call stack
> capturing once I'm finished with profiling the latest code, hopefully
> sometime tomorrow, in the worst case after the long weekend.

To expand on this further: we're stashing a pointer to the alloc_tag, which
is defined at the allocation callsite. That's how we're able to decrement the
proper counter on free, and why this beats any tracing based approach - with
tracing you'd instead have to correlate allocate/free events. Ouch.

> > Yes, tracking back the call trace would be really needed. The question
> > is whether this is really prohibitively expensive. How much overhead are
> > we talking about? There is no free lunch here, really. You either have
> > the overhead during runtime when the feature is used or on the source
> > code level for all the future development (with a maze of macros and
> > wrappers).

The full call stack is really not what you want in most applications -
that's what people think they want at first, and why page_owner works the way
it does, but it turns out that combining all the different but related stack
traces _sucks_ (so why were you saving them in the first place?), and then
you have to do a separate memory allocation for each stack trace, which
destroys performance.

> Will post the overhead numbers soon.
> What I hear loud and clear is that we need a kernel command-line kill
> switch that mitigates the overhead for having this feature. That seems
> to be the main concern.
> Thanks,

After looking at this more I don't think we should commit just yet - there's some tradeoffs to be evaluated, and maybe the thing to do first will be to see if we can cut down on the (huge!) number of allocation interfaces before adding more complexity.

The ideal approach, from a performance POV, would be to pass a pointer to the alloc tag to kmalloc() et al., and then we'd have the actual accounting code in one place and use a jump label to skip over it when this feature is disabled.

However, there are _many, many_ wrapper functions in our allocation code, and this approach is going to make the plumbing for the hooks quite a bit bigger than what we have now - and then, do we want to have this extra alloc_tag parameter that's not used when CONFIG_ALLOC_TAGGING=n? It's a tiny cost for an extra unused parameter, but it's a cost - or do we get rid of that with some extra macro hackery (eww, gross)?

If we do the boot parameter before submission, I think we'll have something that's maybe not strictly ideal from a performance POV when CONFIG_ALLOC_TAGGING=y but boot parameter=n, but it'll introduce the minimum amount of macro insanity.

What we should be able to do pretty easily is discard the alloc_tag structs when the boot parameter is disabled, because they're in special elf sections and we already do that (e.g. for .init).
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 12:05:01PM +0100, Mel Gorman wrote:

> As pointed out elsewhere, attaching to the tracepoint and recording relevant state is an option other than trying to parse a raw ftrace feed. For memory leaks, there are already tracepoints for page allocation and free that could be used to track allocations that are not freed at a given point in time.

Page allocation tracepoints are not sufficient for what we're trying to do here, and a substantial amount of effort in this patchset has gone into just getting the hooking locations right - our memory allocation interfaces are not trivial. That's something people should keep in mind when commenting on the size of this patchset, since that's effort that would have to be spent for /any/ complete solution, be it tracepoint based or not.

Additionally, we need to be able to write assertions that verify that our hook locations are correct, that allocations or frees aren't getting double counted or missed - highly necessary given the maze of nested memory allocation interfaces we have (i.e. slab.h), and it's something a tracepoint based implementation would have to account for - otherwise, a tool isn't very useful if you can't trust the numbers it's giving you.

And then you have to correlate the allocate and free events, so that you know which allocate callsite to decrement the amount freed from. How would you plan on doing that with tracepoints?

> There is also the kernel memory leak detector although I never had reason to use it (https://www.kernel.org/doc/html/v6.0-rc3/dev-tools/kmemleak.html) and it sounds like it would be expensive.

Kmemleak is indeed expensive, and in the past I've had issues with it not catching everything (I've noticed the kmemleak annotations growing, so maybe this is less of an issue than it was).
And this is a more complete solution (though not something that could strictly replace kmemleak): strict memory leaks aren't the only issue, it's also drivers unexpectedly consuming more memory than expected.

I'll bet you a beer that when people have had this awhile, we're going to have a bunch of bugs discovered and fixed along the lines of "oh hey, this driver wasn't supposed to be using this 1 MB of memory, I never noticed that before".

> > > It's also unclear *who* would enable this. It looks like it would mostly have value during the development stage of an embedded platform to track kernel memory usage on a per-application basis in an environment where it may be difficult to setup tracing and tracking. Would it ever be enabled in production? Would a distribution ever enable this? If it's enabled, any overhead cannot be disabled/enabled at run or boot time so anyone enabling this would carry the cost without never necessarily consuming the data.
> >
> > The whole point of this is to be cheap enough to enable in production - especially the latency tracing infrastructure. There's a lot of value to always-on system visibility infrastructure, so that when a live machine starts to do something wonky the data is already there.
>
> Sure, there is value but nothing stops the tracepoints being attached as a boot-time service where interested. For latencies, there is already bpf examples for tracing individual function latency over time e.g. https://github.com/iovisor/bcc/blob/master/tools/funclatency.py although I haven't used it recently.

So this is cool, I'll check it out today. Tracing of /function/ latency is definitely something you'd want tracing/kprobes for - that's way more practical than any code tagging-based approach. And if the output is reliable and useful I could definitely see myself using this, thank you.
But for data collection where it makes sense to annotate in the source code where the data collection points are, I see the code-tagging based approach as simpler - it cuts out a whole bunch of indirection.

The diffstat on the code tagging time stats patch is

 8 files changed, 233 insertions(+), 6 deletions(-)

And that includes hooking wait.h - this is really simple, easy stuff.

The memory allocation tracking patches are more complicated because we've got a ton of memory allocation interfaces and we're aiming for strict correctness there - because that tool needs strict correctness in order to be useful.

> Live parsing of ftrace is possible, albeit expensive. https://github.com/gormanm/mmtests/blob/master/monitors/watch-highorder.pl tracks counts of high-order allocations and dumps a report on interrupt as an example of live parsing ftrace and only recording interesting state. It's not tracking state you are interested in but it demonstrates it is possible to rely on ftrace alone and monitor from userspace. It's bit-rotted but can be fixed with

Yeah, if this is as far as people have gotten with ftrace on memory allocations then I don't think tracing is credible here, sorry.

> The ease of use is a criticism as there
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 05:07:06PM +0200, David Hildenbrand wrote:

> Skimming over the patches (that I was CCed on) and skimming over the cover letter, I got the impression that everything after patch 7 is introducing something new instead of refactoring something out.

You skimmed over the dynamic debug patch then...
Re: [RFC PATCH 03/30] Lazy percpu counters
On Thu, Sep 01, 2022 at 10:48:39AM -0400, Steven Rostedt wrote:

> On Thu, 1 Sep 2022 10:32:19 -0400 Kent Overstreet wrote:
>
> > On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > > > +static void lazy_percpu_counter_switch_to_pcpu(struct raw_lazy_percpu_counter *c)
> > > > +{
> > > > +	u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
> > >
> > > Realize that this is incorrect when used under a raw_spinlock_t.
> >
> > Can you elaborate?
>
> All allocations (including GFP_ATOMIC) grab normal spin_locks. When PREEMPT_RT is configured, normal spin_locks turn into a mutex, where as raw_spinlock's do not.
>
> Thus, if this is done within a raw_spinlock with PREEMPT_RT configured, it can cause a schedule while holding a spinlock.

Thanks, I think we should be good here but I'll document it anyways.
Re: [RFC PATCH 27/30] Code tagging based latency tracking
On Thu, Sep 01, 2022 at 09:11:17AM +0200, Peter Zijlstra wrote:

> On Tue, Aug 30, 2022 at 02:49:16PM -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet
> >
> > This adds the ability to easily instrument code for measuring latency. To use, add the following to calls to your code, at the start and end of the event you wish to measure:
> >
> > code_tag_time_stats_start(start_time);
> > code_tag_time_stats_finish(start_time);
> >
> > Stastistics will then show up in debugfs under /sys/kernel/debug/time_stats, listed by file and line number.
> >
> > Stastics measured include weighted averages of frequency, duration, max duration, as well as quantiles.
> >
> > This patch also instruments all calls to init_wait and finish_wait, which includes all calls to wait_event. Example debugfs output:
>
> How can't you do this with a simple eBPF script on top of trace_sched_stat_* and friends?

I know about those tracepoints, and I've never found them to be usable. I've never successfully used them for debugging latency issues, or known anyone who has.

And an eBPF script to do everything this does wouldn't be simple at all. Honestly, the time stats stuff looks _far_ simpler to me than anything involving tracing - and with tracing you have to correlate the start and end events after the fact.
Re: [RFC PATCH 03/30] Lazy percpu counters
On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:

> On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > +static void lazy_percpu_counter_switch_to_pcpu(struct raw_lazy_percpu_counter *c)
> > +{
> > +	u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);
>
> Realize that this is incorrect when used under a raw_spinlock_t.

Can you elaborate?
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 09:00:17AM +0200, Peter Zijlstra wrote:

> On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote:
>
> > It's also unclear *who* would enable this. It looks like it would mostly have value during the development stage of an embedded platform to track kernel memory usage on a per-application basis in an environment where it may be difficult to setup tracing and tracking. Would it ever be enabled in production?
>
> Afaict this is developer only; it is all unconditional code.
>
> > Would a distribution ever enable this?
>
> I would sincerely hope not. Because:
>
> > If it's enabled, any overhead cannot be disabled/enabled at run or boot time so anyone enabling this would carry the cost without never necessarily consuming the data.
>
> this.

We could make it a boot parameter, with the alternatives infrastructure - with a bit of refactoring there'd be a single function call to nop out, and then we could also drop the elf sections as well, so that when built in but disabled the overhead would be practically nil.
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote:

> On 31.08.22 21:01, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> >> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> >>> Whatever asking for an explanation as to why equivalent functionality cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
> >>
> >> Fully agreed and this is especially true for a change this size
> >> 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > In the case of memory allocation accounting, you flat cannot do this with ftrace - you could maybe do a janky version that isn't fully accurate, much slower, more complicated for the developer to understand and debug and more complicated for the end user.
> >
> > But please, I invite anyone who's actually been doing this with ftrace to demonstrate otherwise.
> >
> > Ftrace just isn't the right tool for the job here - we're talking about adding per callsite accounting to some of the fastest fast paths in the kernel.
> >
> > And the size of the changes for memory allocation accounting are much more reasonable:
> > 33 files changed, 623 insertions(+), 99 deletions(-)
> >
> > The code tagging library should exist anyways, it's been open coded half a dozen times in the kernel already.
>
> Hi Kent,
>
> independent of the other discussions, if it's open coded already, does it make sense to factor that already-open-coded part out independently of the remainder of the full series here?

It's discussed in the cover letter; that is exactly how the patch series is structured.

> [I didn't immediately spot if this series also attempts already to replace that open-coded part]

Uh huh. Honestly, some days it feels like lkml is just as bad as slashdot, with people wanting to get in their two cents without actually reading...
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:

> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > Whatever asking for an explanation as to why equivalent functionality cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
>
> Fully agreed and this is especially true for a change this size
> 77 files changed, 3406 insertions(+), 703 deletions(-)

In the case of memory allocation accounting, you flat cannot do this with ftrace - you could maybe do a janky version that isn't fully accurate, much slower, more complicated for the developer to understand and debug and more complicated for the end user.

But please, I invite anyone who's actually been doing this with ftrace to demonstrate otherwise.

Ftrace just isn't the right tool for the job here - we're talking about adding per callsite accounting to some of the fastest fast paths in the kernel.

And the size of the changes for memory allocation accounting are much more reasonable:

 33 files changed, 623 insertions(+), 99 deletions(-)

The code tagging library should exist anyways, it's been open coded half a dozen times in the kernel already. And once we've got that, the time stats code is _also_ far simpler than doing it with ftrace would be.

If anyone here has successfully debugged latency issues with ftrace, I'd really like to hear it. Again, for debugging latency issues you want something that can always be on, and that's not cheap with ftrace - and never mind the hassle of correlating start and end wait trace events, building up histograms, etc. - that's all handled here.

Cheap, simple, easy to use. What more could you want?
Re: [RFC PATCH 10/30] mm: enable page allocation tagging for __get_free_pages and alloc_pages
On Wed, Aug 31, 2022 at 11:11:03AM +0100, Mel Gorman wrote:

> On Tue, Aug 30, 2022 at 02:48:59PM -0700, Suren Baghdasaryan wrote:
> > Redefine alloc_pages, __get_free_pages to record allocations done by these functions. Instrument deallocation hooks to record object freeing.
> >
> > Signed-off-by: Suren Baghdasaryan
>
> > +#ifdef CONFIG_PAGE_ALLOC_TAGGING
> > +
> > #include
> > #include
> >
> > @@ -25,4 +27,37 @@ static inline void pgalloc_tag_dec(struct page *page, unsigned int order)
> > 	alloc_tag_sub(get_page_tag_ref(page), PAGE_SIZE << order);
> > }
> >
> > +/*
> > + * Redefinitions of the common page allocators/destructors
> > + */
> > +#define pgtag_alloc_pages(gfp, order)					\
> > +({									\
> > +	struct page *_page = _alloc_pages((gfp), (order));		\
> > +									\
> > +	if (_page)							\
> > +		alloc_tag_add(get_page_tag_ref(_page), PAGE_SIZE << (order));\
> > +	_page;								\
> > +})
> > +
>
> Instead of renaming alloc_pages, why is the tagging not done in __alloc_pages()? At least __alloc_pages_bulk() is also missed. The branch can be guarded with IS_ENABLED.

It can't be in a function, it has to be in a wrapper macro. alloc_tag_add() is a macro that defines a static struct in a special elf section. That struct holds the allocation counters, and putting it in a special elf section is how the code to list it in debugfs finds it.

Look at the dynamic debug code for prior precedent for this trick in the kernel - that's how it makes pr_debug() calls dynamically controllable at runtime, from debugfs. We're taking that method and turning it into a proper library.

Because all the counters are statically allocated, without even a pointer deref to get to them in the allocation path (one pointer deref to get to them in the deallocate path), that makes this _much, much_ cheaper than anything that could be done with tracing - cheap enough that I expect many users will want to enable it in production.
Re: [RFC PATCH 22/30] Code tagging based fault injection
On Wed, Aug 31, 2022 at 12:37:14PM +0200, Dmitry Vyukov wrote:

> On Tue, 30 Aug 2022 at 23:50, Suren Baghdasaryan wrote:
> >
> > From: Kent Overstreet
> >
> > This adds a new fault injection capability, based on code tagging.
> >
> > To use, simply insert somewhere in your code
> >
> > dynamic_fault("fault_class_name")
> >
> > and check whether it returns true - if so, inject the error. For example
> >
> > if (dynamic_fault("init"))
> > 	return -EINVAL;
>
> Hi Suren,
>
> If this is going to be used by mainline kernel, it would be good to integrate this with fail_nth systematic fault injection: https://elixir.bootlin.com/linux/latest/source/lib/fault-inject.c#L109
>
> Otherwise these dynamic sites won't be tested by testing systems doing systematic fault injection testing.

That's a discussion we need to have, yeah. We don't want two distinct fault injection frameworks, we'll have to have a discussion as to whether this is (or can be) better enough to make a switch worthwhile, and whether a compatibility interface is needed - or maybe there's enough distinct interesting bits in both to make merging plausible?

The debugfs interface for this fault injection code is necessarily different from our existing fault injection - this gives you a fault injection point _per callsite_, which is huge - e.g. for filesystem testing what I need is to be able to enable fault injection points within a given module. I can do that easily with this, not with our current fault injection.

I think the per-callsite fault injection points would also be pretty valuable for CONFIG_FAULT_INJECTION_USERCOPY, too.

OTOH, existing kernel fault injection can filter based on task - this fault injection framework doesn't have that. Easy enough to add, though. Similar for the interval/probability/ratelimit stuff.

fail_function is the odd one out, I'm not sure how that would fit into this model. Everything else I've seen I think fits into this model.
Also, it sounds like you're more familiar with our existing fault injection than I am, so if I've misunderstood anything about what it can do please do correct me.

Interestingly: I just discovered from reading the code that CONFIG_FAULT_INJECTION_STACKTRACE_FILTER is a thing (hadn't noticed it before because it depends on !X86_64 - what?). That's cool, though.
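To make the per-callsite model concrete, here is a minimal userspace sketch of what a dynamic_fault()-style injection point plus a class-prefix enable could look like. All the names and the control interface are hypothetical - the real patch exposes this through debugfs rather than a function call:

```c
#include <string.h>

/* One static fault point per dynamic_fault() expansion, collected in an
 * ELF section exactly like the allocation tags. */
struct dfault {
	const char *cls;	/* fault class, e.g. "demo:init" */
	const char *file;
	int line;
	int enabled;
};

#define dynamic_fault(_cls) ({						\
	static struct dfault _df					\
	__attribute__((section("dfaults"), used)) =			\
		{ (_cls), __FILE__, __LINE__, 0 };			\
	_df.enabled;							\
})

/* Linker-generated bounds of the fault-point section. */
extern struct dfault __start_dfaults[], __stop_dfaults[];

/* Enable every fault point whose class matches a prefix -- roughly
 * "all the injection points in this module". Returns how many matched. */
static int dfault_enable_prefix(const char *prefix)
{
	int n = 0;

	for (struct dfault *d = __start_dfaults; d < __stop_dfaults; d++)
		if (!strncmp(d->cls, prefix, strlen(prefix))) {
			d->enabled = 1;
			n++;
		}
	return n;
}

static int init_demo(void)
{
	if (dynamic_fault("demo:init"))
		return -1;	/* injected failure */
	return 0;
}
```

Because every callsite owns its own struct, a control interface can flip individual sites or whole class prefixes without any of the callsites knowing about each other.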
Re: [RFC PATCH 03/30] Lazy percpu counters
On Wed, Aug 31, 2022 at 11:02:49AM +0100, Mel Gorman wrote:

> On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:
> > From: Kent Overstreet
> >
> > This patch adds lib/lazy-percpu-counter.c, which implements counters that start out as atomics, but lazily switch to percpu mode if the update rate crosses some threshold (arbitrarily set at 256 per second).
> >
> > Signed-off-by: Kent Overstreet
>
> Why not use percpu_counter? It has a per-cpu counter that is synchronised when a batch threshold (default 32) is exceeded and can explicitly sync the counters when required assuming the synchronised count is only needed when reading debugfs.

It doesn't switch from atomic mode to percpu mode when the update rate crosses a threshold like lazy percpu counters does, it allocates all the percpu counters up front - that makes it a non-starter here.

Also, from my reading of the code... wtf is it even doing, and why would I use it at all? This looks like old grotty code from ext3, it's not even using this_cpu_add() - it does preempt_enable()/disable() just for adding to a local percpu counter! Noope.
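For readers unfamiliar with the lazy percpu counter idea, here is a single-threaded userspace sketch of the mode switch. The real code uses atomics, this_cpu_add() and an update-rate estimate rather than a raw update count; the word packing, threshold and layout here are illustrative only:

```c
#include <stdint.h>
#include <stdlib.h>

#define NCPUS 4
#define SWITCH_THRESHOLD 256

/* The counter starts life as a plain integer packed into one word (low
 * bit set), and only pays for a percpu allocation once it turns out to
 * be hot -- rarely-touched counters stay in the cheap packed form. */
struct lazy_counter {
	uintptr_t v;		/* (count << 1) | 1, or pointer to percpu array */
	unsigned long updates;	/* stand-in for the update-rate estimate */
};

static void lazy_counter_init(struct lazy_counter *c)
{
	c->v = 1;		/* count 0, packed "atomic" mode */
	c->updates = 0;
}

static void lazy_counter_add(struct lazy_counter *c, long n, int cpu)
{
	if (c->v & 1) {
		long count = (long)(c->v >> 1);

		if (++c->updates > SWITCH_THRESHOLD) {
			/* Hot: switch to percpu mode, seeding slot 0
			 * with the accumulated count. */
			long *pcpu = calloc(NCPUS, sizeof(*pcpu));

			pcpu[0] = count + n;
			c->v = (uintptr_t)pcpu;
			return;
		}
		c->v = ((uintptr_t)(count + n) << 1) | 1;
	} else {
		((long *)c->v)[cpu] += n;	/* this_cpu_add() in the kernel */
	}
}

static long lazy_counter_read(struct lazy_counter *c)
{
	long sum = 0;

	if (c->v & 1)
		return (long)(c->v >> 1);
	for (int i = 0; i < NCPUS; i++)
		sum += ((long *)c->v)[i];
	return sum;
}
```

The low tag bit is what lets one word serve as either a packed count or a pointer, so the fast path is a single test-and-branch in either mode.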
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote:

> On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > ===
> > > > Code tagging framework
> > > > ===
> > > > Code tag is a structure identifying a specific location in the source code which is generated at compile time and can be embedded in an application-specific structure. Several applications of code tagging are included in this RFC, such as memory allocation tracking, dynamic fault injection, latency tracking and improved error code reporting. Basically, it takes the old trick of "define a special elf section for objects of a given type so that we can iterate over them at runtime" and creates a proper library for it.
> > >
> > > I might be super dense this morning, but what!? I've skimmed through the set and I don't think I get it.
> > >
> > > What does this provide that ftrace/kprobes don't already allow?
> >
> > You're kidding, right?
>
> It's a valid question. From the description, it main addition that would be hard to do with ftrace or probes is catching where an error code is returned. A secondary addition would be catching all historical state and not just state since the tracing started.

Catching all historical state is pretty important in the case of memory allocation accounting, don't you think?

Also, ftrace can drop events. Not really ideal if under system load your memory accounting numbers start to drift.

> It's also unclear *who* would enable this. It looks like it would mostly have value during the development stage of an embedded platform to track kernel memory usage on a per-application basis in an environment where it may be difficult to setup tracing and tracking. Would it ever be enabled in production?
> Would a distribution ever enable this? If it's enabled, any overhead cannot be disabled/enabled at run or boot time so anyone enabling this would carry the cost without never necessarily consuming the data.

The whole point of this is to be cheap enough to enable in production - especially the latency tracing infrastructure. There's a lot of value to always-on system visibility infrastructure, so that when a live machine starts to do something wonky the data is already there.

What we've built here is _far_ cheaper than anything that could be done with ftrace.

> It might be an ease-of-use thing. Gathering the information from traces is tricky and would need combining multiple different elements and that is development effort but not impossible.
>
> Whatever asking for an explanation as to why equivalent functionality cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.

I think perhaps some of the expectation should be on the "ftrace for everything!" people to explain a: how their alternative could even be built and b: how it would compare in terms of performance and ease of use.

Look, I've been a tracing user for many years, and it has its uses, but some of the claims I've been hearing from tracing/bpf people when any alternative tooling is proposed sound like vaporware and bullshitting.
Re: [RFC PATCH 00/30] Code tagging framework and applications
On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:

> On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > ===
> > Code tagging framework
> > ===
> > Code tag is a structure identifying a specific location in the source code which is generated at compile time and can be embedded in an application-specific structure. Several applications of code tagging are included in this RFC, such as memory allocation tracking, dynamic fault injection, latency tracking and improved error code reporting. Basically, it takes the old trick of "define a special elf section for objects of a given type so that we can iterate over them at runtime" and creates a proper library for it.
>
> I might be super dense this morning, but what!? I've skimmed through the set and I don't think I get it.
>
> What does this provide that ftrace/kprobes don't already allow?

You're kidding, right?