On 04/03/2015 08:08 PM, Linus Torvalds wrote:
> On Fri, Apr 3, 2015 at 9:54 AM, Denys Vlasenko <dvlas...@redhat.com> wrote:
>>
>> How about this version?
>> It's still isn't a star of readability,
>> but the structure of the 32-byte code block is more visible now...
>
> Do we really even want to be this clever in the first place?
>
> The thing is, when we take an interrupt:
>
>  (a) the L1 I$ is always cold
>
>  (b) the instruction decoder has never had time to run ahead
>
>  (c) there are usually not that many different interrupts anyway, even
> under load (ie you'd have maybe disk and networking)
>
>  (d) we intentionally spread out the different interrupt vector numbers
>
>  (e) the 32-byte block thing is questionable, most older
> micro-architectures fetch in 16-byte blocks iirc.
>
> So what this tells me is that:
>
>  - (a+b) the jump-to-jump is likely fairly expensive, because even
> though they are in the same cacheline, the front end hasn't gotten
> ahead of anything, so there's no hiding any front end pipeline
> hickups.
>
>  - (c+d) there is likely very little advantage to trying to "pack"
> things in cachelines

Good points.

>  - (d+e) the 7-instructions-in-one-32-byte-block doesn't really sound
> all that big of a win, and it does cause a 16-byte split for some
> interrupt.

No, this doesn't happen. With the current code, no instruction
crosses a 16-byte boundary. Even an 8-byte boundary is never crossed.

> In other words, I'd suggest that we just use simple unconditional
> 5-byte branch instead. Add the two-byte "push" instruction, you have 7
> bytes per interrupt. Align that 7 bytes up to 8, and none of them ever
> cross a 16-byte boundary.
>
> Simple, clean, and slightly bigger in memory footprint, but probably
> not noticeably more so in cache footprint, simply because there
> usually aren't that many active interrupts anyway.
>
> The people who do millions of networking interrupts per second and
> have network cards that steer things to many different interrupts
> already try to make sure that the steering goes to different CPU's -
> otherwise there wouldn't be any *point* to steering things. So that
> particular case of "lots of active interrupts" doesn't have a bigger
> cache footprint *either*, since any particular CPU L1 I$ will still
> only handle a few interrupts.
>
> So you get "only" 4 interrupt cases per 32 bytes rather than 7. But is
> that odd double jump and all this complexity really worth it?
>
> So I really suggest just doing something stupid and straightforward
> (and completely untested) like this:
>
>         .macro push_vector
>          pushq_cfi $(~vector+0x80)
>          jmp common_interrupt
>          .align 8
>         .endm
>
>         vector=FIRST_EXTERNAL_VECTOR
>         .align 64
>     ENTRY(irq_entries_start)
>         .rept 256 /* this number does not need to be exact, just big enough */
>           make_vector
>         .endr
>
> and just be done with it.
>
> (Of course, you have to change the code that knows about the "7
> entries in 32 bytes" patterns too, but that's just going to be much
> simpler now).

I'll send a patch in ~30 minutes.
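
For reference, the arithmetic behind the quoted proposal (2-byte push
plus 5-byte jmp, padded to 8 bytes) can be checked with a small sketch.
This is Python that only models byte offsets, not kernel code. All
names in it are hypothetical, and it assumes that
FIRST_EXTERNAL_VECTOR is 0x20, that the last vector is 0xff, and that
the table starts on an aligned address (the ".align 64" above).

```python
# Illustrative model of the proposed stub layout; not kernel code.
# All names here are hypothetical and exist only for this sketch.

PUSH_IMM8_SIZE = 2   # "push $imm8": opcode byte + 1-byte sign-extended immediate
JMP_REL32_SIZE = 5   # "jmp rel32": opcode byte + 4-byte displacement
STUB_ALIGN = 8       # taken from the ".align 8" in the quoted macro
FIRST_EXTERNAL_VECTOR = 0x20  # assumption: the value used on x86 Linux
LAST_VECTOR = 0xFF            # assumption: highest interrupt vector number


def stub_size():
    """Bytes of real instructions in one stub (push + jmp)."""
    return PUSH_IMM8_SIZE + JMP_REL32_SIZE


def stub_stride():
    """Distance between consecutive stubs after padding to STUB_ALIGN."""
    size = stub_size()
    return (size + STUB_ALIGN - 1) // STUB_ALIGN * STUB_ALIGN


def stub_layout():
    """Return (vector, first_byte, last_byte) per stub, offsets from table start."""
    layout = []
    for i, vector in enumerate(range(FIRST_EXTERNAL_VECTOR, LAST_VECTOR + 1)):
        start = i * stub_stride()
        layout.append((vector, start, start + stub_size() - 1))
    return layout


def crosses(first_byte, last_byte, boundary):
    """True if the byte range [first_byte, last_byte] straddles a boundary."""
    return first_byte // boundary != last_byte // boundary


def pushed_value(vector):
    """The immediate from the quoted macro: ~vector + 0x80."""
    return ~vector + 0x80


if __name__ == "__main__":
    layout = stub_layout()
    values = [pushed_value(v) for v, _, _ in layout]
    print("stub size:", stub_size(), "stride:", stub_stride())
    print("stubs per 32 bytes:", 32 // stub_stride())
    print("any stub crossing 16 bytes:",
          any(crosses(s, e, 16) for _, s, e in layout))
    print("pushed immediates range:", min(values), "to", max(values))
```

The last check shows why the "+0x80" matters: under these
assumptions every pushed value fits in a signed byte, which is what
keeps the push at two bytes.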