* Andi Kleen <a...@linux.intel.com> wrote:

> > So instead of this flat structure, there should at minimum be broad
> > categorization of the various parts of the hardware they relate to:
> > whether they relate to the branch predictor, memory caches, TLB caches,
> > memory ops, offcore, decoders, execution units, FPU ops, etc., etc. - so
> > that they can be queried via 'perf list'.
>
> The categorization is generally on the stem name, which already works fine
> with the existing perf list wildcard support. So for example you only want
> branches.
>
> perf list br*
> ...
> br_inst_exec.all_branches
>        [Speculative and retired branches]
> br_inst_exec.all_conditional
>        [Speculative and retired macro-conditional branches]
> br_inst_exec.all_direct_jmp
>        [Speculative and retired macro-unconditional branches excluding calls
>         and indirects]
> br_inst_exec.all_direct_near_call
>        [Speculative and retired direct near calls]
> br_inst_exec.all_indirect_jump_non_call_ret
>        [Speculative and retired indirect branches excluding calls and returns]
> br_inst_exec.all_indirect_near_return
>        [Speculative and retired indirect return branches]
> ...
>
> Or mid level cache events:
>
> perf list l2*
> ...
> l2_l1d_wb_rqsts.all
>        [Not rejected writebacks from L1D to L2 cache lines in any state]
> l2_l1d_wb_rqsts.hit_e
>        [Not rejected writebacks from L1D to L2 cache lines in E state]
> l2_l1d_wb_rqsts.hit_m
>        [Not rejected writebacks from L1D to L2 cache lines in M state]
> l2_l1d_wb_rqsts.miss
>        [Count the number of modified Lines evicted from L1 and missed L2.
>         (Non-rejected WBs from the DCU.)]
> l2_lines_in.all
>        [L2 cache lines filling L2]
> ...
>
> There are some exceptions, but generally it works this way.

You are missing my point in several ways:

1) Firstly, there are _tons_ of 'exceptions' to the 'stem name' grouping, to
the point that it is unusable for high level grouping of events. Here's the
'stem name' histogram on the SandyBridge event list:

  $ grep EventName pmu-events/arch/x86/SandyBridge_core.json | cut -d\. -f1 | cut -d\" -f4 | cut -d\_ -f1 | sort | uniq -c | sort -n
      1 AGU
      1 BACLEARS
      1 EPT
      1 HW
      1 ICACHE
      1 INSTS
      1 PAGE
      1 ROB
      1 RS
      1 SQ
      2 ARITH
      2 DSB2MITE
      2 ILD
      2 LOAD
      2 LOCK
      2 LONGEST
      2 MISALIGN
      2 SIMD
      2 TLB
      3 CPL
      3 DSB
      3 INST
      3 INT
      3 LSD
      3 MACHINE
      4 CPU
      4 OTHER
      4 PARTIAL
      5 CYCLE
      5 ITLB
      6 LD
      7 L1D
      8 DTLB
     10 FP
     12 RESOURCE
     21 UOPS
     24 IDQ
     25 MEM
     37 BR
     37 L2
    131 OFFCORE

Out of 386 events. This grouping has the following severe problems:

 - that's 41 'stem name' groups, way too many for a first hop high level
   structure. We want the kind of high level categorization I suggested:
   cache, decoding, branches, execution pipeline, memory events, vector unit
   events - broad categories which exist in all CPUs and are
   microarchitecture independent.

 - even these 'stem names' are mostly unstructured and unreadable. The two
   examples you cited are the best cases, and only borderline readable, but
   they cover less than 20% of all events.

 - the 'stem name' concept is not even used consistently: the names are
   essentially a random collection of Intel internal acronyms, which
   occasionally match up with high level concepts. These vendor defined names
   have very poor high level structure.

 - the 'stem names' are totally imbalanced: there's one 'super' category
   'stem name', OFFCORE_RESPONSE, with 131 events in it, and then there are
   super small groups in the list above. That is not well suited to getting a
   good overview of what measurement capabilities the hardware has.

So forget about using 'stem names' as the high level structure. These events
have no high level structure and we should provide that, instead of dumping
380+ events on the unsuspecting user.
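As a rough illustration of the kind of high level categorization argued for above, here is a minimal POSIX shell sketch that maps each event's vendor 'stem name' to one of a handful of broad categories and prints a per-category count, similar to the compact default view a 'perf list' could show. The functions `event_category` and `count_categories` and the stem-to-category table are hypothetical and illustrative only: they are not part of perf, and stems such as CPU, HW or OTHER would need case-by-case decisions in practice.

```shell
#!/bin/sh
# Hypothetical sketch: map vendor event-name stems to broad categories.
# The stem-to-category table is illustrative, not an agreed perf taxonomy.

# Print the broad category for one event name, e.g. "br_inst_exec.all_branches".
event_category() {
    # Stem = text before the first '.' and then the first '_', upper-cased.
    stem=$(printf '%s\n' "$1" | cut -d. -f1 | cut -d_ -f1 | tr '[:lower:]' '[:upper:]')
    case "$stem" in
        L1D|L2|LONGEST)                           echo "caches" ;;
        MEM|LD|LOAD|LOCK|MISALIGN|SQ|OFFCORE)     echo "memory" ;;
        DTLB|ITLB|TLB|EPT|PAGE)                   echo "TLB" ;;
        BR|BACLEARS)                              echo "branches" ;;
        IDQ|DSB|DSB2MITE|ILD|ICACHE|LSD)          echo "decoding" ;;
        UOPS|RS|ROB|RESOURCE|AGU|CYCLE|INST)      echo "execution pipeline" ;;
        FP|SIMD|ARITH)                            echo "FPU/vector" ;;
        *)                                        echo "other" ;;
    esac
}

# Read event names (one per line) on stdin and print "<count> <category>",
# largest category first.
count_categories() {
    while IFS= read -r ev; do
        # Skip empty lines; classify everything else.
        [ -n "$ev" ] && event_category "$ev"
    done | sort | uniq -c | sort -rn
}
```

For example, it could be fed the same SandyBridge event names used in the histogram above, via `grep EventName pmu-events/arch/x86/SandyBridge_core.json | cut -d\" -f4 | count_categories`. A hand-maintained table like this has to be kept per event list, which is part of why the categories would need to be created deliberately rather than derived from the vendor names.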

2) Secondly, categorization and higher level hierarchy should be used to keep
the list manageable. The fact that you can list just a subset if _you_ know
what to search for means nothing to the new user trying to discover events.

A simple 'perf list' should list the high level categories by default, with a
count displayed that shows how many further events are within each category.
(Compacted tree output would be usable as well.)

> The stem could be put into a separate header, but it would seem redundant to
> me.

Higher level categories simply don't exist in these names in any usable form,
so they have to be created. Just redundantly repeating the 'stem name' would be
silly, as stem names are unusable for the purposes of high level
categorization.

> > We don't just want to import the unstructured mess that these event files
> > are - we want to turn them into real structure. We can still keep the messy
> > vendor names as well, like IDQ.DSB_CYCLES, but we want to impose structure
> > as well.
>
> The vendor names directly map to the micro architecture, which is whole point
> of the events. IDQ is a part of the CPU, and is described in the CPU manuals.
> One of the main motivations for adding event lists is to make perf match to
> that documentation.

Your argument is a logical fallacy: there is absolutely no conflict between
supporting quirky vendor names and also having good high level structure and
naming, to make it all accessible to the first time user.

> > 3)
> >
> > There should be good 'perf list' visualization for these events: grouping,
> > individual names, with a good interface to query details if needed. I.e. it
> > should be possible to browse and discover events relevant to the CPU the
> > tool is executing on.
>
> I suppose we could change perf list to give the stem names as section headers
> to make the long list a bit more readable.

No, the 'stem names' are crap - instead we want to create sensible high level
categories and categorize the events. I gave you a few ideas above and in the
previous mail.

> Generally you need to have some knowledge of the micro architecture to use
> these events. There is no way around that.

Here your argument again relies on a logical fallacy: there is absolutely no
conflict between good high level structure and the idea that you need to know
about CPUs to make sense of hardware events that deal with fine internal
details.

Also, you are denying the plain fact that the highest level categories _are_
largely microarchitecture independent: can you show me a single modern
mainstream x86 CPU that doesn't have these broad high level categories:

 - CPU cache
 - memory accesses
 - decoding, branch execution
 - execution pipeline
 - FPU, vector units

?

There's none, and the reason is simple: the high level structure of CPUs is
still dictated by basic physics, and physics is microarchitecture independent.

Lower level structure will inevitably be microarchitecture and sometimes even
model specific - but that's absolutely no excuse to not have good high level
structure.

So these are not difficult concepts at all. Please make an honest effort at
understanding them and responding to them, as properly addressing them is a
must-have for this patch submission.

Thanks,

	Ingo