Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-19 Thread Len Brown
On Mon, Apr 19, 2021 at 3:15 PM Borislav Petkov  wrote:

> All of a sudden you have *every* thread sporting a fat 8K buffer because
> the library decided to use a fat buffer feature for it.
>
> Nope, I don't want that to happen.

For this to happen, every thread would not only have to include/link-with
code that uses AMX, but that code would have to *run*.

I'm sure that the AI guys are super excited about matrix multiplication,
but I have a hard time imagining why grep(1) would find a use for it.

Indeed, if anyone expected AMX to be used by every task, we would have
never gone to the trouble of inventing the XFD hardware to support
the kernel's lazy 8KB buffer allocation.

cheers,
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-19 Thread Len Brown
On Mon, Apr 19, 2021 at 10:15 AM Borislav Petkov  wrote:

> > Tasks are created without an 8KB AMX buffer.
> > Tasks have to actually touch the AMX TILE registers for us to allocate
> > one for them.
>
> When tasks do that it doesn't matter too much - for the library it does!
>
> If the library does that by default and the processes which comprise
> of that pipe I mentioned earlier, get all 8K buffers because the
> underlying library decided so and swinging those buffers around when
> saving/restoring contexts turns out to be a performance penalty, then we
> have lost.
>
> Lost because if that thing goes upstream and this way of using AMX is
> allowed implicitly, there ain't no fixing it anymore once it becomes an
> ABI.
>
> So, that library should ask the kernel whether it supports AMX and only
> use it if it has gotten a positive answer.

Right, the library *does* ask the kernel whether it supports AMX (below).

> And by default that answer
> should be "no" because the majority of processes - that same pipe I keep
> mentioning - don't need it.

Indeed, the default is "no" because most libraries will *not* ask the system
for AMX support (below).  However, if they *did* probe for it,
and they *did* use it, the kernel would not stand in the way of
any of those requests.

> I have no good idea yet how granular that should be - per process, per
> thread, whatever, but there should be a way for the kernel to control
> whether the library uses AMX, AVX512 or whatever fat state is out there
> available.
>
> Then, if a process wants the library to use AMX on its behalf, then it
> can say so and the library can do that, but only after having asked for
> it explicitly.

The ABI works like this:

0. App or library author decides AMX is useful at build-time.

1. App checks CPUID for AMX CPU feature
2. App checks XCR0 for AMX OS support

(if the app touches AMX without these two being TRUE,
 it will suffer the consequence of a #UD when it executes an AMX instruction)

This ABI is how AVX works today.
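
For concreteness, a minimal user-space sketch of steps 1 and 2 (bit
positions per the public AMX documentation -- CPUID.(EAX=7,ECX=0):EDX[24]
enumerates AMX-TILE, and XCR0[18:17] are XTILECFG/XTILEDATA; a real
application would normally get this from the toolchain or a library
rather than hand-rolling it):

#include <stdbool.h>
#include <stdint.h>

#define XCR0_XTILE	((1ULL << 17) | (1ULL << 18))	/* XTILECFG | XTILEDATA */

static bool amx_usable(void)
{
	uint32_t eax, ebx, ecx, edx;

	/* XGETBV would #UD unless CPUID.1:ECX.OSXSAVE[27] is set */
	asm volatile("cpuid" : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
			     : "a"(1), "c"(0));
	if (!(ecx & (1u << 27)))
		return false;

	/* 1. CPU feature: CPUID.(EAX=7,ECX=0):EDX[24] == AMX-TILE */
	asm volatile("cpuid" : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
			     : "a"(7), "c"(0));
	if (!(edx & (1u << 24)))
		return false;

	/* 2. OS support: XCR0[18:17] must both be set */
	asm volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
	return (((uint64_t)edx << 32 | eax) & XCR0_XTILE) == XCR0_XTILE;
}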

What is new with AMX is the ability of the hardware and the OS
to delay the allocation of the context switch buffer until if/when
it is actually needed.

This is transparent, and thus not part of the ABI, unless you count
the absence of a mandated system call as an ABI.

3. The application then touches an AMX register, triggering...
4. #NM, handled by the kernel, which allocates a context switch buffer
for that task and dis-arms XFD.

Yes, we could invent a new system call and mandate that it be called
between #2 and #3.  However, we'd still do #4 in response, so I don't see any
value to that system call.  Indeed, I would advocate that glibc
replace it with a return statement.

So back to the example:
 | grep | awk | sed ...

Sure, if grep grows support for some AI feature that we haven't imagined
yet, then something in its code flow is fully empowered to probe for AMX
and use AMX on AMX hardware.  That is hard to imagine for the programs
above that we know today, but future programs certainly could do this if
they chose to.

thanks,
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-16 Thread Len Brown
On Fri, Apr 16, 2021 at 6:14 PM Andy Lutomirski  wrote:

> My point is that every ...

I encourage you to continue to question everything and trust nobody.
While it may cost you a lot in counseling, it is certainly valuable,
at least to me! :-)

I do request, however, that feedback stay specific, stay technical,
and stay on-topic.
We all have plenty of real challenges we can be tackling with our limited time.

> Is there any authoritative guidance at all on what actually happens,
> performance-wise, when someone does AMX math?

Obviously, I can't speak to the performance of AMX itself pre-production,
and somebody who does that for a living will publish performance data on
or before release day.

What I've told you about the performance side-effects on the system (and
lack thereof) from running AMX code is an authoritative answer, and is as
much as I can tell you today.
If I failed to answer a question about AMX, my apologies; please re-ask it.

And if we learn something new between now and release day that is
relevant to this discussion,
I will certainly request to share it.

Our team (Intel Open Source Technology Center) advocated getting the
existing public AMX documentation published as early as possible.
However, if you are really into the details of how AMX works, you may
also be interested to know that the AMX hardware patent filings are
fully public ;-)

cheers,
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-16 Thread Len Brown
> I get it.  That does not explain why LDMXCSR and VLDMXCSR cause
> pipeline stalls.

Sorry, I thought this thread was about AMX.
I don't know the answer to your LDMXCSR and VLDMXCSR question.


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-16 Thread Len Brown
On Thu, Apr 15, 2021 at 1:47 AM Borislav Petkov  wrote:

> What I'd like to see is 0-overhead for current use cases and only
> overhead for those who want to use it. If that can't be done
> automagically, then users should request it explicitly. So basically you
> blow up the xsave buffer only for processes which want to do AMX.

Indeed, expanding the xsave buffer happens only for tasks that touch
AMX TILE registers.

> And this brings the question about libraries which, if they start using
> AMX by default - which doesn't sound like they will want to because AMX
> reportedly will have only a limited? set of users - if libraries start
> using it by default, then it better be worth the handling of the 8kb
> buffer per process.

I'm not aware of any intent to transparently use AMX for bcopy, like what
happened with AVX-512.  (didn't they undo that mistake?)

> If not, this should also be requestable per process so that a simple
> pipe in Linux:
>
>  | grep | awk | sed ...
>
> and so on is not penalized to allocate and handle by default 8kb for
> *each* process' buffer in that pipe just because each is linking against
> glibc which has detected AMX support in CPUID and is using it too for
> some weird reason like some microbenchmark saying so.

Tasks are created without an 8KB AMX buffer.
Tasks have to actually touch the AMX TILE registers for us to allocate
one for them.

> But my initial question was on the "establishing" part and was asking
> where we have established anything wrt AMX.

The patch set on LKML establishes working AMX Linux support in public.
I am thankful for your public review and feedback on that series, and for
that of others.  I can think of 3 actual bugs that were found in the process.

thanks,
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-16 Thread Len Brown
On Thu, Apr 15, 2021 at 12:24 PM Andy Lutomirski  wrote:
> On Wed, Apr 14, 2021 at 2:48 PM Len Brown  wrote:

> > > ... the transition penalty into and out of AMX code

The concept of 'transition' exists between AVX and SSE instructions
because it is possible to mix both instruction sets and touch different
parts of the same registers.  The "unused" parts of those registers
need to be tracked to assure that data is not lost when mixing.

This concept is moot with AMX, which has its own dedicated registers.
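
(For background, the classic mitigation on the AVX/SSE side is the
VZEROUPPER convention; a sketch -- compilers emit this automatically at
ABI boundaries:)

void avx_then_sse(void)
{
	/* ... 256-bit AVX work leaves the upper YMM halves "dirty" ... */
	asm volatile("vzeroupper");	/* mark the upper halves clean */
	/* ... legacy SSE code now runs without a transition penalty ... */
}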

> What is the actual impact of a trivial function that initializes the
> tile config, does one tiny math op, and then does TILERELEASE?

1. Task takes #NM on first touch of TILE registers
2. Kernel allocates 8KB for that task and dis-arms XFD
3. Kernel context switches XFD with task state

If the task takes a signal *before* TILERELEASE
4. XSAVE transfers AMX state to the signal stack; XRSTOR, the reverse.

If the task context switches *before* TILERELEASE
5. The kernel context switch XSAVES the AMX state to the 8KB context
switch buffer; XRSTOR, the reverse.

If the task takes a signal *after* TILERELEASE
4. XSAVE does NOT transfer AMX data (the state is INIT) to the signal
stack; 8KB is consumed on the signal stack but not written.  XRSTOR, the reverse.

If the task context switches *after* TILERELEASE
5. The kernel context switch ignores INIT=1 AMX state.  The 8KB buffer is
quiescent.
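
To illustrate how the XFD piece fits together, a pseudocode sketch of
steps 1-3 (the names here are invented for exposition; they are not the
actual symbols in the patch series):

/* #NM handler: first touch of the TILE registers (steps 1 and 2) */
void handle_xfd_nm(struct task *t)
{
	if (t->xfd & XFD_XTILEDATA) {
		grow_xstate_buffer(t);		/* +8KB for TILEDATA */
		t->xfd &= ~XFD_XTILEDATA;	/* dis-arm: no further #NM */
		wrmsrl(MSR_IA32_XFD, t->xfd);
	}
}

/* context switch: XFD follows the task (step 3) */
void switch_xfd(struct task *prev, struct task *next)
{
	if (prev->xfd != next->xfd)
		wrmsrl(MSR_IA32_XFD, next->xfd);
}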

As we discussed previously, there is no impact to frequency from either
INIT=0 or INIT=1 AMX state.  Frequency is impacted by *compute*, and since
there isn't any compute in this scenario, there is no frequency impact.

As we discussed previously, for INIT=1 (which the kernel guarantees),
there is also no impact on power.

thanks,
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-14 Thread Len Brown
On Wed, Apr 14, 2021 at 5:58 AM Borislav Petkov  wrote:
>
> On Tue, Apr 13, 2021 at 03:51:50PM -0400, Len Brown wrote:
> > AMX does the type of matrix multiplication that AI algorithms use. In
> > the unlikely event that you or one of the libraries you call are doing
> > the same, then you will be very happy with AMX. Otherwise, you'll
> > probably not use it.
>
> Which sounds to me like AMX is something which should not be enabled
> automatically but explicitly requested. I don't see the majority of the
> processes on the majority of the Linux machines out there doing AI with
> AMX - at least not anytime soon. If it becomes ubiquitous later, we can
> make it automatic then.

I'm pretty sure that the "it isn't my use case of interest, so it
doesn't matter"
line of reasoning has long been established as -EINVAL ;-)


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-14 Thread Len Brown
We all recognize that there are a non-zero number of applications that
fail on AVX-512 hardware because of this issue.

We all recognize that if not addressed, AMX would increase the likelihood
of an application failing due to the too-small-alternative-signal-stack issue.

I thank the ARM community for taking action on this issue and setting the
example of the AT_MINSIGSTKSZ solution to compute and expose required
stack size at run-time.  I further thank H.J. Lu and the libc team for
picking up that ball and shipping the solution to this problem with an
updated ABI in glibc 2.34.

I acknowledge that it is not impossible to fail after that fix -- you can
ignore the ABI, you could hardcode sizes, or you could be statically
linked to the old libc.  But it gets increasingly harder to fail: the
kernel signal series has a new run-time check to prevent data corruption
that could have happened in the past, and the remedy is clear --
rebuild with the new glibc.
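
For reference, the remedied pattern looks roughly like this (a sketch
assuming glibc 2.34+ for _SC_MINSIGSTKSZ and a kernel that exports
AT_MINSIGSTKSZ on x86; the headroom size is illustrative):

#include <signal.h>
#include <stdlib.h>
#include <sys/auxv.h>
#include <unistd.h>

int setup_altstack(void)
{
	/* ask at run-time; don't hardcode the old MINSIGSTKSZ constant.
	 * AT_MINSIGSTKSZ reflects the XSTATE in force (AVX-512, AMX, ...) */
	long min = (long)getauxval(AT_MINSIGSTKSZ);
	if (min <= 0)
		min = sysconf(_SC_MINSIGSTKSZ);

	stack_t ss = { .ss_flags = 0 };
	ss.ss_size = (size_t)min + 64 * 1024;	/* headroom for the handler */
	ss.ss_sp   = malloc(ss.ss_size);
	return ss.ss_sp ? sigaltstack(&ss, NULL) : -1;
}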

Yes, it would have been good if this were done before AVX-512 deployed.

3. Can we do better in the long term?

Assuming the ABI update in #2 addresses the issue with applications that
declare their own alt-sig-stack, we have a backward compatible solution today.

Somewhere in the past, the decision was made that all architectural state
should be exposed to signal handlers.  And a decision was made on x86
that it should be in uncompacted XSTATE format.

There are programs today that count on both of these things being true,
and if we change either of those, we break applications.

But the irony is that there are a vanishingly small number of signal handlers
that actually care at all about that state, and it seems to be wasteful
to give it to them.

So the question is whether to continue giving them all that information,
or to give them a way to decline -- to give the option for signals to be
more lightweight.

We can certainly do this.  The question is if it is important enough to bother.

What applications would notice if signal handlers were faster?
Would those applications be willing to update to opt in to a
new incompatible signal handling ABI, where the kernel
took the time to supply only the state that they request?

thanks,
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-13 Thread Len Brown
On Tue, Apr 13, 2021 at 4:16 PM Andy Lutomirski  wrote:
>
> On Mon, Apr 12, 2021 at 4:46 PM Len Brown  wrote:
> >
> > On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski  wrote:
> >
> > > AMX: Multiplying a 4x4 matrix probably looks *great* in a
> > > microbenchmark.  Do it once and you permanently allocate 8kB (is that
> > > even a constant?  can it grow in newer parts?), potentially hurts all
> > > future context switches, and does who-knows-what to Turbo licenses and
> > > such.
> >
> > Intel expects that AMX will be extremely valuable to key workloads.
> > It is true that you may never run that kind of workload on the machine
> > in front of you,
> > and so you have every right to be doubtful about the value of AMX.
>
> I fully believe that AMX will be amazing when used for the right
> workload.  The problem is that a library may have no way to tell
> whether a workload is the type of computationally intensive workload
> for which it makes sense.  Imagine you have a little function:
>
> int matrix_times_vector(int dim, float *out, const float *matrix,
> const float *vector);
>
> A clever library might use AMX for this.  If dim == 4 and the caller
> is planning to call it in a long, tight loop, maybe this even makes
> sense.  If dim == 4 and it's being called once, AMX is probably a
> losing proposition.  With previous technologies, at least the impact
> was limited to the function itself and maybe once per call to the
> caller.  But now, with AMX, the program that invoked this takes a
> performance and memory hit *forever* if it uses AMX once.

Again...

This is a "clever" library, built with a clever toolchain, so the result
is that TILERELEASE was properly issued at the end of the computation.
Thus the hardware knows that the (volatile) AMX registers are no longer live.

The XSAVE hardware recognizes this INIT=1 condition and
transfers NO DATA "*forever*".  This is true both on context switch (compacted)
where it is automatic, and on (uncompacted) signal delivery, where we
check for this case.

Was that the "performance hit" of concern, or did I miss something?

Yes, it is true that the kernel allocates a context switch buffer for the
lifetime of that task, and it will not be freed until that task exits.

If this proves to be an issue, there is nothing preventing us from
implementing a reclaim scheme for a rarely used buffer.  After recognizing
this situation, we'd simply arm XFD, free the buffer, and from then
onwards, the task behaves exactly as if it had never touched AMX.

However, nobody has yet suggested that would be a common situation
worth an optimization to reclaim that task's 8KB.

> Beyond that, we have the signal handling issue.

I'm unaware of any unresolved feedback on the signal handling series other
than a wistful "wouldn't a new SIGFAIL be more clear (for future apps) than
the existing SIGSEGV?"  I agree with this sentiment, but I don't think we
should hold up a patch that prevents corrupting user data just because a
new signal number to describe the scenario doesn't exist.  Particularly
since the new code that knows about the new SIGFAIL will also have been
compiled with the new glibc, which for most cases will prevent this
scenario in the first place...

> One solution, going
> off of what WIlly mentioned, is:
>
> bool amx_begin(void *signal_save_buffer);
> void amx_end();
>
> In the amx_begin() region, if you get a signal, the AMX state is saved
> in the buffer.  Outside the region, if you get a signal and AMX is in
> use, the kernel will either unceremoniously kill the task or will
> deliver SIGYOUBLEWIT. [0]

I think it is clear that if a new signal ABI is going to be invented, it
should be opt-in on state, so that it can run fast on machines far into
the future by not choosing to opt in on anything.

It isn't clear that changing the signal save state around critical regions
(in multiple threads), so that a single per-process signal handler gets a
different result at different times, is going to make that (new) signal
handler's author especially happy.  More likely, they either always want
the state, or they do not.

> I'm really hoping some userspace people can chime in.

Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-13 Thread Len Brown
Thanks for sharing your perspective, Willy.

I agree that if your application is so sensitive that you need to hand-code
your own memcmp, then linking with (any) new version of (any) dynamic library
is a risk you must consider carefully.

AMX does the type of matrix multiplication that AI algorithms use.
In the unlikely event that you or one of the libraries you call are
doing the same,
then you will be very happy with AMX.  Otherwise, you'll probably not use it.

I acknowledge the issue with the toolchain transparently using AVX-512 for
copying data, and how that approach impacted systems with a poor AVX-512
hardware implementation.  FWIW, I'm not aware of any plans to implicitly
use AMX this way, and I'm not aware of any non-Xeon AMX implementations in
the near future.

cheers,
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-12 Thread Len Brown
On Mon, Apr 12, 2021 at 8:17 PM Thomas Gleixner  wrote:

> I'm fine with that as long the kernel has a way to detect that and can
> kill the offending application/library combo with an explicit
> -SIG_NICE_TRY.

Agreed.  The new run-time check for altsigstack overflow is one place
we can do that.
Let me know if you think of others,

thanks,
Len Brown,
Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-12 Thread Len Brown
On Mon, Apr 12, 2021 at 11:21 AM Andy Lutomirski  wrote:

> AMX: Multiplying a 4x4 matrix probably looks *great* in a
> microbenchmark.  Do it once and you permanently allocate 8kB (is that
> even a constant?  can it grow in newer parts?), potentially hurts all
> future context switches, and does who-knows-what to Turbo licenses and
> such.

Intel expects that AMX will be extremely valuable to key workloads.
It is true that you may never run that kind of workload on the machine
in front of you,
and so you have every right to be doubtful about the value of AMX.

The AMX architectural state size is not expected to change.
Rather, if a "new AMX" has a different state size, it is expected to
use a new feature bit, different from AMX.

The AMX context switch buffer is allocated only if and when a task
touches AMX registers.

Yes, there will be data transfer to and from that buffer when three things
all happen:
1. the data is valid
2. hardware interrupts the application
3. the kernel decides to context switch

There will be no data transfer of AMX architectural state when it is
in INIT state.

As AMX registers are volatile, correct software will always have them
in INIT state before calls, including system calls.
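
To make "INIT state before calls" concrete, a hedged sketch of
well-behaved AMX code using the documented tile intrinsics (requires a
compiler supporting -mamx-tile/-mamx-bf16; the tile-config layout follows
the format in the ISA extensions reference, and the dimensions are
illustrative):

#include <immintrin.h>
#include <stdint.h>

struct tileconfig {
	uint8_t  palette, start_row, reserved[14];
	uint16_t colsb[16];	/* bytes per row, per tile */
	uint8_t  rows[16];	/* rows, per tile */
};

/* C(16x16 fp32) += A x B, one tile each */
void tile_madd(float *c, const void *a, const void *b, long stride)
{
	struct tileconfig cfg = { .palette = 1 };

	for (int i = 0; i < 3; i++) {
		cfg.colsb[i] = 64;
		cfg.rows[i]  = 16;
	}
	_tile_loadconfig(&cfg);
	_tile_loadd(0, c, stride);
	_tile_loadd(1, a, stride);
	_tile_loadd(2, b, stride);
	_tile_dpbf16ps(0, 1, 2);	/* tmm0 += tmm1 * tmm2 (bf16 dot-product) */
	_tile_stored(0, c, stride);
	_tile_release();	/* tiles back to INIT=1 before returning */
}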

I've addressed turbo licenses already.

> Even putting aside all kernel and ABI issues, is it actually a good
> idea for user libraries to transparently use these new features?  I'm
> not really convinced.  I think that serious discussion among userspace
> people is needed.

At the risk of stating the obvious...
Intel's view is that libraries that deliver the most value from the
hardware are a "good thing", and that anything preventing libraries from
getting the most value from the hardware is a "bad thing" :-)

cheers,
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-11 Thread Len Brown
On Fri, Apr 9, 2021 at 5:44 PM Andy Lutomirski  wrote:
>
> On Fri, Apr 9, 2021 at 1:53 PM Len Brown  wrote:
> >
> > On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski  wrote:
> > >
> > > On Wed, Mar 31, 2021 at 3:28 PM Len Brown  wrote:
> > >
> > > > We added compiler annotation for user-level interrupt handlers.
> > > > I'm not aware of it failing, or otherwise being confused.
> > >
> > > I followed your link and found nothing. Can you elaborate?  In the
> > > kernel, we have noinstr, and gcc gives approximately no help toward
> > > catching problems.
> >
> > A search for the word "interrupt" on this page
> > https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes
> > comes to the description of this attribute:
> >
> > __attribute__ ((interrupt))
> >
>
> I read that and I see no mention of anything saying "this will
> generate code that does not touch extended state".  Instead I see,
> paraphrasing, "this will generate code with an ABI that is completely
> inappropriate for use in a user space signal handler".  Am I missing
> something?

Again...

An analogous annotation could be created for fast signals.
gcc can be told exactly what registers and instructions it can use for
that routine.

If somebody can suggest a way to make fast signal handlers faster than
saving only the state that they themselves actually use, I'm all ears.

> > > > dynamic XCR0 breaks the installed base, I thought we had established 
> > > > that.
> > >
> > > I don't think this is at all established.  If some code thinks it
> > > knows the uncompacted XSTATE size and XCR0 changes, it crashes.  This
> > > is not necessarily a showstopper.
> >
> > My working assumption is that crashing applications actually *is* a 
> > showstopper.
> > Please clarify.
>
> I think you're presuming that some program actually does this.  If no
> program does this, it's not an ABI break.

So you agree that a program that uses xgetbv to read XCR0 and compute
XSTATE size for user-space use of XSAVE can break if XCR0 changes during
its lifetime.  But you don't believe such software exists?

> More relevantly, this can only happen in a process that uses XSAVE and
> thinks it knows the size that *also* does the prctl to change XCR0.
> By construction, existing programs can't break unless they load new
> dynamic libraries that break them.

Let's say that a program does math.
It calls a library to do that math.
It doesn't know or care what instructions the library uses to do math.
eg. the library uses SSE on an Atom, and uses AVX512 on a Xeon.

Who calls the new prctl, the program, or the library?

If it is the program, how does it know which instructions the library
wants to use?

If it is the library, then you have just changed XCR0 at run-time, and you
expose breakage of the threading library that has already computed the
XSAVE size.

> > > > We've also established that when running in a VMM, every update to
> > > > XCR0 causes a VMEXIT.
> > >
> > > This is true, it sucks, and Intel could fix it going forward.
> >
> > What hardware fix do you suggest?
> > If a guest is permitted to set XCR0 bits without notifying the VMM,
> > what happens when it sets bits that the VMM doesn't know about?
>
> The VM could have a mask of allowed XCR0 bits that don't exist.
>
> TDX solved this problem *somehow* -- XSETBV doesn't (visibly?) exit on
> TDX.  Surely plain VMX could fix it too.

There are two cases.

1. Hardware that exists today and in the foreseeable future.

VM modification of XCR0 results in VMEXIT to VMM.
The VMM sees bits set by the guest, and so it can accept what
it supports, or send the VM a fault for non-support.

Here it is not possible for the VM to change XCR0 without the VMM knowing.

2. Future Hardware that allows guests to write XCR0 w/o VMEXIT.

Not sure I follow your proposal.

Yes, the VM effectively has a mask of what is supported,
because it can issue CPUID.

The VMM virtualizes CPUID, and needs to know it must not
expose to the VM any state features it doesn't support.
Also, the VMM needs to audit XCR0 before it uses XSAVE,
else the guest could attack or crash the VMM through
buffer overrun.  Is this what you suggest?

If yes, what do you suggest in the years between now and when
that future hardware and VMM exist?

> > > > I thought the goal was to allow new programs to have fast signal 
> > > > handlers.
> > > > By default, those fast signal handlers would have a stable state
> > > > image, and would
> > > > not inherit large architectural state on their stacks, and could thus
> 

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-09 Thread Len Brown
On Wed, Mar 31, 2021 at 6:54 PM Borislav Petkov  wrote:
>
> On Wed, Mar 31, 2021 at 06:28:27PM -0400, Len Brown wrote:
> > dynamic XCR0 breaks the installed base, I thought we had established
> > that.
>
> We should do a clear cut and have legacy stuff which has its legacy
> expectations on the XSTATE layout and not touch those at all.
>
> And then all new apps which will use these new APIs can go and request
> whatever fancy new state constellations we support. Including how they
> want their signals handled, etc.
>
> Fat states like avx512, amx etc will be off by default and apps
> explicitly requesting those, can get them.
>
> That's it.

100% agreement from me!  (does anybody disagree?)

thanks,
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-09 Thread Len Brown
On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski  wrote:
>
> On Wed, Mar 31, 2021 at 3:28 PM Len Brown  wrote:
>
> > We added compiler annotation for user-level interrupt handlers.
> > I'm not aware of it failing, or otherwise being confused.
>
> I followed your link and found nothing. Can you elaborate?  In the
> kernel, we have noinstr, and gcc gives approximately no help toward
> catching problems.

A search for the word "interrupt" on this page
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes
comes to the description of this attribute:

__attribute__ ((interrupt))

> > dynamic XCR0 breaks the installed base, I thought we had established that.
>
> I don't think this is at all established.  If some code thinks it
> knows the uncompacted XSTATE size and XCR0 changes, it crashes.  This
> is not necessarily a showstopper.

My working assumption is that crashing applications actually *is* a showstopper.
Please clarify.

> > We've also established that when running in a VMM, every update to
> > XCR0 causes a VMEXIT.
>
> This is true, it sucks, and Intel could fix it going forward.

What hardware fix do you suggest?
If a guest is permitted to set XCR0 bits without notifying the VMM,
what happens when it sets bits that the VMM doesn't know about?

> > I thought the goal was to allow new programs to have fast signal handlers.
> > By default, those fast signal handlers would have a stable state
> > image, and would
> > not inherit large architectural state on their stacks, and could thus
> > have minimal overhead on all hardware.
>
> That is *a* goal, but not necessarily the only goal.

I fully support coming up with a scheme for fast future-proof signal handlers,
and I'm willing to back that up by putting work into it.

I don't see any other goals articulated in this thread.

thanks,
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-31 Thread Len Brown
On Wed, Mar 31, 2021 at 12:53 PM Andy Lutomirski  wrote:

> But this whole annotation thing will require serious compiler support.
> We already have problems with compilers inlining functions and getting 
> confused about attributes.

We added compiler annotation for user-level interrupt handlers.
I'm not aware of it failing, or otherwise being confused.

Why would compiler support for fast-signals be any more "serious"?

> An API like:
>
> if (get_amx()) {
>  use AMX;
> } else {
>  don’t;
> }
>
> Avoids this problem. And making XCR0 dynamic, for all its faults, at least 
> helps force a degree of discipline on user code.

dynamic XCR0 breaks the installed base, I thought we had established that.

We've also established that when running in a VMM, every update to
XCR0 causes a VMEXIT.

I thought the goal was to allow new programs to have fast signal handlers.
By default, those fast signal handlers would have a stable state
image, and would
not inherit large architectural state on their stacks, and could thus
have minimal overhead on all hardware.


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-31 Thread Len Brown
On Wed, Mar 31, 2021 at 5:42 PM Robert O'Callahan  wrote:
>
> For the record, the benefits of dynamic XCR0 for rr recording
> portability still apply. I guess it'd be useful for CRIU too. We would
> also benefit from anything that incentivizes increased support for
> CPUID faulting.

As previously mentioned, today we don't have an architectural way to
trap a user into the kernel on CPUID,
even though we can do this for a VMM.

But spoofing CPUID isn't a solution to all problems.
The feature really needs to be OFF to prevent users from using it,
even if the supported mechanisms of discovering that feature say "NOT PRESENT".

Today there are plenty of users who will opportunistically try everything
in the cloud and choose the machine that allows them to do something
that other machines will not -- even if it is not officially supported.

If something is not enumerated, it really needs to also be turned off.

cheers,
--Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-31 Thread Len Brown
On Tue, Mar 30, 2021 at 6:01 PM David Laight  wrote:

> > Can we leave it in live registers?  That would be the speed-of-light
> > signal handler approach.  But we'd need to teach the signal handler to
> > not clobber it.  Perhaps that could be part of the contract that a
> > fast signal handler signs?  INIT=0 AMX state could simply sit
> > patiently in the AMX registers for the duration of the signal handler.
> > You can't get any faster than doing nothing :-)
> >
> > Of course part of the contract for the fast signal handler is that it
> > knows that it can't possibly use XRSTOR of the stuff on the stack to
> > necessarily get back to the state of the signaled thread (assuming we
> > even used XSTATE format on the fast signal handler stack, it would
> > forget the contents of the AMX registers, in this example)
>
> gcc will just use the AVX registers for 'normal' code within
> the signal handler.
> So it has to have its own copy of all the registers.
> (Well, maybe you could make the AMX instructions fault,
> but that would need a nested signal delivered.)

This is true, by default, but it doesn't have to be true.

Today, gcc has an annotation for user-level interrupts
https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html#x86-Function-Attributes

An analogous annotation could be created for fast signals.
gcc can be told exactly what registers and instructions it can use for
that routine.
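
For reference, the existing annotation looks like this (the frame struct
is declared by the programmer; per the gcc documentation, such handlers
should be compiled with -mgeneral-regs-only so the compiler stays away
from FP/SIMD state -- exactly the property a fast-signal attribute would
need):

/* build with: gcc -mgeneral-regs-only ... */
struct interrupt_frame {
	unsigned long ip, cs, flags, sp, ss;
};

volatile unsigned long event_count;

__attribute__((interrupt))
void my_handler(struct interrupt_frame *frame)
{
	/* only general-purpose registers may be used in here */
	event_count++;
}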

Of course, this raises the question of what routines that handler calls,
and those would need to be constrained too.

Today signal-safety(7) advises programmers to limit what legacy signal handlers
can call.  There is no reason that a fast-signal-safety(7) could not be created
for the fast path.

> There is also the register save buffer that you need in order
> to long-jump out of a signal handler.
> Unfortunately that is required to work.
> I'm pretty sure the original setjmp/longjmp just saved the stack
> pointer - but that really doesn't work any more.
>
> OTOH most signal handlers don't care - but there isn't a flag
> to sigset() (etc) so ask for a specific register layout.

Right, the idea is to optimize for *most* signal handlers,
since making any changes to *all* signal handlers is intractable.

So the idea is that opting in to a fast signal handler would opt out
of some legacy signal capabilities.  Complete state is one of them,
and thus long-jump is not supported, because the complete state
may not automatically be available.

thanks,
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-30 Thread Len Brown
On Tue, Mar 30, 2021 at 4:20 PM Andy Lutomirski  wrote:
>
>
> > On Mar 30, 2021, at 12:12 PM, Dave Hansen  wrote:
> >
> > On 3/30/21 10:56 AM, Len Brown wrote:
> >> On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski  
> >> wrote:
> >>>> On Mar 30, 2021, at 10:01 AM, Len Brown  wrote:
> >>>> Is it required (by the "ABI") that a user program has everything
> >>>> on the stack for user-space XSAVE/XRSTOR to get back
> >>>> to the state of the program just before receiving the signal?
> >>> The current Linux signal frame format has XSTATE in uncompacted format,
> >>> so everything has to be there.
> >>> Maybe we could have an opt in new signal frame format, but the details 
> >>> would need to be worked out.
> >>>
> >>> It is certainly the case that a signal should be able to be delivered, 
> >>> run “async-signal-safe” code,
> >>> and return, without corrupting register contents.
> >> And so, an acknowledgement:
> >>
> >> We can't change the legacy signal stack format without breaking
> >> existing programs.  The legacy is uncompressed XSTATE.  It is a
> >> complete set of architectural state -- everything necessary to
> >> XRESTOR.  Further, the sigreturn flow allows the signal handler to
> >> *change* any of that state, so that it becomes active upon return from
> >> signal.
> >
> > One nit with this: XRSTOR itself can work with the compacted format or
> > uncompacted format.  Unlike the XSAVE/XSAVEC side where compaction is
> > explicit from the instruction itself, XRSTOR changes its behavior by
> > reading XCOMP_BV.  There's no XRSTORC.
> >
> > The issue with using the compacted format is when legacy software in the
> > signal handler needs to go access the state.  *That* is what can't
> > handle a change in the XSAVE buffer format (either optimized/XSAVEOPT,
> > or compacted/XSAVEC).
>
> The compacted format isn’t compact enough anyway. If we want to keep AMX and 
> AVX512 enabled in XCR0 then we need to further muck with the format to omit 
> the not-in-use features. I *think* we can pull this off in a way that still 
> does the right thing wrt XRSTOR.

Agreed.  Compacted format doesn't save any space when INIT=0, so it is
only a half-step forward.

> If we go this route, I think we want a way for sigreturn to understand a 
> pointer to the state instead of inline state to allow programs to change the 
> state.  Or maybe just to have a way to ask sigreturn to skip the restore 
> entirely.

The legacy approach puts all architectural state on the signal stack
in XSTATE format.

If we make the signal stack smaller with a new fast-signal scheme, we
need to find another place for that state to live.

It can't live in the task context switch buffer.  If we put it there
and then take an interrupt while running the signal handler, then we'd
overwrite the signaled thread's state with the signal handler's state.

Can we leave it in live registers?  That would be the speed-of-light
signal handler approach.  But we'd need to teach the signal handler to
not clobber it.  Perhaps that could be part of the contract that a
fast signal handler signs?  INIT=0 AMX state could simply sit
patiently in the AMX registers for the duration of the signal handler.
You can't get any faster than doing nothing :-)

Of course part of the contract for the fast signal handler is that it
knows that it can't possibly use XRSTOR of the stuff on the stack to
necessarily get back to the state of the signaled thread (assuming we
even used XSTATE format on the fast signal handler stack, it would
forget the contents of the AMX registers, in this example)

Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-30 Thread Len Brown
On Tue, Mar 30, 2021 at 1:06 PM Andy Lutomirski  wrote:

> > On Mar 30, 2021, at 10:01 AM, Len Brown  wrote:

> > Is it required (by the "ABI") that a user program has everything
> > on the stack for user-space XSAVE/XRSTOR to get back
> > to the state of the program just before receiving the signal?
>
> The current Linux signal frame format has XSTATE in uncompacted format,
> so everything has to be there.
> Maybe we could have an opt in new signal frame format, but the details would 
> need to be worked out.
>
> It is certainly the case that a signal should be able to be delivered, run 
> “async-signal-safe” code,
> and return, without corrupting register contents.

And so, an acknowledgement:

We can't change the legacy signal stack format without breaking
existing programs.  The legacy is uncompressed XSTATE.  It is a
complete set of architectural state -- everything necessary to
XRSTOR.  Further, the sigreturn flow allows the signal handler to
*change* any of that state, so that it becomes active upon return from
signal.

And a proposal:

Future programs, which know that they don't need the full-blown legacy
signal stack format, can opt in to a new format.  That new format can be
minimal (fast) by default.  Perhaps, as Noah suggests, it could have some
sort of mechanism where the program explicitly selects which state
components it wants included on its signal stack, and restored by
sigreturn.
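
Purely as a strawman -- none of the names below exist -- such an opt-in
might look like this, with registration naming the XSTATE components the
handler wants saved and restored, defaulting to none (the fast path):

#include <signal.h>
#include <stdint.h>

/* HYPOTHETICAL -- illustrative only; no such ABI exists or is proposed */
struct sigaction_x {
	struct sigaction sa;
	uint64_t         xstate_mask;	/* XCR0-style component bitmap */
};

extern int sigaction_x(int, const struct sigaction_x *,
		       struct sigaction_x *);	/* hypothetical wrapper */

void register_fast_handler(void (*handler)(int, siginfo_t *, void *))
{
	struct sigaction_x sax = { 0 };

	sax.sa.sa_sigaction = handler;
	sax.sa.sa_flags     = SA_SIGINFO;
	sax.xstate_mask     = 0;	/* no extended state on my signal stack */
	sigaction_x(SIGUSR1, &sax, NULL);
}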

If the new fast-signal format is successful, then in a number of years it
will have spread to take over the world.

thoughts?

Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-30 Thread Len Brown
Andy,

I agree, completely, with your description of the challenge,
thank you for focusing the discussion on that problem statement.

Question:

Is it required (by the "ABI") that a user program has everything
on the stack for user-space XSAVE/XRSTOR to get back
to the state of the program just before receiving the signal?

My understanding is that there are programs that do this.
However, if it is not guaranteed to work, that could greatly simplify
what we are required to put on the signal stack.

thanks,
-Len


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-30 Thread Len Brown
On Tue, Mar 30, 2021 at 4:28 AM Thomas Gleixner  wrote:
>
> Len,
>
> On Mon, Mar 29 2021 at 18:16, Len Brown wrote:
> > On Mon, Mar 29, 2021 at 2:49 PM Thomas Gleixner  wrote:
> > Let me know if this problem description is fair:
> >
> > Many-core Xeon servers will support AMX, and when I run an AMX application
> > on one, when I take an interrupt with AMX INIT=0, Linux may go idle on my 
> > CPU.
> > If Linux cpuidle requests C6, the hardware will demote to C1E.
> >
> > The concern is that a core in C1E will negatively impact power of
> > self, or performance
> > of a neighboring core.
> >
> > This is what we are talking about, right?
>
> Correct.
>
> > I'm delighted that there are Xeon customers, who care about this power 
> > savings.
> > Unfortunately, they are the exception, not the rule.
>
> That maybe true or not. The point is that there is some side effect and
> from a correctness point of view it needs to be addressed.

I don't see how demoting from C6 to C1E is a "correctness" issue.
There is nothing "incorrect" about demoting to C1E when software permits C6,
and as I mentioned, this happens all the time.
All architectural state, including the AMX state, is preserved by hardware.

I do agree that there is the possibility that this scenario may result in
less-than-optimal power savings.
It isn't clear how often that situation might occur.

> >>- Use TILERELEASE on context switch after XSAVES?
> >
> > Yes, that would be perfectly reasonable.
> >
> >>- Any other mechanism on context switch
> >
> > XRSTOR of a context with INIT=1 would also do it.
> >
> >>- Clear XFD[18] when going idle and issue TILERELEASE depending
> >>  on the last state
> >
> > I think you mean to *set* XFD.
> > When the task touched AMX, it took a #NM, and we cleared XFD for that task.
> > So when we get here, XFD is already clear (unarmed).
> > Nevertheless, the setting of XFD is moot here.
>
> No. We set XFD when the task which used AMX schedules out. If the CPU
> reaches idle without going back to user space then XFD is still set and
> AMX INIT still 0. So my assumption was that in order to issue
> TILERELEASE before going idle, you need to clear XFD[18] first, but I
> just saw in the spec that it is not necessary.

Right, XFD setting is moot here.

> >>- Use any other means to set the thing back into INIT=1 state when
> >>  going idle
> >
> > TILERELEASE and XRSTOR are the tools in the toolbox, if necessary.
> >
> >> There is no option 'shrug and ignore' unfortunately.
> >
> > I'm not going to say it is impossible that this path will matter.
> > If some terrible things go wrong with the hardware, and the hardware
> > is configured and used in a very specific way, yes, this could matter.
>
> So then let me summarize for the bare metal case:
>
>1) The paragraph in the specification is unclear and needs to be
>   rephrased.
>
>   What I understood from your explanations so far:
>
>   When AMX is disabled by clearing XCR0[18:17], by clearing
>   CR4.OSXSAVE, or by setting IA32_XFD[18], then there are no
>   negative side effects due to AMX INIT=0 as long as the CPU is
>   executing.

Right.

>   The only possible side effect is when the CPU goes idle because
>   AMX INIT=0 limits C states.

Right.

>2) As a consequence of #1 there is no further action required on
>   context switch when XFD[18] is set.

I agree.

>3) When the CPU goes idle with AMX INIT=0 then the idle code should
>   invoke TILERELEASE. Maybe the condition is not even necessary and
>   TILERELEASE can be invoked unconditionally before trying to enter
>   idle.
>
> If that's correct, then this should be part of the next series.

If you insist, then that is fine with me.

Personally, I would prefer to be able to measure an actual benefit
before adding code, but this one is small, and so I'm not religious about it.
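
For the record, the change discussed in #3 is small; an illustrative
sketch (approximate names, not the actual patch):

/* before requesting a deep C-state on this CPU */
static inline void amx_idle_prepare(void)
{
	if (!boot_cpu_has(X86_FEATURE_AMX_TILE))
		return;
	/* unconditional is fine: TILERELEASE is cheap, and testing for
	 * AMX INIT=0 first would likely cost more than just issuing it */
	asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0");	/* TILERELEASE */
}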

> > In the grand scheme of things, this is a pretty small issue, say,
> > compared to the API discussion.
>
> No argument about that.
>
> Thanks,
>
> tglx



-- 
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-29 Thread Len Brown
On Mon, Mar 29, 2021 at 2:16 PM Andy Lutomirski  wrote:
>
>
> > On Mar 29, 2021, at 8:47 AM, Len Brown  wrote:
> >
> > On Sat, Mar 27, 2021 at 5:58 AM Greg KH  wrote:
> >>> On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote:
> >>> Hi Andy,
> >>> Say a mainline links with a math library that uses AMX without the
> >>> knowledge of the mainline.
> >
> > sorry for the confusion.
> >
> > mainline = main().
> >
> > ie. the part of the program written by you, and not the library you linked 
> > with.
> >
> > In particular, the library may use instructions that main() doesn't know 
> > exist.
>
> If we pretend for a bit that AMX were a separate device instead of a part of 
> the CPU, this would be a no brainer: something would be responsible for 
> opening a device node or otherwise requesting access to the device.
>
> Real AMX isn’t so different. Programs acquire access either by syscall or by 
> a fault, they use it, and (hopefully) they release it again using 
> TILERELEASE. The only thing special about it is that, supposedly, acquiring 
> and releasing access (at least after the first time) is quite fast.  But 
> holding access is *not* free — despite all your assertions to the contrary, 
> the kernel *will* correctly context switch it to avoid blowing up power 
> consumption, and this will have overhead.
>
> We’ve seen the pattern of programs thinking that, just because something is a 
> CPU insn, it’s free and no thought is needed before using it. This happened 
> with AVX and AVX512, and it will happen again with AMX. We *still* have a 
> significant performance regression in the kernel due to screwing up the AVX 
> state machine, and the only way I know about any of the details is that I 
> wrote silly test programs to try to reverse engineer the nonsensical behavior 
> of the CPUs.
>
> I might believe that Intel has figured out how to make a well behaved XSTATE 
> feature after Intel demonstrates at least once that it’s possible.  That 
> means full documentation of all the weird issues, no new special cases, and 
> the feature actually making sense in the context of XSTATE.  This has not 
> happened.  Let’s list all of them:
>
> - SSE.  Look for all the MXCSR special cases in the pseudocode and tell me 
> with a straight face that this one works sensibly.
>
> - AVX.  Also has special cases in the pseudocode. And has transition issues
> that are still problems and still not fully documented.
>
> - AVX2.  Horrible undocumented performance issues.  Otherwise maybe okay?
>
> - MPX: maybe the best example, but the compat mode part got flubbed and it’s 
> MPX.
>
> - PKRU: Should never have been in XSTATE. (Also, having WRPKRU in the ISA was 
> a major mistake, now unfixable, that seriously limits the usefulness of the 
> whole feature.  I suppose Intel could release PKRU2 with a better ISA and 
> deprecate the original PKRU, but I’m not holding my breath.)
>
> - AVX512: Yet more uarch-dependent horrible performance issues, and Intel has 
> still not responded about documentation.  The web is full of people 
> speculating differently about when, exactly, using AVX512 breaks performance. 
> This is NAKked in kernel until docs arrive. Also, it broke old user programs. 
>  If we had noticed a few years ago, AVX512 enablement would have been 
> reverted.
>
> - AMX: This mess.
>
> The current system of automatic user enablement does not work. We need 
> something better.

Hi Andy,

Can you provide a concise definition of the exact problem(s) this thread
is attempting to address?

Thanks ahead-of-time for excluding "blow up power consumption",
since that paranoia is not grounded in fact.

thanks,
-Len


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Len Brown
On Mon, Mar 29, 2021 at 2:49 PM Thomas Gleixner  wrote:

> According to documentation it is irrelevant whether AMX usage is
> disabled via XCR0, CR4.OSXSAVE or XFD[18]. In any case the effect of
> AMX INIT=0 will prevent C6.
>
> As I explained in great length there are enough ways to get into a
> situation where this can happen and a CPU goes idle with AMX INIT=0.
>
> So what are we supposed to do?

Let me know if this problem description is fair:

Many-core Xeon servers will support AMX; when I run an AMX application on
one and take an interrupt with AMX INIT=0, Linux may go idle on my CPU.
If Linux cpuidle requests C6, the hardware will demote to C1E.

The concern is that a core in C1E will negatively impact its own power, or
the performance of a neighboring core.

This is what we are talking about, right?

First, I should mention that if I threw a dart at a map of Xeons
deployed across the universe, the chances are "significant" that I'd
hit one that is configured with C6 disabled, and this discussion would be moot.

Second, I should mention that Linux cpuidle demotes from deep C-states
to shallow ones all day long.  This is typically due to expected timer
expiration,
and other heuristics.

Third, I should mention that the processor itself demotes from C6 to C1E
for a number of reasons -- basically like what Linux is doing, but in HW.

However, the hardware does have the capability to "un-demote" when it
demotes and recognizes it made a mistake, and that "un-demote" capability
would not be present if the reason for demotion was AMX INIT=0.

Okay, that said, let's assume we have found a system where this problem
could happen, and we use it in a way that makes it happen.  Would we notice?

If your system were profoundly idle, and one or more cores were in C1E,
then it would prevent the SOC from entering Package C6 (if enabled).
Yes, there is a measurable idle power difference between Package C1E
and Package C6.  (indeed, this is why Package C6 exists).

I'm delighted that there are Xeon customers, who care about this power savings.
Unfortunately, they are the exception, not the rule.

If you were to provoke this scenario on many cores simultaneously, then
I expect you could detect a power difference between C1E and CC6.
However, that difference would be smaller than the difference
in power due to the frequency choice of the running cores,
because it is basically just the L2-leakage vs L2-off difference.

Regarding frequency credits for a core being in C1E vs C6:
Yes, this is factored into the frequency credits for turbo mode.
How much impact, I can't say, because that information is not yet available.
However, this is mitigated by the fact that Xeon single-core turbo is
deployed differently than on client parts.  Xeons are deployed more with
multi-core turbo in mind, and so how much you'll notice C1E vs C6 may not
be significant, unless perhaps it happened on all the cores across the
system.

>- Use TILERELEASE on context switch after XSAVES?

Yes, that would be perfectly reasonable.

>- Any other mechanism on context switch

XRSTOR of a context with INIT=1 would also do it.

>- Clear XFD[18] when going idle and issue TILERELEASE depending
>  on the last state

I think you mean to *set* XFD.
When the task touched AMX, it took a #NM, and we cleared XFD for that task.
So when we get here, XFD is already clear (unarmed).
Nevertheless, the setting of XFD is moot here.

>- Use any other means to set the thing back into INIT=1 state when
>  going idle

TILERELEASE and XRSTOR are the tools in the toolbox, if necessary.

> There is no option 'shrug and ignore' unfortunately.

I'm not going to say it is impossible that this path will matter.
If some terrible things go wrong with the hardware, and the hardware
is configured and used in a very specific way, yes, this could matter.

In the grand scheme of things, this is a pretty small issue,
say, compared to the API discussion.

thanks,
Len Brown, Intel Open Source Technology Center




Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Len Brown
On Mon, Mar 29, 2021 at 1:43 PM Andy Lutomirski  wrote:

> > *switching* XCR0 on context switch is slow, but perfectly legal.
>
> How slow is it?  And how slow is switching XFD?  XFD is definitely 
> serializing?

XCR0 writes in a VM guest cause a VMEXIT.
XFD (MSR) writes in a VM guest do not.

I will find out what the relative cost is on bare metal, where VMEXIT
is not an issue.
-- 
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-29 Thread Len Brown
> In particular, the library may use instructions that main() doesn't know 
> exist.

And so I'll ask my question another way.

How is it okay to change the value of XCR0 during the run time of a program?

I submit that it is not, and that is a deal-killer for a request/release API.

e.g., main() doesn't know that the math library wants to use AMX, and
neither does the threading library.  So main() doesn't know to call the
API before either library is invoked.  The threading library starts up and
creates user-space threads based on the initial value from XCR0.  Then the
math library calls the API, which adds bits to XCR0, and then the
user-space context switch in the threading library corrupts data because
the new XCR0 size doesn't match the initial size.
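
Concretely, the pattern that breaks (the CPUID usage below is documented
behavior; the failure scenario is what happens around it):

#include <stdint.h>

/* CPUID.(EAX=0xD,ECX=0):EBX = XSAVE area size for the XCR0 in force NOW */
static uint32_t xsave_size_now(void)
{
	uint32_t eax, ebx, ecx, edx;

	asm volatile("cpuid" : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
			     : "a"(0xd), "c"(0));
	return ebx;
}

/* threading library startup: the size is computed once and cached */
static size_t ctx_size;		/* = xsave_size_now() at init */

/* ... later, a library's request API adds bits to XCR0 ... */
/* every user-space XSAVE now needs more than ctx_size: buffer overrun */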

-Len


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Len Brown
On Mon, Mar 29, 2021 at 11:43 AM Len Brown  wrote:
>
> On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner  wrote:
>
> > > I found the author of this passage, and he agreed to revise it to say this
> > > was targeted primarily at VMMs.
> >
> > Why would this only be a problem for VMMs?
>
> VMMs may have to emulate different hardware for different guest OS's,
> and they would likely "context switch" XCR0 to achieve that.
>
> As switching XCR0 at run-time would confuse the heck out of user-space,
> it was not imagined that a bare-metal OS would do that.

to clarify...
*switching* XCR0 on context switch is slow, but perfectly legal.

*changing* XCR0 during the lifetime of a process, in any of its tasks,
on any of its CPUs, will confuse any software that uses xgetbv/XCR0
to calculate the size of XSAVE buffers for userspace threading.


-- 
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-29 Thread Len Brown
On Sat, Mar 27, 2021 at 5:58 AM Greg KH  wrote:
>
> On Fri, Mar 26, 2021 at 11:39:18PM -0400, Len Brown wrote:
> > Hi Andy,
> >
> > Say a mainline links with a math library that uses AMX without the
> > knowledge of the mainline.

sorry for the confusion.

mainline = main().

i.e., the part of the program written by you, and not the library you linked with.

In particular, the library may use instructions that main() doesn't know exist.

-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Len Brown
On Mon, Mar 29, 2021 at 9:33 AM Thomas Gleixner  wrote:

> > I found the author of this passage, and he agreed to revise it to say this
> > was targeted primarily at VMMs.
>
> Why would this only be a problem for VMMs?

VMMs may have to emulate different hardware for different guest OS's,
and they would likely "context switch" XCR0 to achieve that.

As switching XCR0 at run-time would confuse the heck out of user-space,
it was not imagined that a bare-metal OS would do that.

But yes, if a bare-metal OS doesn't support any threading libraries that
query XCR0 with xgetbv, and doesn't care about the performance impact of
switching XCR0, it could choose to switch XCR0, and would want to
TILERELEASE to assure C6 access, if C6 is enabled.

> > "negative power and performance implications" refers to the fact that
> > the processor will not enter C6 when AMX INIT=0, instead it will demote
> > to the next shallower C-state, eg C1E.
> >
> > (this is because the C6 flow doesn't save the AMX registers)
> >
> > For customers that have C6 enabled, the inability of a core to enter C6
> > may impact the maximum turbo frequency of other cores.
>
> That's the same on bare metal, right?

Yes, the hardware works exactly the same way.

thanks,
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-29 Thread Len Brown
On Sat, Mar 27, 2021 at 6:20 PM Thomas Gleixner  wrote:

> What's the actual downside of issuing TILERELEASE conditionally
> depending on prev->AMX INIT=0? Is it slow or what's the real
> problem here?

TILERELEASE is fast, so there should be no down-side to executing it.
Indeed, checking whether you need to execute it or not will probably take
longer than executing TILERELEASE.  My point (perhaps academic) is that
Linux should not have to know about TILERELEASE, or execute it.

Re: running in the kernel with AMX INIT=0

AMX INIT=0 will prevent C6 on that core.  I don't expect to see this in
the syscall path, though if a user neglects to issue TILERELEASE, there is
nothing forcing them to.

It can certainly happen on the interrupt path, but on the interrupt path I
don't know if we can end up requesting C6 -- perhaps on a forced task
migration?

Re:  frequency credits in the kernel with AMX INIT=0.

It works exactly the same way as AMX INIT=1.
That is to say, the frequency credits don't key off of AMX INIT,
they key off of the actual use of the AMX execution unit, and
the credits free up several orders of magnitude faster
(both for AVX-512 and AMX) on this hardware than in previous generations.

As a result, if we interrupt an AMX program, and run for an extended
period of time in the kernel without XRSTOR to clear out its AMX INIT=0 state,
that will not have any impact on the frequency we run inside the kernel any more
than if it had AMX INIT=1 state.

thanks,
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-29 Thread Len Brown
On Sat, Mar 20, 2021 at 6:14 PM Thomas Gleixner  wrote:
>
> On Sun, Feb 21 2021 at 10:56, Chang S. Bae wrote:
> > +
> > +/* Update MSR IA32_XFD with xfirstuse_not_detected() if needed. */
> > +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
> > +{
> > + if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> > + return;
> > +
> > + if (unlikely(prev->state_mask != next->state_mask))
> > + xdisable_setbits(xfirstuse_not_detected(next));
> > +}
>
> So this is invoked on context switch. Toggling bit 18 of MSR_IA32_XFD
> when it does not match. The spec document says:
>
>   "System software may disable use of Intel AMX by clearing XCR0[18:17], by
>clearing CR4.OSXSAVE, or by setting IA32_XFD[18]. It is recommended that
>system software initialize AMX state (e.g., by executing TILERELEASE)
>before doing so. This is because maintaining AMX state in a
>non-initialized state may have negative power and performance
>implications."
>
> I'm not seeing anything related to this. Is this a recommendation
> which can be ignored or is that going to be duct taped into the code
> base once the first user complains about slowdowns of their non AMX
> workloads on that machine?

I found the author of this passage, and he agreed to revise it to say this
was targeted primarily at VMMs.

"negative power and performance implications" refers to the fact that
the processor will not enter C6 when AMX INIT=0, instead it will demote
to the next shallower C-state, eg C1E.

(this is because the C6 flow doesn't save the AMX registers)

For customers that have C6 enabled, the inability of a core to enter C6
may impact the maximum turbo frequency of other cores.

thanks,
-Len Brown
Intel Open Source Technology Center


Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-26 Thread Len Brown
> 3.3 RECOMMENDATIONS FOR SYSTEM SOFTWARE
>
> System software may disable use of Intel AMX by clearing XCR0[18:17],
> by clearing CR4.OSXSAVE, or by setting
> IA32_XFD[18]. It is recommended that system software initialize AMX
> state (e.g., by executing TILERELEASE)
> before doing so. This is because maintaining AMX state in a
> non-initialized state may have negative power and
> performance implications.

I agree that the wording here about disabling AMX is ominous.

The hardware initializes with AMX disabled.
The kernel probes AMX, enables it in XCR0, and keeps it enabled.

Initially, XFD is "armed" for all tasks.
When a task accesses AMX state, #NM fires, we allocate a context
switch buffer, and we "disarm" XFD for that task.
As we have that buffer in-hand for the lifetime of the task, we never
"arm" XFD for that task again.

XFD is context switched, and so the next time it is set, is when we
are restoring some other task's state.

n.b. I'm describing the Linux flow.  The VMM scenario is a little different.

> Since you reviewed the patch set, I assume you are familiar with how
> Linux manages XSTATE.  Linux does *not* eagerly load XSTATE on context
> switch.  Instead, Linux loads XSTATE when the kernel needs it loaded
> or before executing user code.  This means that the kernel can (and
> does, and it's a performance win) execute kernel thread code and/or go
> idle, *including long-term deep idle*, with user XSTATE loaded.

Yes, this scenario is clear.

There are several cases.

1. Since TMM registers are volatile, a routine using TMM that wants
them to persist across a call must save them,
and will TILERELEASE before invoking that call.  That is the
calling convention,
and I expect that if it is not followed, debugging (of tools) will
occur until it is.

The only way for a user program's XSTATE to be present during the
kernel's call to idle
is if it sleeps via a system call when no other task wants to run
on that CPU.

Since system calls are calls, in this case, AMX INIT=1 during idle.
All deep C-states are enabled, and the idle CPU is able to contribute
its maximum turbo budget to its peers.

2. A correct program with live TMM registers takes an interrupt, and
we enter the kernel with AMX INIT=0.
Yes, we will enter the syscall at the frequency of the app (like
we always do).
Yes, turbo frequency may be limited by the activity of this
processor and its peers (like it always is).

   2a. If we return to the same program, then depending on how long
the syscall runs, we may execute
 the program and the system call code at a frequency lower
than we might if AMX INIT=1 at the time of the interrupt.

   2b. If we context switch to a task that has AMX INIT=1, then any
AMX-imposed limits on turbo
 are immediately gone.

Note for 2b.  4 generations have passed since SKX had significant
delay releasing AVX-512 credits.
The delay in the first hardware that supports AMX should be
negligible for both AVX-512 and AMX.

3. A buggy or purposely bogus program is fully empowered to violate
the programming conventions.
Say such a program called a long sleep, and nothing else wanted to
run on that CPU, so the kernel
went idle with AMX INIT=0.  Indeed, this could prevent the core
from getting into the deepest available
C-state, which could impact the turbo budget of neighboring cores.
However, if that were some kind
of DOS, it would be simpler and more effective to simply hog a CPU
by running code.  Also, as soon
as another thread switches in with INIT=1, there is no concept of
AMX frequency caps. (see note for 2b)

I do not see a situation where the kernel needs to issue TILERELEASE
(though a VMM likely would).
What did I miss?

thanks,
Len Brown, Intel Open Source Technology Center

ps. I will respond to your ABI thoughts on your new ABI thread.


Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-26 Thread Len Brown
On Fri, Mar 26, 2021 at 2:17 PM Borislav Petkov  wrote:
>
> On Fri, Mar 26, 2021 at 01:53:47PM -0400, Len Brown wrote:
> > At Dave's suggestion, we had a 64 *KB* sanity check on this path.
> > Boris forced us to remove it, because we could not tell him
> > how we chose the number 64.
>
> The only 64 I can remember is
>
> #define XSTATE_BUFFER_MAX_BYTES  (64 * 1024)
>
> What does an arbitrary number have to do with signal handling and
> pushing a fat frame on the sigaltstack?

You are right.  If that is where the check was, it was the wrong place.
It should be part of that sanity check code where CPUID is parsed.

thanks,
Len Brown, Intel Open Source Technology Center


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-26 Thread Len Brown
Hi Andy,

Say a mainline links with a math library that uses AMX without the
knowledge of the mainline.
Say the mainline is also linked with a userspace threading library
that thinks it has a concept of XSAVE area size.

Wouldn't the change in XCR0, resulting in XSAVE size change, risk
confusing the threading library?

thanks,
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-26 Thread Len Brown
On Fri, Mar 26, 2021 at 11:48 AM Andy Lutomirski  wrote:

> > I submit, that after the generic XFD support is in place,
> > there is exactly 1 bit that needs to be flipped to enable
> > user applications to benefit from AMX.
>
> The TILERELEASE opcode itself is rather longer than one bit, and the
> supporting code to invoke it at the right time, to avoid corrupting
> user state, and avoid causing performance regressions merely by
> existing will be orders of magnitude more than 1 bit.  Of course, all
> of this is zero bits in the current series because the code is
> missing entirely.

Please explain why the kernel must know about the TILERELEASE
instruction in order for an AMX application to run properly.

> This isn't just about validation.  There's also ABI, performance, and
> correctness.

Thank you for agreeing that this is not about unvalidated features.

> ABI: The AVX-512 enablement *already* broke user ABI.  Sadly no one
> told anyone in the kernel community until about 5 years after the
> fact, and it's a bit late to revert AVX-512.  But we don't want to
> enable AMX until the ABI has a reasonable chance of being settled.
> Ditto for future features.  As it stands, if you xstate.enable some
> 16MB feature, the system may well simply fail to boot as too many user
> processes explode.

At Dave's suggestion, we had a 64 *KB* sanity check on this path.
Boris forced us to remove it, because we could not tell him
how we chose the number 64.

I would be delighted to see a check for 64 KB restored, and that
it be a rejection, rather than a warning.  At this point, as there is no way
to go down that path without manually modifying the kernel, it would
devolve into a sanity check for a hardware (CPUID) bug.

> Performance:
>
> We *still* don't know the performance implications of leaving the AMX
> features in use inappropriately.  Does it completely destroy idle?

No.

> Will it literally operate CPUs out of spec such that Intel's
> reliability estimates will be invalidated?

No.

>  (We had that with NVMe APST.  Let's not repeat this with XSTATE.)

I acknowledge that the possibility of broken hardware always exists.
However, I don't see how the experience with broken NVMe actually applies here,
other than general paranoia about new features (which is, arguably, healthy).

>  The performance impacts
> and transitions for AVX-512 are, to put it charitably, forthcoming.

I acknowledge the parallels with AVX-512, in that AMX adds new instructions,
and it has even bigger registers.  I also acknowledge that the AVX-512 rollout
(and arguably, its brief existence on client CPUs) was problematic.

My understanding is that Intel continues to learn (a lot) from its mistakes.
I believe that the AVX-512 credits problem has been largely eliminated
on newer Xeons.

My understanding is that AMX is implemented only in CPUs that actually
have the hardware to properly support AMX.  If it were not, then that would
be a problem for Intel to deal with in hardware, not a problem for Linux
to deal with in software.

> Correctness: PKRU via the kernel's normal XSAVE path would simply be
> incorrect.  Do we really trust that this won't get repeated?  Also,
> frankly, a command line option that may well break lots of userspace
> but that we fully expect Intel to recommend setting is not a good
> thing.

There is no analogy between AMX and PKRU, except the fact that they
are both features, and at one time, both were new.

I am unaware of anybody at Intel recommending that any cmdline
be set that would break userspace.

thanks,
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-26 Thread Len Brown
On Thu, Mar 25, 2021 at 9:50 PM Thomas Gleixner  wrote:

> Please provide the architectural document which guarantees that and does
> so in a way that it can be evaluated by the kernel. Have not seen that,
> so it does not exist at all.
>
>   Future CPUID attributes are as useful as the tweet of today.

I will do so the moment I am permitted.
I'm fine with dropping patch 22 until it can rely on the assurance of
that architectural feature.

thanks,
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-26 Thread Len Brown
On Thu, Mar 25, 2021 at 9:42 PM Andy Lutomirski  wrote:

> Regardless of what you call AMX, AMX requires kernel enabling.

I submit, that after the generic XFD support is in place,
there is exactly 1 bit that needs to be flipped to enable
user applications to benefit from AMX.

I submit that the patch that knows about AMX and double-checks the
state size is superfluous.

I submit that updating /proc/cpuinfo is superfluous.

What AMX-specific kernel enabling did I miss?

> Specifically, it appears that leaving AMX in use in the XINUSE sense
> degrades system performance and/or power.

Please share the specifics about what performance or power issue you anticipate.

thanks,
-Len


Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-26 Thread Len Brown
On Thu, Mar 25, 2021 at 7:10 PM Dave Hansen  wrote:
>
> On 3/25/21 3:59 PM, Len Brown wrote:
> > We call AMX a "simple state feature" -- it actually requires NO KERNEL 
> > ENABLING
> > above the generic state save/restore to fully support userspace AMX
> > applications.
> >
> > While not all ISA extensions can be simple state features, we do expect
> > future features to share this trait, and so we want to be sure that it is 
> > simple
> > to update the kernel to turn those features on (and when necessary, off).
>
> From some IRC chats with Thomas and Andy, I think it's safe to say that
> they're not comfortable blindly enabling even our "simple features".  I
> think we're going to need at least some additional architecture to get
> us to a point where everyone will be comfortable.

Hi Dave,

There is no code in this patch series, including patch 22, that enables
an unvalidated feature by default.

Yes, I fully accept that patch 22 allows a user to enable something
that a distro didn't validate.

If there is a new requirement that the kernel cmdline not allow anything
that a distro didn't explicitly validate, then about 99.9% of the kernel cmdline
options that exist today would need to be removed.

Does such a requirement exist, or does it not?

thanks,
-Len


Re: [PATCH v4 22/22] x86/fpu/xstate: Introduce boot-parameters to control state component support

2021-03-25 Thread Len Brown
On Sat, Mar 20, 2021 at 4:57 PM Thomas Gleixner  wrote:

> We won't enable features which are unknown ever. Keep that presilicon
> test gunk where it belongs: In the Intel poison cabinet along with the
> rest of the code which nobody ever want's to see.

I agree, it would be irresponsible to enable unvalidated features by default,
and pre-silicon "test gunk" should be kept out of the upstream kernel.

This patch series is intended solely to enable fully validated
hardware features,
with product quality kernel support.

The reason that the actual AMX feature isn't mentioned until the 16th
patch in this series
is that all of the patches before it are generic state save/restore patches
that are not actually specific to AMX.

We call AMX a "simple state feature" -- it actually requires NO KERNEL ENABLING
above the generic state save/restore to fully support userspace AMX
applications.

While not all ISA extensions can be simple state features, we do expect
future features to share this trait, and so we want to be sure that it is simple
to update the kernel to turn those features on (and when necessary, off).

There will be a future CPUID attribute that will help us identify
future simple-state features.
For AMX, of course, we simply know.

So after the generic state management support, the kernel enabling of AMX
is not actually required to run applications.  Just like when a new instruction
is added that re-uses existing state -- the application or library can check
CPUID and just use it.  It is a formality (perhaps an obsolete one) that
we add every feature flag to /proc/cpuinfo for the "benefit" of userspace.

The reason we propose this cmdline switch is
1. Ability of customers to disable a feature right away if an issue is found.
Unlike the CPUid cmdline that works on flags, this is the ability to turn
off a feature based on its state number.  Ie.  There could be 20 features
that use the same state, and you can turn them all off at once this way.

2. Ability of customers to enable a feature that is disabled by default
in their kernel.  Yes, this will taint their kernel (thanks Andy),
but we have customers that want to run the new feature on day 0
before they have got a distro update to change the default, and this
gives them a way to do that.

Yeah, the cmdline syntax is not a user-friendly mnemonic, and I don't know
that making it so would be an improvement.
Like the CPUID cmdline, it is precise, it is future-proof, and it is
used only in special situations.

thanks,
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-24 Thread Len Brown
On Tue, Mar 23, 2021 at 11:15 PM Liu, Jing2  wrote:

> > IMO, the problem with AVX512 state
> > is that we guaranteed it will be zero for XINUSE=0.
> > That means we have to write 0's on saves.

> why "we have to write 0's on saves" when XINUSE=0.
>
> Since due to SDM, if XINUSE=0, the XSAVES will *not* save the data and
> xstate_bv bit is 0; if use XSAVE, it need save the state but
> xstate_bv bit is also 0.
> >   It would be better
> > to be able to skip the write -- even if we can't save the space
> > we can save the data transfer.  (This is what we did for AMX).
> With the XFD feature, when XFD=1, the XSAVE instruction still has to save INIT state
> to the area. So it seems that with XINUSE=0 and XFD=1, the XSAVE(S) instructions
> do the same thing, in that both can help save the data transfer.

Hi Jing, Good observation!

There are 3 cases.

1. Task context switch save into the context switch buffer.
Here we use XSAVES, and as you point out, XSAVES includes
the compaction optimization feature tracked by XINUSE.
So when AMX is enabled, but clean, XSAVES doesn't write zeros.
Further, it omits the buffer space for AMX in the destination altogether!
However, since XINUSE=1 is possible, we have to *allocate* a buffer
large enough to handle the dirty data for when XSAVES can not
employ that optimization.

2. Entry into user signal handler saves into the user space sigframe.
Here we use XSAVE, and so the hardware will write zeros for XINUSE=0,
and for AVX512, we save neither time nor space.

My understanding is that for application compatibility, we can *not* compact
the destination buffer that user-space sees.  This is because existing code
may have adopted fixed-size offsets.  (which is unfortunate)

And so, for AVX512, we both reserve the space, and we write zeros
for clean AVX512 state.

For AMX, we must still reserve the space, but we are not going to write zeros
for clean state.  We do this in software by checking XINUSE=0, and clearing
the xstate_bv for the XSAVE.  As a result, for XINUSE=0, we can skip
writing the zeros, even though we can't compress the space.

3. User space always uses fully uncompacted XSAVE buffers.
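
(For reference, a hedged sketch of the XINUSE check mentioned in case 2:
XGETBV with ECX=1, supported when CPUID.(EAX=0DH,ECX=1):EAX[2] is set,
reports which enabled state components are not in their INIT state --
strictly, XINUSE ANDed with XCR0:)

#include <stdint.h>

static inline uint64_t read_xinuse(void)
{
	uint32_t lo, hi;

	/* requires OSXSAVE; a clear bit means that component is in INIT */
	asm volatile("xgetbv" : "=a" (lo), "=d" (hi) : "c" (1));
	return ((uint64_t)hi << 32) | lo;
}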

> The reason I'm interested in XINUSE denotation is that it might be helpful
> for the XFD MSRs context switch cost during vmexit and vmenter.

As the guest OS may be using XFD, the VMM can not use it for itself.
Rather, the VMM must context switch it when it switches between guests.
(or not expose it to guests at all)

cheers,
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 14/22] x86/fpu/xstate: Expand the xstate buffer on the first use of dynamic user state

2021-03-23 Thread Len Brown
> I have an obnoxious question: do we really want to use the XFD mechanism?

Obnoxious questions are often the most valuable! :-)

> Right now, glibc, and hence most user space code, blindly uses
> whatever random CPU features are present for no particularly good
> reason, which means that all these features get stuck in the XINUSE=1
> state, even if there is no code whatsoever in the process that
> benefits.  AVX512 is bad enough as we're seeing right now.  AMX will
> be much worse if this happens.
>
> We *could* instead use XCR0 and require an actual syscall to enable
> it.  We could even then play games like requiring whomever enables the
> feature to allocate memory for the state save area for signals, and
> signal delivery could save the state and disable the feature, thus
> preventing the signal frame from blowing up to 8 or 12 or who knows
> how many kB.

This approach would have some challenges.

Enumeration today is two parts.
1. CPUID tells you if the feature exists in the HW
2. xgetbv/XCR0 tells you if the OS supports that feature
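
(For concreteness, a minimal sketch of that two-part check for AMX,
using GCC builtins; AMX-TILE is CPUID.(EAX=7,ECX=0):EDX[24], and the
AMX tile states are XCR0 bits 17 and 18:)

#include <stdint.h>
#include <cpuid.h>
#include <immintrin.h>	/* _xgetbv(), compile with -mxsave */

static int amx_usable(void)
{
	uint32_t eax, ebx, ecx, edx;

	/* 1. hardware: CPUID.(EAX=7,ECX=0):EDX[24] = AMX-TILE */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;
	if (!(edx & (1u << 24)))
		return 0;

	/* 2. OS support: XCR0[17] (XTILECFG) and XCR0[18] (XTILEDATA) */
	return (_xgetbv(0) & (3ull << 17)) == (3ull << 17);
}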

Since #2 would be missing, you are right, there would need to be
a new API enabling the user to request the OS to enable support
for that task.

If that new API is not invoked before the user touches the feature,
they die with a #UD.

And so there would need to be some assurance that the API is successfully
called before any library might use the feature.  Is there a practical way
to guarantee that, given that the feature may be used (or not) only by
a dynamically
linked library?

If a library spawns threads and queries the size of XSAVE before the API
is called, it may be confused when that size changes after the API is called.

So a simple question, "who calls the API, and when?" isn't so simple.

Finally, note that XCR0 updates cause a VMEXIT,
while XFD updates do not.

So context switching XCR0 is possible, but is problematic.

The other combination is XFD + API rather than XCR0 + API.
With XFD, the context switching is faster, and the faulting (#NM, plus
the new MSR reporting the #NM cause) is viable.
We have the bit set in XCR0, so no state size advantage.
Still have issues with API logistics.
So we didn't see that the API adds any value, only pain,
over transparent 1st use enabling with XFD and no API.

cheers,
Len Brown, Intel Open Source Technology Center

ps. I agree that unnecessary XINUSE=1 is possible.
Notwithstanding the issues initially deploying AVX512, I am skeptical
that it is common today.  IMO, the problem with AVX512 state
is that we guaranteed it will be zero for XINUSE=0.
That means we have to write 0's on saves.  It would be better
to be able to skip the write -- even if we can't save the space
we can save the data transfer.  (This is what we did for AMX).

pps. your idea of requiring the user to allocate their own signal stack
is interesting.   It isn't really about allocating the stack, though --
the stack of the task that uses the feature is generally fine already.
The opportunity is to allow tasks that do *not* use the new feature to
get away with minimal data transfer and stack size.  As we don't
have the 0's guarantee for AMX, we bought the important part
of that back.


Re: [PATCH v7 0/6] x86: Improve Minimum Alternate Stack Size

2021-03-19 Thread Len Brown
On Wed, Mar 17, 2021 at 6:45 AM Ingo Molnar  wrote:
>
>
> * Ingo Molnar  wrote:
>
> >
> > * Chang S. Bae  wrote:
> >
> > > During signal entry, the kernel pushes data onto the normal userspace
> > > stack. On x86, the data pushed onto the user stack includes XSAVE state,
> > > which has grown over time as new features and larger registers have been
> > > added to the architecture.
> > >
> > > MINSIGSTKSZ is a constant provided in the kernel signal.h headers and
> > > typically distributed in lib-dev(el) packages, e.g. [1]. Its value is
> > > compiled into programs and is part of the user/kernel ABI. The MINSIGSTKSZ
> > > constant indicates to userspace how much data the kernel expects to push 
> > > on
> > > the user stack, [2][3].
> > >
> > > However, this constant is much too small and does not reflect recent
> > > additions to the architecture. For instance, when AVX-512 states are in
> > > use, the signal frame size can be 3.5KB while MINSIGSTKSZ remains 2KB.
> > >
> > > The bug report [4] explains this as an ABI issue. The small MINSIGSTKSZ 
> > > can
> > > cause user stack overflow when delivering a signal.
> >
> > >   uapi: Define the aux vector AT_MINSIGSTKSZ
> > >   x86/signal: Introduce helpers to get the maximum signal frame size
> > >   x86/elf: Support a new ELF aux vector AT_MINSIGSTKSZ
> > >   selftest/sigaltstack: Use the AT_MINSIGSTKSZ aux vector if available
> > >   x86/signal: Detect and prevent an alternate signal stack overflow
> > >   selftest/x86/signal: Include test cases for validating sigaltstack
> >
> > So this looks really complicated, is this justified?
> >
> > Why not just internally round up sigaltstack size if it's too small?
> > This would be more robust, as it would fix applications that use
> > MINSIGSTKSZ but don't use the new AT_MINSIGSTKSZ facility.
> >
> > I.e. does AT_MINSIGSTKSZ have any other uses than avoiding the
> > segfault if MINSIGSTKSZ is used to create a small signal stack?
>
> I.e. if the kernel sees a too small ->ss_size in sigaltstack() it
> would ignore ->ss_sp and mmap() a new sigaltstack instead and use that
> for the signal handler stack.
>
> This would automatically make MINSIGSTKSZ - and other too small sizes
> work today, and in the future.
>
> But the question is, is there user-space usage of sigaltstacks that
> relies on controlling or reading the contents of the stack?
>
> longjmp using programs perhaps?

For the legacy binary that requests a too-small sigaltstack, there are
several choices:

We could detect the too-small stack at sigaltstack(2) invocation and
return an error.
This results in two deal-killing problems:
First, some applications don't check the return value, so the check
would be fruitless.
Second, those that check and error-out may be programs that never
actually take the signal, and so we'd be causing a dusty binary to
exit, when it didn't exit on another system, or another kernel.

Or we could detect the too small stack at signal registration time.
This has the same two deal-killers as above.

Then there is the approach in this patch-set, which detects an
imminent stack overflow at run time.
It has neither of the two problems above, and the benefit that we now
prevent data corruption
that could have been happening on some systems already today.  The
downside is that the dusty binary
that does request the too-small stack can now die at run time.

So your idea of recognizing the problem and conjuring up a sufficient
stack is compelling,
since it would likely "just work", no matter how dumb the program.
But where would the
the sufficient stack come from -- is this a new kernel buffer, or is
there a way to abscond
some user memory?  I would expect a signal handler to look at the data
on its stack
and nobody else will look at that stack.  But this is already an
unreasonable program for
allocating a special signal stack in the first place :-/ So yes, one
could imagine the signal
handler could longjump instead of gracefully completing, and if this
specially allocated
signal stack isn't where the user planned, that could be trouble.

Another idea we discussed was to detect the potential overflow at run-time,
and instead of killing the process, just push the signal onto the
regular user stack.
This might actually work, but it is sort of devious; and it would not
work in the case
where the user overflowed their regular stack already, which may be
the most (only?)
compelling reason that they allocated and declared a special
sigaltstack in the first place...
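
(As an aside, for applications that can be rebuilt, a hedged sketch of
sizing the sigaltstack with the AT_MINSIGSTKSZ aux vector this series
introduces, falling back to the legacy constant on older kernels:)

#include <signal.h>
#include <stdlib.h>
#include <sys/auxv.h>

#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ	51	/* from this series' uapi patch */
#endif

int setup_sigaltstack(void)
{
	stack_t ss;
	size_t sz = getauxval(AT_MINSIGSTKSZ);	/* 0 if kernel lacks it */

	if (sz < MINSIGSTKSZ)
		sz = MINSIGSTKSZ;
	sz += SIGSTKSZ;		/* headroom for the handler's own frames */

	ss.ss_sp = malloc(sz);
	if (!ss.ss_sp)
		return -1;
	ss.ss_size = sz;
	ss.ss_flags = 0;
	return sigaltstack(&ss, NULL);
}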

-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH] tools/power/x86/turbostat: Fix TCC offset bit mask

2021-03-12 Thread Len Brown
Doug,
The offset works for control.

However, it is erroneous to use it for reporting of the actual
temperature, like I did in turbostat.
Thus, I'm going to revert the patch that added its use in turbostat
for the Temperature column.

thanks,
-Len

On Fri, Mar 12, 2021 at 1:26 AM Doug Smythies  wrote:
>
> Hi Len,
>
>
> thank you for your reply.
>
> On Thu, Mar 11, 2021 at 3:19 PM Len Brown  wrote:
> >
> > Thanks for the close read, Doug.
> >
> > This field size actually varies from system to system,
> > but the reality is that the offset is never that big, and so the
> > smaller mask is sufficient.
>
> Disagree.
>
> I want to use an offset of 26.
>
> > Finally, this may all be moot, because there is discussion that using
> > the offset this way is simply erroneous.
>
> Disagree.
> It works great.
> As far as I know/recall I was the only person that responded to Rui's thread
> "thermal/intel: introduce tcc cooling driver" [1]
> And, I spent quite a bit of time doing so.
> However, I agree the response seems different between the two systems
> under test, Rui's and mine.
>
> [1] https://marc.info/?l=linux-pm&m=161070345329806&w=2
>
> >  stay tuned.
>
> O.K.
>
> ... Doug
> >
> > -Len
> >
> >
> > On Sat, Jan 16, 2021 at 12:07 PM Doug Smythies  
> > wrote:
> > >
> > > The TCC offset mask is incorrect, resulting in
> > > incorrect target temperature calculations, if
> > > the offset is big enough to exceed the mask size.
> > >
> > > Signed-off-by: Doug Smythies 
> > > ---
> > >  tools/power/x86/turbostat/turbostat.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/tools/power/x86/turbostat/turbostat.c 
> > > b/tools/power/x86/turbostat/turbostat.c
> > > index 389ea5209a83..d7acdd4d16c4 100644
> > > --- a/tools/power/x86/turbostat/turbostat.c
> > > +++ b/tools/power/x86/turbostat/turbostat.c
> > > @@ -4823,7 +4823,7 @@ int read_tcc_activation_temp()
> > >
> > > target_c = (msr >> 16) & 0xFF;
> > >
> > > -   offset_c = (msr >> 24) & 0xF;
> > > +   offset_c = (msr >> 24) & 0x3F;
> > >
> > > tcc = target_c - offset_c;
> > >
> > > --
> > > 2.25.1
> > >
> >
> >
> > --
> > Len Brown, Intel Open Source Technology Center



-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH] tools/power/x86/turbostat: Fix TCC offset bit mask

2021-03-11 Thread Len Brown
Thanks for the close read, Doug.

This field size actually varies from system to system,
but the reality is that the offset is never that big, and so the
smaller mask is sufficient.

Finally, this may all be moot, because there is discussion that using
the offset this way is simply erroneous.  stay tuned.

-Len


On Sat, Jan 16, 2021 at 12:07 PM Doug Smythies  wrote:
>
> The TCC offset mask is incorrect, resulting in
> incorrect target temperature calculations, if
> the offset is big enough to exceed the mask size.
>
> Signed-off-by: Doug Smythies 
> ---
>  tools/power/x86/turbostat/turbostat.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/power/x86/turbostat/turbostat.c 
> b/tools/power/x86/turbostat/turbostat.c
> index 389ea5209a83..d7acdd4d16c4 100644
> --- a/tools/power/x86/turbostat/turbostat.c
> +++ b/tools/power/x86/turbostat/turbostat.c
> @@ -4823,7 +4823,7 @@ int read_tcc_activation_temp()
>
> target_c = (msr >> 16) & 0xFF;
>
> -   offset_c = (msr >> 24) & 0xF;
> +   offset_c = (msr >> 24) & 0x3F;
>
> tcc = target_c - offset_c;
>
> --
> 2.25.1
>
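
To make the quoted fix concrete, a hedged worked example with a
hypothetical MSR value:

/* Suppose MSR_IA32_TEMPERATURE_TARGET = 0x1A640000:
 *   target_c = (msr >> 16) & 0xFF = 0x64 = 100 C
 *   offset_c = (msr >> 24) & 0x3F = 0x1A =  26 C  ->  tcc = 74 C
 * The old 4-bit mask yields 0x1A & 0xF = 10 C     ->  tcc = 90 C,
 * overstating the activation temperature by 16 C for an offset of 26. */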


-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH V2] rtc: mc146818: Dont test for bit 0-5 in Register D

2021-02-01 Thread Len Brown
Thanks for the update, Thomas.

V1 prevented rc6 automated suspend/resume testing on all 13 of my
local machines.
V2 applied, and they are back in business.

tested-by: Len Brown 

On Mon, Feb 1, 2021 at 2:25 PM Thomas Gleixner  wrote:
>
> The recent change to validate the RTC turned out to be overly tight.
>
> While it cures the problem on the reporters machine it breaks machines
> with Intel chipsets which use bit 0-5 of the D register. So check only
> for bit 6 being 0 which is the case on these Intel machines as well.
>
> Fixes: 211e5db19d15 ("rtc: mc146818: Detect and handle broken RTCs")
> Reported-by: Serge Belyshev 
> Reported-by: Dirk Gouders 
> Signed-off-by: Thomas Gleixner 
> ---
> V2: Provide the actual delta patch. Should have stayed away from
> computers today
> ---
>  drivers/rtc/rtc-cmos.c |4 ++--
>  drivers/rtc/rtc-mc146818-lib.c |4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> --- a/drivers/rtc/rtc-cmos.c
> +++ b/drivers/rtc/rtc-cmos.c
> @@ -805,8 +805,8 @@ cmos_do_probe(struct device *dev, struct
>
> spin_lock_irq(&rtc_lock);
>
> -   /* Ensure that the RTC is accessible. Bit 0-6 must be 0! */
> -   if ((CMOS_READ(RTC_VALID) & 0x7f) != 0) {
> +   /* Ensure that the RTC is accessible. Bit 6 must be 0! */
> +   if ((CMOS_READ(RTC_VALID) & 0x40) != 0) {
> spin_unlock_irq(&rtc_lock);
> dev_warn(dev, "not accessible\n");
> retval = -ENXIO;
> --- a/drivers/rtc/rtc-mc146818-lib.c
> +++ b/drivers/rtc/rtc-mc146818-lib.c
> @@ -21,8 +21,8 @@ unsigned int mc146818_get_time(struct rt
>
>  again:
> spin_lock_irqsave(&rtc_lock, flags);
> -   /* Ensure that the RTC is accessible. Bit 0-6 must be 0! */
> -   if (WARN_ON_ONCE((CMOS_READ(RTC_VALID) & 0x7f) != 0)) {
> +   /* Ensure that the RTC is accessible. Bit 6 must be 0! */
> +   if (WARN_ON_ONCE((CMOS_READ(RTC_VALID) & 0x40) != 0)) {
> spin_unlock_irqrestore(&rtc_lock, flags);
> memset(time, 0xff, sizeof(*time));
> return 0;



-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH v2 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support

2020-12-01 Thread Len Brown
> > On Fri, Nov 20, 2020 at 12:03 AM Andy Lutomirski  wrote:

> We may want to taint the kernel if one of these flags is used

I agree with Andy.
I can imagine a distro would want to see the taint bit set if a user
overrides the default that they ship and support.

Chang,
Please taint when the enable flag is used -- Thanks!

> because,
> frankly, Intel’s track record is poor. Suppose we get a new feature
> with PKRU-like semantics -- switching it blindly using
> XSAVE(C,S,OPT,whatever) would simply be incorrect.

A valid concern.
To mitigate, there is a new CPUID bit on the way to tell us explicitly
which features are simple (require just XSAVE), versus which
features are not simple and require more complex support.
When that bit is locked down, we will add that sanity check here.

> And XFD itself has
> problems — supposedly it’s difficult or impossible to virtualize. It
> wouldn’t utterly shock me if Intel were to drop IA32_XFD_ERR and
> replace it with a new mechanism that’s less janky.
>
> So this whole thing makes me quite nervous.
>
> (I'm not a virtualization expert, but AIUI IA32_XFD_ERR has some
> issues.  If it's too late to fix those issues, Intel could probably
> get away with completely dropping IA32_XFD_ERR from the spec -- OSes
> can handle AMX just fine without it.  Then the next XFD-able feature
> could introduce a new improved way of reporting which feature
> triggered #NM.)

I don't follow the XFD concern.

There are a couple of cases.

If the hypervisor isn't aware of XFD, it neither uses it nor shows it
to the guests.

If the hypervisor is aware of XFD and supports TMUL, it allocates its
own context
switch buffers at boot time, and does not use the XFD optimization to
save space inside the hypervisor.  It can expose XFD to the guests,
(and context switch it when guests switch on/off physical CPUs)

The guest sees two cases: either XFD exists or it doesn't, just like
on real hardware.

thanks,
Len Brown, Intel Open Source Technology Center


Re: [PATCH v2 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support

2020-11-24 Thread Len Brown
On Fri, Nov 20, 2020 at 12:03 AM Andy Lutomirski  wrote:
>
> On Thu, Nov 19, 2020 at 3:37 PM Chang S. Bae  wrote:
> > "xstate.enable=0x6" will enable AMX on a system that does NOT have AMX
> > compiled into XFEATURE_MASK_USER_ENABLED (assuming the kernel is new enough
> > to support this feature).
> >
>
> What's the purpose of xstate.enable?  I can't really imagine it's
> useful for AMX.  I suppose it could be useful for hypothetical
> post-AMX features, but that sounds extremely dangerous.  Intel has
> changed its strategy so many times on XSTATE extensibility that I find
> it quite hard to believe that supporting unknown states is wise.

Not hypothetical -- there are subsequent hardware features coming that
will use the same
exact XSTATE support that this series puts in place for AMX.

We know that when those features ship in new hardware, there will be
a set of customers who want to exercise those features immediately,
but their kernel binary has not yet been re-compiled to see those
features by default.

The purpose of "xstate.enable" is to empower those users to be able to
explicitly enable support using their existing binary.

You are right -- the feature isn't needed to enable AMX, unless somebody went to
the trouble of building a kernel with the AMX source update, but chose
to disable
AMX-specific recognition by default.

thanks,
Len Brown, Intel Open Source Technology Center


Re: [PATCH v2 14/22] x86/fpu/xstate: Inherit dynamic user state when used in the parent

2020-11-24 Thread Len Brown
On Fri, Nov 20, 2020 at 12:08 AM Andy Lutomirski  wrote:
>
> On Thu, Nov 19, 2020 at 3:37 PM Chang S. Bae  wrote:
> >
> > When a new task is created, the kernel copies all the states from the
> > parent. If the parent already has any dynamic user state in use, the new
> > task has to expand the XSAVE buffer to save them. Also, disable the
> > associated first-use fault.
>
> This seems like a mistake.  If init uses AMX for some misguided
> reason, every task on the whole system will end up with AMX state
> allocated.

Andy, you are right -- the child can (and should) start with the
un-expanded context switch buffer, and as a result XFD should be armed
to fire on the child's first access to TMM hardware.

TMM registers are scratchpad; they will never be used to pass globals,
say, from parent to child thread.
Further, they are volatile and caller saved.  The callee can assume
they are empty -- so even by virtue of being in a fork system call,
that state is already gone.

thanks,
Len Brown, Intel Open Source Technology Center


[GIT PULL] turbostat: update to version 20.09.30

2020-11-10 Thread Len Brown
Hi Linus,

Please pull these turbostat related patches.

thanks!
Len Brown, Intel Open Source Technology Center

The following changes since commit e00b62f0b06d0ae2b844049f216807617aff0cdb:

  x86/cpu: Add Lakefield, Alder Lake and Rocket Lake models to the to
Intel CPU family (2020-07-25 12:16:59 +0200)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git turbostat

for you to fetch changes up to 3e9fa9983b9297407c2448114d6d27782d5e2ef2:

  tools/power turbostat: update version number (2020-11-10 11:41:36 -0500)


Alexander A. Klimov (1):
  tools/power turbostat: Replace HTTP links with HTTPS ones:
TURBOSTAT UTILITY

Alexander Monakov (1):
  tools/power turbostat: Build with _FILE_OFFSET_BITS=64

Antti Laakso (1):
  tools/power turbostat: Remove empty columns for Jacobsville

Chen Yu (3):
  tools/power turbostat: Make the energy variable to be 64 bit
  tools/power turbostat: Introduce functions to accumulate RAPL consumption
  tools/power turbostat: Enable accumulate RAPL display

David Arcari (1):
  tools/power turbostat: Fix output formatting for ACPI CST enumeration

Doug Smythies (1):
  tools/power turbostat: Always print idle in the system
configuration header

Kim Phillips (1):
  tools/power turbostat: Support AMD Family 19h

Len Brown (7):
  tools/power turbostat: Print /dev/cpu_dma_latency
  tools/power turbostat: Support additional CPU model numbers
  tools/power turbostat: Skip pc8, pc9, pc10 columns, if they are disabled
  tools/power turbostat: adjust for temperature offset
  tools/power turbostat: harden against cpu hotplug
  powercap: restrict energy meter to root access
  tools/power turbostat: update version number

Ondřej Lysoněk (1):
  tools/power x86_energy_perf_policy: Input/output error in a VM

Prarit Bhargava (1):
  tools/power turbostat: Use sched_getcpu() instead of hardcoded cpu 0

Rafael Antognolli (1):
  tools/power turbostat: Add a new GFXAMHz column that exposes
gt_act_freq_mhz.

 drivers/powercap/powercap_sys.c|   4 +-
 tools/power/x86/turbostat/Makefile |   3 +-
 tools/power/x86/turbostat/turbostat.8  |   2 +-
 tools/power/x86/turbostat/turbostat.c  | 573 -
 .../x86_energy_perf_policy.c   |  67 ++-
 5 files changed, 509 insertions(+), 140 deletions(-)


Re: [PATCH] tools/power turbostat: call pread64 in kernel directly

2020-09-03 Thread Len Brown
> -D_FILE_OFFSET_BITS=64

Applied.

thanks!
-Len

On Mon, Aug 24, 2020 at 12:09 AM Liwei Song  wrote:
>
>
>
> On 8/24/20 04:54, Alexander Monakov wrote:
> > Hi,
> >
> > I am not the original submitter, but I have answers and a proper patch :)
> >
> > On Fri, 21 Aug 2020, Len Brown wrote:
> >
> >> Re: offset size
> >>
> >> The offsets on this file are the MSR offsets.
> >> What MSR are you trying to access at offset 0xc0010299?
> >
> > This MSR is particular is part of AMD RAPL (energy measurements) interface.
> >
> >> Re: pread vs pread64
> >>
> >> If I take on faith that you have some kind of 32-bit execution
> >> environment that makes pread into pread32 instead of pread64, and that
> >> truncates an off_t to 32-bits from 64-bits, and it actually makes
> >> sense to request a read at this large offset...
> >
> > The problem here stems from the backward compatibility in Glibc: off_t is
> > 32-bit on 32-bit x86, unless compiled with -D_FILE_OFFSET_BITS=64. This
> > macro should be used for all new code. Distros should enable it for all
> > builds, but when one builds turbostat 'by hand', they hit the issue.
> >
> >> would we really have to invoke syscall() directly -- couldn't we
> >> invoke pread64() directly? (eg. below)
> >
> > No, the proper fix is to pass -D_FILE_OFFSET_BITS=64 to the compiler.
> >
> > Here's the patch:
>
> This path works with my case.
>
> Thanks,
> Liwei.
>
>
> >
> > ---8<---
> >
> > From: Alexander Monakov 
> > Date: Sun, 23 Aug 2020 23:27:02 +0300
> > Subject: [PATCH] turbostat: build with _FILE_OFFSET_BITS=64
> >
> > For compatibility reasons, Glibc off_t is a 32-bit type on 32-bit x86
> > unless _FILE_OFFSET_BITS=64 is defined. Add this define, as otherwise
> > reading MSRs with index 0x80000000 and above attempts a pread with a
> > negative offset, which fails.
> >
> > Signed-off-by: Alexander Monakov 
> > ---
> >  tools/power/x86/turbostat/Makefile | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/tools/power/x86/turbostat/Makefile 
> > b/tools/power/x86/turbostat/Makefile
> > index 2b6551269e43..40ae44402eec 100644
> > --- a/tools/power/x86/turbostat/Makefile
> > +++ b/tools/power/x86/turbostat/Makefile
> > @@ -12,6 +12,7 @@ turbostat : turbostat.c
> >  override CFLAGS +=   -O2 -Wall -I../../../include
> >  override CFLAGS +=   
> > -DMSRHEADER='"../../../../arch/x86/include/asm/msr-index.h"'
> >  override CFLAGS +=   
> > -DINTEL_FAMILY_HEADER='"../../../../arch/x86/include/asm/intel-family.h"'
> > +override CFLAGS +=   -D_FILE_OFFSET_BITS=64
> >  override CFLAGS +=   -D_FORTIFY_SOURCE=2
> >
> >  %: %.c
> >



-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH] tools/power turbostat: call pread64 in kernel directly

2020-08-21 Thread Len Brown
Re: offset size

The offsets on this file are the MSR offsets.
What MSR are you trying to access at offset 0xc0010299?

Re: pread vs pread64

If I take on faith that you have some kind of 32-bit execution
environment that makes pread into pread32 instead of pread64, and that
truncates an off_t to 32-bits from 64-bits, and it actually makes
sense to request a read at this large offset...

would we really have to invoke syscall() directly -- couldn't we
invoke pread64() directly? (eg. below)

thanks,
-Len


@@ -490,11 +491,12 @@ int get_msr_fd(int cpu)
return fd;
 }

-int get_msr(int cpu, off_t offset, unsigned long long *msr)
+int get_msr(int cpu, unsigned long long offset, unsigned long long *msr)
 {
ssize_t retval;

-   retval = pread(get_msr_fd(cpu), msr, sizeof(*msr), offset);
+   retval = pread64(get_msr_fd(cpu), msr, sizeof(*msr), offset);

if (retval != sizeof *msr)
err(-1, "cpu%d: msr offset 0x%llx read failed", cpu,
(unsigned long long)offset);
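
For reference, the arithmetic behind the failure under discussion:
0xc0010299 is 3221291673, which exceeds INT32_MAX (2147483647), so when
stored in a 32-bit off_t it becomes -1073675623, and the resulting
pread offset is rejected as out of range.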

On Thu, Aug 13, 2020 at 10:17 PM Liwei Song  wrote:
>
> with multilib lib32 support, the rootfs will be 32-bit
> while the kernel is still 64-bit; in this case, running turbostat
> will fail with an "out of range" error.
>
> Thanks,
> Liwei.
>
> On 8/14/20 05:43, Len Brown wrote:
> > Huh?
> >
> > On Fri, Jul 17, 2020 at 2:09 AM Liwei Song  wrote:
> >>
> >> with a 32-bit rootfs, the offset may be out of range when set
> >> to 0xc0010299; define it as "unsigned long long" type and
> >> call pread64 directly in the kernel.
> >>
> >> Signed-off-by: Liwei Song 
> >> ---
> >>  tools/power/x86/turbostat/turbostat.c | 5 +++--
> >>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/tools/power/x86/turbostat/turbostat.c 
> >> b/tools/power/x86/turbostat/turbostat.c
> >> index 33b370865d16..4c5cdfcb5721 100644
> >> --- a/tools/power/x86/turbostat/turbostat.c
> >> +++ b/tools/power/x86/turbostat/turbostat.c
> >> @@ -33,6 +33,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include <sys/syscall.h>
> >>
> >>  char *proc_stat = "/proc/stat";
> >>  FILE *outf;
> >> @@ -381,11 +382,11 @@ int get_msr_fd(int cpu)
> >> return fd;
> >>  }
> >>
> >> -int get_msr(int cpu, off_t offset, unsigned long long *msr)
> >> +int get_msr(int cpu, unsigned long long offset, unsigned long long *msr)
> >>  {
> >> ssize_t retval;
> >>
> >> -   retval = pread(get_msr_fd(cpu), msr, sizeof(*msr), offset);
> >> +   retval = syscall(SYS_pread64, get_msr_fd(cpu), msr, sizeof(*msr), 
> >> offset);
> >>
> >> if (retval != sizeof *msr)
> >> err(-1, "cpu%d: msr offset 0x%llx read failed", cpu, 
> >> (unsigned long long)offset);
> >> --
> >> 2.17.1
> >>
> >
> >



-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH] tools/power turbostat: fix output formatting for ACPI CST enumeration

2020-08-21 Thread Len Brown
Hi Dave,

I think this is fine.
Indeed, I actually went ahead and applied it a week or so ago.

the only alternative that I can think of is actually shortening the
ACPI C-state names in the intel_idle driver -- which is still an
option.  It would not be the first time we have tweaked the names used
in that driver to make tools more happy...

My apologies for neglecting to send you an ACK.
I had intended to send my queued series to the list, which would
suffice for all the ACKs, but that send and the subsequent push got
delayed by this and that.  So I'll try to ack as I go, so it is clear
at any time where a patch stands.

thanks!

Len Brown, Intel Open Source Technology Center


Re: [PATCH] tools/power turbostat: Support Sapphire Rapids

2020-08-21 Thread Len Brown
Already have this one.

thanks!

Len Brown, Intel Open Source Technology Center


Re: [PATCH] tools/power turbostat: Support AMD Family 19h

2020-08-21 Thread Len Brown
Applied.

thanks!
Len Brown, Intel Open Source Technology Center


Re: [PATCH 2/2][RFC] tools/power turbostat: Introduce reliable RAPL display

2020-08-13 Thread Len Brown
 return -16;
> +   p->rapl_pkg_perf_status = msr & 0xFFFFFFFF;
> +   }
> }
> if (do_rapl & RAPL_DRAM_PERF_STATUS) {
> -   if (get_msr(cpu, MSR_DRAM_PERF_STATUS, &msr))
> -   return -16;
> -   p->rapl_dram_perf_status = msr & 0xFFFFFFFF;
> +   if (longtime) {
> +   if (get_msr_sum(cpu, MSR_DRAM_PERF_STATUS, &msr))
> +   return -16;
> +   p->rapl_dram_perf_status = msr;
> +   } else {
> +   if (get_msr(cpu, MSR_DRAM_PERF_STATUS, &msr))
> +   return -16;
> +   p->rapl_dram_perf_status = msr & 0xFFFFFFFF;
> +   }
> }
> if (do_rapl & RAPL_AMD_F17H) {
> if (get_msr(cpu, MSR_PKG_ENERGY_STAT, &msr))
> @@ -3053,6 +3198,109 @@ void do_sleep(void)
> }
>  }
>
> +int get_msr_sum(int cpu, off_t offset, unsigned long long *msr)
> +{
> +   int ret, idx;
> +   unsigned long long msr_cur, msr_last;
> +
> +   if (!per_cpu_msr_sum)
> +   return 1;
> +
> +   idx = offset_to_idx(offset);
> +   if (idx < 0)
> +   return idx;
> +   /* get_msr_sum() = sum + (get_msr() - last) */
> +   ret = get_msr(cpu, offset, &msr_cur);
> +   if (ret)
> +   return ret;
> +   msr_last = per_cpu_msr_sum[cpu].entries[idx].last;
> +   DELTA_WRAP32(msr_cur, msr_last);
> +   *msr = msr_last + per_cpu_msr_sum[cpu].entries[idx].sum;
> +
> +   return 0;
> +}
> +
> +timer_t timerid;
> +
> +/* Timer callback, update the sum of MSRs periodically. */
> +static int update_msr_sum(struct thread_data *t, struct core_data *c, struct 
> pkg_data *p)
> +{
> +   int i, ret;
> +   int cpu = t->cpu_id;
> +
> +   for (i = IDX_PKG_ENERGY; i < IDX_COUNT; i++) {
> +   unsigned long long msr_cur, msr_last;
> +   int offset;
> +
> +   if (!idx_valid(i))
> +   continue;
> +   offset = idx_to_offset(i);
> +   if (offset < 0)
> +   continue;
> +   ret = get_msr(cpu, offset, &msr_cur);
> +   if (ret) {
> +   fprintf(outf, "Can not update msr(0x%x)\n", offset);
> +   continue;
> +   }
> +
> +   msr_last = per_cpu_msr_sum[cpu].entries[i].last;
> +   per_cpu_msr_sum[cpu].entries[i].last = msr_cur & 0xffffffff;
> +
> +   DELTA_WRAP32(msr_cur, msr_last);
> +   per_cpu_msr_sum[cpu].entries[i].sum += msr_last;
> +   }
> +   return 0;
> +}
> +
> +static void
> +msr_record_handler(union sigval v)
> +{
> +   for_all_cpus(update_msr_sum, EVEN_COUNTERS);
> +}
> +
> +void msr_longtime_record(void)
> +{
> +   struct itimerspec its;
> +   struct sigevent sev;
> +
> +   per_cpu_msr_sum = calloc(topo.max_cpu_num + 1, sizeof(struct 
> msr_sum_array));
> +   if (!per_cpu_msr_sum) {
> +   fprintf(outf, "Can not allocate memory for long time MSR.\n");
> +   return;
> +   }
> +   /*
> +* Signal handler might be restricted, so use thread notifier instead.
> +*/
> +   memset(&sev, 0, sizeof(struct sigevent));
> +   sev.sigev_notify = SIGEV_THREAD;
> +   sev.sigev_notify_function = msr_record_handler;
> +
> +   sev.sigev_value.sival_ptr = &timerid;
> +   if (timer_create(CLOCK_REALTIME, &sev, &timerid) == -1) {
> +   fprintf(outf, "Can not create timer.\n");
> +   goto release_msr;
> +   }
> +
> +   its.it_value.tv_sec = 0;
> +   its.it_value.tv_nsec = 1;
> +   /*
> +* A wraparound time of around 60 secs when power consumption
> +* is high, use 50 secs.
> +*/
> +   its.it_interval.tv_sec = 50;
> +   its.it_interval.tv_nsec = 0;
> +
> +   if (timer_settime(timerid, 0, &its, NULL) == -1) {
> +   fprintf(outf, "Can not set timer.\n");
> +   goto release_timer;
> +   }
> +   return;
> +
> + release_timer:
> +   timer_delete(timerid);
> + release_msr:
> +   free(per_cpu_msr_sum);
> +}
>
>  void turbostat_loop()
>  {
> @@ -5735,6 +5983,7 @@ void cmdline(int argc, char **argv)
> {"hide",required_argument,  0, 'H'},// 
> meh, -h taken by --help
>     {"Joules",  no_argument,0, 'J'},
> {"list",no_argument,0, 'l'},
> +   {"Longtime",no_argument,0, 'L'},
> {"out", required_argument,  0, 'o'},
> {"quiet",   no_argument,0, 'q'},
> {"show",required_argument,  0, 's'},
> @@ -5746,7 +5995,7 @@ void cmdline(int argc, char **argv)
>
> progname = argv[0];
>
> -   while ((opt = getopt_long_only(argc, argv, "+C:c:Dde:hi:Jn:o:qST:v",
> +   while ((opt = getopt_long_only(argc, argv, "+C:c:Dde:hi:JLn:o:qST:v",
> long_options, &option_index)) != -1) {
> switch (opt) {
> case 'a':
> @@ -5800,6 +6049,9 @@ void cmdline(int argc, char **argv)
> list_header_only++;
> quiet++;
> break;
> +   case 'L':
> +   longtime = 1;
> +   break;
> case 'o':
> outf = fopen_or_die(optarg, "w");
> break;
> @@ -5864,6 +6116,8 @@ int main(int argc, char **argv)
> return 0;
> }
>
> +   if (longtime)
> +   msr_longtime_record();
> /*
>  * if any params left, it must be a command to fork
>  */
> --
> 2.17.1
>
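
For reference, a hedged standalone sketch of the accumulation scheme in
the quoted patch: the RAPL status MSRs behave as 32-bit counters that
can wrap in under a minute at high power, so a periodic timer (50 secs
above) folds each modular 32-bit delta into a 64-bit running sum:

#include <stdint.h>

struct msr_acc {
	uint64_t sum;	/* accumulated 64-bit total */
	uint32_t last;	/* last raw 32-bit MSR reading */
};

/* Call at least once per counter wrap period. */
static void msr_acc_update(struct msr_acc *a, uint32_t raw)
{
	a->sum += (uint32_t)(raw - a->last);	/* modular delta handles wrap */
	a->last = raw;
}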


-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH] tools/power turbostat: call pread64 in kernel directly

2020-08-13 Thread Len Brown
Huh?

On Fri, Jul 17, 2020 at 2:09 AM Liwei Song  wrote:
>
> with a 32-bit rootfs, the offset may be out of range when set
> to 0xc0010299; define it as "unsigned long long" type and
> call pread64 directly in the kernel.
>
> Signed-off-by: Liwei Song 
> ---
>  tools/power/x86/turbostat/turbostat.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/tools/power/x86/turbostat/turbostat.c 
> b/tools/power/x86/turbostat/turbostat.c
> index 33b370865d16..4c5cdfcb5721 100644
> --- a/tools/power/x86/turbostat/turbostat.c
> +++ b/tools/power/x86/turbostat/turbostat.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
> +#include <sys/syscall.h>
> +#include 
>
>  char *proc_stat = "/proc/stat";
>  FILE *outf;
> @@ -381,11 +382,11 @@ int get_msr_fd(int cpu)
> return fd;
>  }
>
> -int get_msr(int cpu, off_t offset, unsigned long long *msr)
> +int get_msr(int cpu, unsigned long long offset, unsigned long long *msr)
>  {
> ssize_t retval;
>
> -   retval = pread(get_msr_fd(cpu), msr, sizeof(*msr), offset);
> +   retval = syscall(SYS_pread64, get_msr_fd(cpu), msr, sizeof(*msr), 
> offset);
>
> if (retval != sizeof *msr)
> err(-1, "cpu%d: msr offset 0x%llx read failed", cpu, 
> (unsigned long long)offset);
> --
> 2.17.1
>


-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH] tools/power/x86/turbostat: Always print idle in the system configuration header

2020-08-13 Thread Len Brown
Applied.

thanks!
-Len

On Thu, Mar 26, 2020 at 4:36 PM Doug Smythies  wrote:
>
> If the --quiet option is not used, turbostat prints a useful system
> configuration header during startup. Inclusion of idle system configuration
> information is a function of inclusion in the columns chosen to be displayed.
>
> Always list the idle system configuration.
>
> Signed-off-by: Doug Smythies 
> ---
>  tools/power/x86/turbostat/turbostat.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/tools/power/x86/turbostat/turbostat.c 
> b/tools/power/x86/turbostat/turbostat.c
> index 33b370865d16..834b86676d00 100644
> --- a/tools/power/x86/turbostat/turbostat.c
> +++ b/tools/power/x86/turbostat/turbostat.c
> @@ -3530,9 +3530,6 @@ dump_sysfs_cstate_config(void)
> int state;
> char *sp;
>
> -   if (!DO_BIC(BIC_sysfs))
> -   return;
> -
> if (access("/sys/devices/system/cpu/cpuidle", R_OK)) {
> fprintf(outf, "cpuidle not loaded\n");
> return;
> --
> 2.25.1
>


-- 
Len Brown, Intel Open Source Technology Center


Re: [PATCH] tools/power turbostat: Make interval calculation per thread to reduce jitter

2019-08-31 Thread Len Brown
Yeah, I like this patch, and I have applied it.

thanks!
-Len

On Fri, Jun 7, 2019 at 12:28 PM Ghannam, Yazen  wrote:
>
> > -Original Message-
> > From: Ghannam, Yazen 
> > Sent: Tuesday, April 23, 2019 12:53 PM
> > To: Ghannam, Yazen ; linux...@vger.kernel.org; 
> > len.br...@intel.com
> > Cc: linux-kernel@vger.kernel.org; Len Brown 
> > Subject: RE: [PATCH] tools/power turbostat: Make interval calculation per 
> > thread to reduce jitter
> >
> > > -Original Message-
> > > From: linux-kernel-ow...@vger.kernel.org 
> > >  On Behalf Of Ghannam, Yazen
> > > Sent: Monday, March 25, 2019 12:33 PM
> > > To: linux...@vger.kernel.org
> > > Cc: Ghannam, Yazen ; linux-kernel@vger.kernel.org; 
> > > l...@kernel.org
> > > Subject: [PATCH] tools/power turbostat: Make interval calculation per 
> > > thread to reduce jitter
> > >
> > > From: Yazen Ghannam 
> > >
> > > Turbostat currently normalizes TSC and other values by dividing by an
> > > interval. This interval is the delta between the start of one global
> > > (all counters on all CPUs) sampling and the start of another. However,
> > > this introduces a lot of jitter into the data.
> > >
> > > In order to reduce jitter, the interval calculation should be based on
> > > timestamps taken per thread and close to the start of the thread's
> > > sampling.
> > >
> > > Define a per thread time value to hold the delta between samples taken
> > > on the thread.
> > >
> > > Use the timestamp taken at the beginning of sampling to calculate the
> > > delta.
> > >
> > > Move the thread's beginning timestamp to after the CPU migration to
> > > avoid jitter due to the migration.
> > >
> > > Use the global time delta for the average time delta.
> > >
> > > Signed-off-by: Yazen Ghannam 
> > > ---
> >
> > Hi Len,
> >
> > Any comments on this patch?
> >
>
> Hi Len,
>
> Just wanted to check in. Do you have any comments on this patch?
>
> Thanks,
> Yazen



-- 
Len Brown, Intel Open Source Technology Center


[tip:x86/topology] hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages

2019-05-23 Thread tip-bot for Len Brown
Commit-ID:  835896a59b9577d0bc2131e027c37bdde5b979af
Gitweb: https://git.kernel.org/tip/835896a59b9577d0bc2131e027c37bdde5b979af
Author: Len Brown 
AuthorDate: Mon, 13 May 2019 13:59:01 -0400
Committer:  Thomas Gleixner 
CommitDate: Thu, 23 May 2019 10:08:36 +0200

hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages

Syntax update only -- no logical or functional change.

In response to the new multi-die/package changes, update variable names to
use the more generic thermal "zone" terminology, instead of "package", as
the zones can refer to either packages or die.

Signed-off-by: Len Brown 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Cc: Zhang Rui 
Link: 
https://lkml.kernel.org/r/facecfd3525d55c2051f63a7ec709aeb03cc1dc1.1557769318.git.len.br...@intel.com

---
 drivers/hwmon/coretemp.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c
index c64ce32d3214..4ebed90d81aa 100644
--- a/drivers/hwmon/coretemp.c
+++ b/drivers/hwmon/coretemp.c
@@ -109,10 +109,10 @@ struct platform_data {
struct device_attribute name_attr;
 };
 
-/* Keep track of how many package pointers we allocated in init() */
-static int max_packages __read_mostly;
-/* Array of package pointers. Serialized by cpu hotplug lock */
-static struct platform_device **pkg_devices;
+/* Keep track of how many zone pointers we allocated in init() */
+static int max_zones __read_mostly;
+/* Array of zone pointers. Serialized by cpu hotplug lock */
+static struct platform_device **zone_devices;
 
 static ssize_t show_label(struct device *dev,
struct device_attribute *devattr, char *buf)
@@ -435,10 +435,10 @@ static int chk_ucode_version(unsigned int cpu)
 
 static struct platform_device *coretemp_get_pdev(unsigned int cpu)
 {
-   int pkgid = topology_logical_die_id(cpu);
+   int id = topology_logical_die_id(cpu);
 
-   if (pkgid >= 0 && pkgid < max_packages)
-   return pkg_devices[pkgid];
+   if (id >= 0 && id < max_zones)
+   return zone_devices[id];
return NULL;
 }
 
@@ -544,7 +544,7 @@ static int coretemp_probe(struct platform_device *pdev)
struct device *dev = &pdev->dev;
struct platform_data *pdata;
 
-   /* Initialize the per-package data structures */
+   /* Initialize the per-zone data structures */
pdata = devm_kzalloc(dev, sizeof(struct platform_data), GFP_KERNEL);
if (!pdata)
return -ENOMEM;
@@ -579,13 +579,13 @@ static struct platform_driver coretemp_driver = {
 
 static struct platform_device *coretemp_device_add(unsigned int cpu)
 {
-   int err, pkgid = topology_logical_die_id(cpu);
+   int err, zoneid = topology_logical_die_id(cpu);
struct platform_device *pdev;
 
-   if (pkgid < 0)
+   if (zoneid < 0)
return ERR_PTR(-ENOMEM);
 
-   pdev = platform_device_alloc(DRVNAME, pkgid);
+   pdev = platform_device_alloc(DRVNAME, zoneid);
if (!pdev)
return ERR_PTR(-ENOMEM);
 
@@ -595,7 +595,7 @@ static struct platform_device *coretemp_device_add(unsigned 
int cpu)
return ERR_PTR(err);
}
 
-   pkg_devices[pkgid] = pdev;
+   zone_devices[zoneid] = pdev;
return pdev;
 }
 
@@ -703,7 +703,7 @@ static int coretemp_cpu_offline(unsigned int cpu)
 * the rest.
 */
if (cpumask_empty(&pd->cpumask)) {
-   pkg_devices[topology_logical_die_id(cpu)] = NULL;
+   zone_devices[topology_logical_die_id(cpu)] = NULL;
platform_device_unregister(pdev);
return 0;
}
@@ -741,10 +741,10 @@ static int __init coretemp_init(void)
if (!x86_match_cpu(coretemp_ids))
return -ENODEV;
 
-   max_packages = topology_max_packages() * topology_max_die_per_package();
-   pkg_devices = kcalloc(max_packages, sizeof(struct platform_device *),
+   max_zones = topology_max_packages() * topology_max_die_per_package();
+   zone_devices = kcalloc(max_zones, sizeof(struct platform_device *),
  GFP_KERNEL);
-   if (!pkg_devices)
+   if (!zone_devices)
return -ENOMEM;
 
err = platform_driver_register(&coretemp_driver);
@@ -760,7 +760,7 @@ static int __init coretemp_init(void)
 
 outdrv:
platform_driver_unregister(&coretemp_driver);
-   kfree(pkg_devices);
+   kfree(zone_devices);
return err;
 }
 module_init(coretemp_init)
@@ -769,7 +769,7 @@ static void __exit coretemp_exit(void)
 {
cpuhp_remove_state(coretemp_hp_online);
platform_driver_unregister(&coretemp_driver);
-   kfree(pkg_devices);
+   kfree(zone_devices);
 }
 module_exit(coretemp_exit)
 


[tip:x86/topology] thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to zones from packages

2019-05-23 Thread tip-bot for Len Brown
Commit-ID:  b2ce1c883df91a231f8138935167273c1767ad66
Gitweb: https://git.kernel.org/tip/b2ce1c883df91a231f8138935167273c1767ad66
Author: Len Brown 
AuthorDate: Mon, 13 May 2019 13:59:00 -0400
Committer:  Thomas Gleixner 
CommitDate: Thu, 23 May 2019 10:08:36 +0200

thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to zones from 
packages

Syntax update only -- no logical or functional change.

In response to the new multi-die/package changes, update variable names to
use the more generic thermal "zone" terminology, instead of "package", as
the zones can refer to either packages or die.

Signed-off-by: Len Brown 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Cc: Zhang Rui 
Link: 
https://lkml.kernel.org/r/b65494a76be13481dc3a809c75debb2574c34eda.1557769318.git.len.br...@intel.com

---
 drivers/thermal/intel/x86_pkg_temp_thermal.c | 142 ++-
 1 file changed, 72 insertions(+), 70 deletions(-)

diff --git a/drivers/thermal/intel/x86_pkg_temp_thermal.c 
b/drivers/thermal/intel/x86_pkg_temp_thermal.c
index 405b3858900a..87e929ffb0cb 100644
--- a/drivers/thermal/intel/x86_pkg_temp_thermal.c
+++ b/drivers/thermal/intel/x86_pkg_temp_thermal.c
@@ -55,7 +55,7 @@ MODULE_PARM_DESC(notify_delay_ms,
 */
#define MAX_NUMBER_OF_TRIPS	2
 
-struct pkg_device {
+struct zone_device {
int cpu;
bool	work_scheduled;
u32 tj_max;
@@ -70,10 +70,10 @@ static struct thermal_zone_params pkg_temp_tz_params = {
.no_hwmon   = true,
 };
 
-/* Keep track of how many package pointers we allocated in init() */
-static int max_packages __read_mostly;
-/* Array of package pointers */
-static struct pkg_device **packages;
+/* Keep track of how many zone pointers we allocated in init() */
+static int max_id __read_mostly;
+/* Array of zone pointers */
+static struct zone_device **zones;
 /* Serializes interrupt notification, work and hotplug */
 static DEFINE_SPINLOCK(pkg_temp_lock);
 /* Protects zone operation in the work function against hotplug removal */
@@ -120,12 +120,12 @@ err_out:
  *
  * - Other callsites: Must hold pkg_temp_lock
  */
-static struct pkg_device *pkg_temp_thermal_get_dev(unsigned int cpu)
+static struct zone_device *pkg_temp_thermal_get_dev(unsigned int cpu)
 {
-   int pkgid = topology_logical_die_id(cpu);
+   int id = topology_logical_die_id(cpu);
 
-   if (pkgid >= 0 && pkgid < max_packages)
-   return packages[pkgid];
+   if (id >= 0 && id < max_id)
+   return zones[id];
return NULL;
 }
 
@@ -150,12 +150,13 @@ static int get_tj_max(int cpu, u32 *tj_max)
 
 static int sys_get_curr_temp(struct thermal_zone_device *tzd, int *temp)
 {
-   struct pkg_device *pkgdev = tzd->devdata;
+   struct zone_device *zonedev = tzd->devdata;
u32 eax, edx;
 
-   rdmsr_on_cpu(pkgdev->cpu, MSR_IA32_PACKAGE_THERM_STATUS, &eax, &edx);
+   rdmsr_on_cpu(zonedev->cpu, MSR_IA32_PACKAGE_THERM_STATUS,
+   &eax, &edx);
if (eax & 0x80000000) {
-   *temp = pkgdev->tj_max - ((eax >> 16) & 0x7f) * 1000;
+   *temp = zonedev->tj_max - ((eax >> 16) & 0x7f) * 1000;
pr_debug("sys_get_curr_temp %d\n", *temp);
return 0;
}
@@ -165,7 +166,7 @@ static int sys_get_curr_temp(struct thermal_zone_device 
*tzd, int *temp)
 static int sys_get_trip_temp(struct thermal_zone_device *tzd,
 int trip, int *temp)
 {
-   struct pkg_device *pkgdev = tzd->devdata;
+   struct zone_device *zonedev = tzd->devdata;
unsigned long thres_reg_value;
u32 mask, shift, eax, edx;
int ret;
@@ -181,14 +182,14 @@ static int sys_get_trip_temp(struct thermal_zone_device 
*tzd,
shift = THERM_SHIFT_THRESHOLD0;
}
 
-   ret = rdmsr_on_cpu(pkgdev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT,
+   ret = rdmsr_on_cpu(zonedev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT,
   &eax, &edx);
if (ret < 0)
return ret;
 
thres_reg_value = (eax & mask) >> shift;
if (thres_reg_value)
-   *temp = pkgdev->tj_max - thres_reg_value * 1000;
+   *temp = zonedev->tj_max - thres_reg_value * 1000;
else
*temp = 0;
pr_debug("sys_get_trip_temp %d\n", *temp);
@@ -199,14 +200,14 @@ static int sys_get_trip_temp(struct thermal_zone_device 
*tzd,
 static int
 sys_set_trip_temp(struct thermal_zone_device *tzd, int trip, int temp)
 {
-   struct pkg_device *pkgdev = tzd->devdata;
+   struct zone_device *zonedev = tzd->devdata;
u32 l, h, mask, shift, intr;
int ret;
 
-   if (trip >= MAX_NUMBER_OF_TRIPS || temp >

[tip:x86/topology] topology: Create core_cpus and die_cpus sysfs attributes

2019-05-23 Thread tip-bot for Len Brown
Commit-ID:  2e4c54dac7b360c3820399bdf06cde9134a4495b
Gitweb: https://git.kernel.org/tip/2e4c54dac7b360c3820399bdf06cde9134a4495b
Author: Len Brown 
AuthorDate: Mon, 13 May 2019 13:58:56 -0400
Committer:  Thomas Gleixner 
CommitDate: Thu, 23 May 2019 10:08:34 +0200

topology: Create core_cpus and die_cpus sysfs attributes

Create CPU topology sysfs attributes: "core_cpus" and "core_cpus_list"

These attributes represent all of the logical CPUs that share the
same core.

These attributes are synonymous with the existing "thread_siblings" and
"thread_siblings_list" attributes, which will be deprecated.

Create CPU topology sysfs attributes: "die_cpus" and "die_cpus_list".
These attributes represent all of the logical CPUs that share the
same die.
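
A minimal user-space sketch of consuming one of the new attributes (path per
the documentation change below; error handling kept to a bare minimum):

#include <stdio.h>

int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/die_cpus_list", "r");

	if (!f)
		return 1;	/* attribute absent on older kernels */
	if (fgets(buf, sizeof(buf), f))
		printf("cpu0 die_cpus_list: %s", buf);
	fclose(f);
	return 0;
}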

Suggested-by: Brice Goglin 
Signed-off-by: Len Brown 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Link: 
https://lkml.kernel.org/r/071c23a298cd27ede6ed0b6460cae190d193364f.1557769318.git.len.br...@intel.com

---
 Documentation/cputopology.txt   | 21 +++--
 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 22 ++
 arch/x86/xen/smp_pv.c   |  1 +
 drivers/base/topology.c | 12 
 include/linux/topology.h|  3 +++
 7 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index 48af5c290e20..b90dafcc8237 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -36,15 +36,15 @@ drawer_id:
identifier (rather than the kernel's).  The actual value is
architecture and platform dependent.
 
-thread_siblings:
+core_cpus:
 
-   internal kernel map of cpuX's hardware threads within the same
-   core as cpuX.
+   internal kernel map of CPUs within the same core.
+   (deprecated name: "thread_siblings")
 
-thread_siblings_list:
+core_cpus_list:
 
-   human-readable list of cpuX's hardware threads within the same
-   core as cpuX.
+   human-readable list of CPUs within the same core.
+   (deprecated name: "thread_siblings_list");
 
 package_cpus:
 
@@ -56,6 +56,14 @@ package_cpus_list:
human-readable list of CPUs sharing the same physical_package_id.
(deprecated name: "core_siblings_list")
 
+die_cpus:
+
+   internal kernel map of CPUs within the same die.
+
+die_cpus_list:
+
+   human-readable list of CPUs within the same die.
+
 book_siblings:
 
internal kernel map of cpuX's hardware threads within the same
@@ -93,6 +101,7 @@ these macros in include/asm-XXX/topology.h::
#define topology_drawer_id(cpu)
#define topology_sibling_cpumask(cpu)
#define topology_core_cpumask(cpu)
+   #define topology_die_cpumask(cpu)
#define topology_book_cpumask(cpu)
#define topology_drawer_cpumask(cpu)
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df207b2..b673a226ad6c 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -23,6 +23,7 @@ extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9de16b4f6023..4b14d2318251 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #define topology_core_id(cpu)  (cpu_data(cpu).cpu_core_id)
 
 #ifdef CONFIG_SMP
+#define topology_die_cpumask(cpu)  (per_cpu(cpu_die_map, cpu))
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index a6e01b6c2709..1a19a5171949 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -91,6 +91,10 @@ EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 
+/* representing HT, core, and die siblings of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
+EXPORT_PER_CPU_SYMBOL(cpu_die_map);
+
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 
 /* Per CPU bogomips and other parameters */
@@ -509,6 +513,15 @@ static bool match_pkg(struct cpuinfo_x86 *c, struct 
cpuinfo_x86 *o)
return false;
 }
 
+static bool match_die(struct cpuinfo_x86 *c, struct

[tip:x86/topology] topology: Create package_cpus sysfs attribute

2019-05-23 Thread tip-bot for Len Brown
Commit-ID:  b73ed8dc0597c11ec5064d06b9bbd4e541b6d4e7
Gitweb: https://git.kernel.org/tip/b73ed8dc0597c11ec5064d06b9bbd4e541b6d4e7
Author: Len Brown 
AuthorDate: Mon, 13 May 2019 13:58:55 -0400
Committer:  Thomas Gleixner 
CommitDate: Thu, 23 May 2019 10:08:34 +0200

topology: Create package_cpus sysfs attribute

The existing sysfs cpu/topology/core_siblings (and core_siblings_list)
attributes are documented, implemented, and used by programs to represent
set of logical CPUs sharing the same package.

This makes sense if the next topology level above a core is always a
package.  But on systems where there is a die topology level between a core
and a package, the name and its definition become inconsistent.

So without changing its function, add a name for this map that describes
what it actually is -- package CPUs -- the set of CPUs that share the same
package.

This new name will be immune to changes in topology, since it describes
threads at the current level, not siblings at a contained level.

Suggested-by: Brice Goglin 
Signed-off-by: Len Brown 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Link: 
https://lkml.kernel.org/r/d9d3228b82fb5665e6f93a0ccd033fe022558521.1557769318.git.len.br...@intel.com

---
 Documentation/cputopology.txt | 12 ++--
 drivers/base/topology.c   |  6 ++
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index 2ff8a1e9a2db..48af5c290e20 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -46,15 +46,15 @@ thread_siblings_list:
human-readable list of cpuX's hardware threads within the same
core as cpuX.
 
-core_siblings:
+package_cpus:
 
-   internal kernel map of cpuX's hardware threads within the same
-   physical_package_id.
+   internal kernel map of the CPUs sharing the same physical_package_id.
+   (deprecated name: "core_siblings")
 
-core_siblings_list:
+package_cpus_list:
 
-   human-readable list of cpuX's hardware threads within the same
-   physical_package_id.
+   human-readable list of CPUs sharing the same physical_package_id.
+   (deprecated name: "core_siblings_list")
 
 book_siblings:
 
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 50352cf96f85..dc3c19b482f3 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -57,6 +57,10 @@ define_siblings_show_func(core_siblings, core_cpumask);
 static DEVICE_ATTR_RO(core_siblings);
 static DEVICE_ATTR_RO(core_siblings_list);
 
+define_siblings_show_func(package_cpus, core_cpumask);
+static DEVICE_ATTR_RO(package_cpus);
+static DEVICE_ATTR_RO(package_cpus_list);
+
 #ifdef CONFIG_SCHED_BOOK
 define_id_show_func(book_id);
 static DEVICE_ATTR_RO(book_id);
@@ -81,6 +85,8 @@ static struct attribute *default_attrs[] = {
&dev_attr_thread_siblings_list.attr,
&dev_attr_core_siblings.attr,
&dev_attr_core_siblings_list.attr,
+   &dev_attr_package_cpus.attr,
+   &dev_attr_package_cpus_list.attr,
#ifdef CONFIG_SCHED_BOOK
&dev_attr_book_id.attr,
&dev_attr_book_siblings.attr,


[tip:x86/topology] x86/topology: Define topology_logical_die_id()

2019-05-23 Thread tip-bot for Len Brown
Commit-ID:  212bf4fdb7f9eeeb99afd97ebad677d43e7b55ac
Gitweb: https://git.kernel.org/tip/212bf4fdb7f9eeeb99afd97ebad677d43e7b55ac
Author: Len Brown 
AuthorDate: Mon, 13 May 2019 13:58:49 -0400
Committer:  Thomas Gleixner 
CommitDate: Thu, 23 May 2019 10:08:32 +0200

x86/topology: Define topology_logical_die_id()

Define topology_logical_die_id() ala existing topology_logical_package_id()

Signed-off-by: Len Brown 
Signed-off-by: Thomas Gleixner 
Tested-by: Zhang Rui 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Link: 
https://lkml.kernel.org/r/2f3526e25ae14fbeff26fb26e877d159df8946d9.1557769318.git.len.br...@intel.com

---
 arch/x86/include/asm/processor.h |  1 +
 arch/x86/include/asm/topology.h  |  5 +
 arch/x86/kernel/cpu/common.c |  1 +
 arch/x86/kernel/smpboot.c| 45 
 4 files changed, 52 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7c17343946dd..6aba36bde57f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -118,6 +118,7 @@ struct cpuinfo_x86 {
/* Core id: */
u16 cpu_core_id;
u16 cpu_die_id;
+   u16 logical_die_id;
/* Index into per_cpu list: */
u16 cpu_index;
u32 microcode;
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 3777dbe9c0ff..9de16b4f6023 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)   (cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)  (cpu_data(cpu).phys_proc_id)
+#define topology_logical_die_id(cpu)   (cpu_data(cpu).logical_die_id)
 #define topology_die_id(cpu)   (cpu_data(cpu).cpu_die_id)
 #define topology_core_id(cpu)  (cpu_data(cpu).cpu_core_id)
 
@@ -131,13 +132,17 @@ static inline int topology_max_smt_threads(void)
 }
 
 int topology_update_package_map(unsigned int apicid, unsigned int cpu);
+int topology_update_die_map(unsigned int dieid, unsigned int cpu);
 int topology_phys_to_logical_pkg(unsigned int pkg);
+int topology_phys_to_logical_die(unsigned int die, unsigned int cpu);
 bool topology_is_primary_thread(unsigned int cpu);
 bool topology_smt_supported(void);
 #else
 #define topology_max_packages()(1)
 static inline int
 topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; 
}
+static inline int
+topology_update_die_map(unsigned int dieid, unsigned int cpu) { return 0; }
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
 static inline int topology_phys_to_logical_die(unsigned int die,
unsigned int cpu) { return 0; }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d7f55ad2dfb1..8cdca1223b0f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1298,6 +1298,7 @@ static void validate_apic_and_package_id(struct 
cpuinfo_x86 *c)
   cpu, apicid, c->initial_apicid);
}
BUG_ON(topology_update_package_map(c->phys_proc_id, cpu));
+   BUG_ON(topology_update_die_map(c->cpu_die_id, cpu));
 #else
c->logical_proc_id = 0;
 #endif
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 40ffe23249c0..a6e01b6c2709 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,7 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
 static unsigned int logical_packages __read_mostly;
+static unsigned int logical_die __read_mostly;
 
 /* Maximum number of SMT threads on any online core */
 int __read_mostly __max_smt_threads = 1;
@@ -302,6 +303,26 @@ int topology_phys_to_logical_pkg(unsigned int phys_pkg)
return -1;
 }
 EXPORT_SYMBOL(topology_phys_to_logical_pkg);
+/**
+ * topology_phys_to_logical_die - Map a physical die id to logical
+ *
+ * Returns logical die id or -1 if not found
+ */
+int topology_phys_to_logical_die(unsigned int die_id, unsigned int cur_cpu)
+{
+   int cpu;
+   int proc_id = cpu_data(cur_cpu).phys_proc_id;
+
+   for_each_possible_cpu(cpu) {
struct cpuinfo_x86 *c = &cpu_data(cpu);
+
+   if (c->initialized && c->cpu_die_id == die_id &&
+   c->phys_proc_id == proc_id)
+   return c->logical_die_id;
+   }
+   return -1;
+}
+EXPORT_SYMBOL(topology_phys_to_logical_die);
 
 /**
  * topology_update_package_map - Update the physical to logical package map
@@ -326,6 +347,29 @@ found:
cpu_data(cpu).logical_proc_id = new;
return 0;
 }
+/**
+ * topol

[tip:x86/topology] x86/topology: Define topology_die_id()

2019-05-23 Thread tip-bot for Len Brown
Commit-ID:  306a0de329f77537f29022c2982006f9145d782d
Gitweb: https://git.kernel.org/tip/306a0de329f77537f29022c2982006f9145d782d
Author: Len Brown 
AuthorDate: Mon, 13 May 2019 13:58:48 -0400
Committer:  Thomas Gleixner 
CommitDate: Thu, 23 May 2019 10:08:31 +0200

x86/topology: Define topology_die_id()

topology_die_id(cpu) is a simple macro for use inside the kernel to get the
die_id associated with the given cpu.

Signed-off-by: Len Brown 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Link: 
https://lkml.kernel.org/r/6463bc422b1b05445a502dc505c1d7c6756bda6a.1557769318.git.len.br...@intel.com

---
 arch/x86/include/asm/topology.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index e0232f7042c3..3777dbe9c0ff 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)   (cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)  (cpu_data(cpu).phys_proc_id)
+#define topology_die_id(cpu)   (cpu_data(cpu).cpu_die_id)
 #define topology_core_id(cpu)  (cpu_data(cpu).cpu_core_id)
 
 #ifdef CONFIG_SMP


[tip:x86/topology] cpu/topology: Export die_id

2019-05-23 Thread tip-bot for Len Brown
Commit-ID:  0e344d8c709fe01d882fc0fb5452bedfe5eba67a
Gitweb: https://git.kernel.org/tip/0e344d8c709fe01d882fc0fb5452bedfe5eba67a
Author: Len Brown 
AuthorDate: Mon, 13 May 2019 13:58:47 -0400
Committer:  Thomas Gleixner 
CommitDate: Thu, 23 May 2019 10:08:31 +0200

cpu/topology: Export die_id

Export die_id in cpu topology, for the benefit of hardware that has
multiple-die/package.
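
A user-space sketch of reading the newly exported attribute for each CPU
(the demo loop bound is arbitrary; real code would consult the "possible"
CPU mask instead):

#include <stdio.h>

int main(void)
{
	char path[64], buf[16];
	int cpu;

	for (cpu = 0; cpu < 64; cpu++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/die_id", cpu);
		f = fopen(path, "r");
		if (!f)
			break;		/* no such CPU, or it is offline */
		if (fgets(buf, sizeof(buf), f))
			printf("cpu%d die_id %s", cpu, buf);
		fclose(f);
	}
	return 0;
}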

Signed-off-by: Len Brown 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Cc: linux-...@vger.kernel.org
Link: 
https://lkml.kernel.org/r/e7d1caaf4fbd24ee40db6d557ab28d7d83298900.1557769318.git.len.br...@intel.com

---
 Documentation/cputopology.txt | 15 ---
 drivers/base/topology.c   |  4 
 include/linux/topology.h  |  3 +++
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index cb61277e2308..2ff8a1e9a2db 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -12,6 +12,12 @@ physical_package_id:
socket number, but the actual value is architecture and platform
dependent.
 
+die_id:
+
+   the CPU die ID of cpuX. Typically it is the hardware platform's
+   identifier (rather than the kernel's).  The actual value is
+   architecture and platform dependent.
+
 core_id:
 
the CPU core ID of cpuX. Typically it is the hardware platform's
@@ -81,6 +87,7 @@ For an architecture to support this feature, it must define 
some of
 these macros in include/asm-XXX/topology.h::
 
#define topology_physical_package_id(cpu)
+   #define topology_die_id(cpu)
#define topology_core_id(cpu)
#define topology_book_id(cpu)
#define topology_drawer_id(cpu)
@@ -99,9 +106,11 @@ provides default definitions for any of the above macros 
that are
 not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
-2) topology_core_id: 0
-3) topology_sibling_cpumask: just the given CPU
-4) topology_core_cpumask: just the given CPU
+2) topology_die_id: -1
+3) topology_core_id: 0
+4) topology_sibling_cpumask: just the given CPU
+5) topology_core_cpumask: just the given CPU
+6) topology_die_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 5fd9f167ecc1..50352cf96f85 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -43,6 +43,9 @@ static ssize_t name##_list_show(struct device *dev,   
\
 define_id_show_func(physical_package_id);
 static DEVICE_ATTR_RO(physical_package_id);
 
+define_id_show_func(die_id);
+static DEVICE_ATTR_RO(die_id);
+
 define_id_show_func(core_id);
 static DEVICE_ATTR_RO(core_id);
 
@@ -72,6 +75,7 @@ static DEVICE_ATTR_RO(drawer_siblings_list);
 
 static struct attribute *default_attrs[] = {
&dev_attr_physical_package_id.attr,
+   &dev_attr_die_id.attr,
&dev_attr_core_id.attr,
&dev_attr_thread_siblings.attr,
&dev_attr_thread_siblings_list.attr,
diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e1ee4b..5cc8595dd0e4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -184,6 +184,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_physical_package_id
 #define topology_physical_package_id(cpu)  ((void)(cpu), -1)
 #endif
+#ifndef topology_die_id
+#define topology_die_id(cpu)   ((void)(cpu), -1)
+#endif
 #ifndef topology_core_id
 #define topology_core_id(cpu)  ((void)(cpu), 0)
 #endif


[tip:x86/topology] x86/topology: Create topology_max_die_per_package()

2019-05-23 Thread tip-bot for Len Brown
Commit-ID:  14d96d6c06b5d8116b8d52c9c5530f5528ef1e61
Gitweb: https://git.kernel.org/tip/14d96d6c06b5d8116b8d52c9c5530f5528ef1e61
Author: Len Brown 
AuthorDate: Mon, 13 May 2019 13:58:46 -0400
Committer:  Thomas Gleixner 
CommitDate: Thu, 23 May 2019 10:08:30 +0200

x86/topology: Create topology_max_die_per_package()

topology_max_packages() is available to size resources to cover all
packages in the system.

But now multi-die/package systems are coming up, and some resources are
per-die.

Create topology_max_die_per_package(), for detecting multi-die/package
systems, and sizing any per-die resources.

Signed-off-by: Len Brown 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Link: 
https://lkml.kernel.org/r/e6eaf384571ae52ac7d0ca41510b7fb7d2fda0e4.1557769318.git.len.br...@intel.com

---
 arch/x86/include/asm/processor.h |  1 -
 arch/x86/include/asm/topology.h  | 10 ++
 arch/x86/kernel/cpu/topology.c   |  5 -
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index ef0a44fccaec..7c17343946dd 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -106,7 +106,6 @@ struct cpuinfo_x86 {
unsigned long   loops_per_jiffy;
/* cpuid returned max cores value: */
u16 x86_max_cores;
-   u16 x86_max_dies;
u16 apicid;
u16 initial_apicid;
u16 x86_clflush_size;
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38a1c33..e0232f7042c3 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -115,6 +115,13 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()(__max_logical_packages)
 
+extern unsigned int __max_die_per_package;
+
+static inline int topology_max_die_per_package(void)
+{
+   return __max_die_per_package;
+}
+
 extern int __max_smt_threads;
 
 static inline int topology_max_smt_threads(void)
@@ -131,6 +138,9 @@ bool topology_smt_supported(void);
 static inline int
 topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; 
}
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
+static inline int topology_phys_to_logical_die(unsigned int die,
+   unsigned int cpu) { return 0; }
+static inline int topology_max_die_per_package(void) { return 1; }
 static inline int topology_max_smt_threads(void) { return 1; }
 static inline bool topology_is_primary_thread(unsigned int cpu) { return true; 
}
 static inline bool topology_smt_supported(void) { return false; }
diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index 4d17e699657d..ee48c3fc8a65 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -26,6 +26,9 @@
#define LEVEL_MAX_SIBLINGS(ebx)	((ebx) & 0xffff)
 
 #ifdef CONFIG_SMP
+unsigned int __max_die_per_package __read_mostly = 1;
+EXPORT_SYMBOL(__max_die_per_package);
+
 /*
 * Check if given CPUID extended topology "leaf" is implemented
  */
@@ -146,7 +149,7 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
 
c->x86_max_cores = (core_level_siblings / smp_num_siblings);
-   c->x86_max_dies = (die_level_siblings / core_level_siblings);
+   __max_die_per_package = (die_level_siblings / core_level_siblings);
 #endif
return 0;
 }


[tip:x86/topology] x86/topology: Add CPUID.1F multi-die/package support

2019-05-23 Thread tip-bot for Len Brown
Commit-ID:  7745f03eb39587dd15a1fb26e6223678b8e906d2
Gitweb: https://git.kernel.org/tip/7745f03eb39587dd15a1fb26e6223678b8e906d2
Author: Len Brown 
AuthorDate: Mon, 13 May 2019 13:58:45 -0400
Committer:  Thomas Gleixner 
CommitDate: Thu, 23 May 2019 10:08:30 +0200

x86/topology: Add CPUID.1F multi-die/package support

Some new systems have multiple software-visible die within each package.

Update Linux parsing of the Intel CPUID "Extended Topology Leaf" to handle
either CPUID.B, or the new CPUID.1F.

Add cpuinfo_x86.die_id and cpuinfo_x86.max_dies to store the result.

die_id will be non-zero only for multi-die/package systems.
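
A user-space sketch of walking the new leaf, decoding the same fields the
kernel macros below extract (sub-leaf level types per this patch: 1=SMT,
2=Core, 5=Die):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx, subleaf;

	for (subleaf = 0; ; subleaf++) {
		if (!__get_cpuid_count(0x1f, subleaf, &eax, &ebx, &ecx, &edx))
			return 1;		/* CPUID.1F not supported here */
		if (!(ebx & 0xffff))
			break;			/* zero width ends the enumeration */
		printf("subleaf %u: level type %u, x2APIC shift %u, CPUs %u\n",
		       subleaf, (ecx >> 8) & 0xff, eax & 0x1f, ebx & 0xffff);
	}
	return 0;
}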

Signed-off-by: Len Brown 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Ingo Molnar 
Acked-by: Peter Zijlstra (Intel) 
Cc: linux-...@vger.kernel.org
Link: 
https://lkml.kernel.org/r/7b23d2d26d717b8e14ba137c94b70943f1ae4b5c.1557769318.git.len.br...@intel.com

---
 Documentation/x86/topology.rst   |  4 ++
 arch/x86/include/asm/processor.h |  4 +-
 arch/x86/kernel/cpu/topology.c   | 85 +++-
 arch/x86/kernel/smpboot.c|  2 +
 4 files changed, 75 insertions(+), 20 deletions(-)

diff --git a/Documentation/x86/topology.rst b/Documentation/x86/topology.rst
index 6e28dbe818ab..8e9704f61017 100644
--- a/Documentation/x86/topology.rst
+++ b/Documentation/x86/topology.rst
@@ -49,6 +49,10 @@ Package-related topology information in the kernel:
 
 The number of cores in a package. This information is retrieved via CPUID.
 
+  - cpuinfo_x86.x86_max_dies:
+
+The number of dies in a package. This information is retrieved via CPUID.
+
   - cpuinfo_x86.phys_proc_id:
 
 The physical ID of the package. This information is retrieved via CPUID
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c34a35c78618..ef0a44fccaec 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -105,7 +105,8 @@ struct cpuinfo_x86 {
int x86_power;
unsigned long   loops_per_jiffy;
/* cpuid returned max cores value: */
-   u16  x86_max_cores;
+   u16 x86_max_cores;
+   u16 x86_max_dies;
u16 apicid;
u16 initial_apicid;
u16 x86_clflush_size;
@@ -117,6 +118,7 @@ struct cpuinfo_x86 {
u16 logical_proc_id;
/* Core id: */
u16 cpu_core_id;
+   u16 cpu_die_id;
/* Index into per_cpu list: */
u16 cpu_index;
u32 microcode;
diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index 8f6c784141d1..4d17e699657d 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -15,33 +15,63 @@
 /* leaf 0xb SMT level */
 #define SMT_LEVEL  0
 
-/* leaf 0xb sub-leaf types */
+/* extended topology sub-leaf types */
 #define INVALID_TYPE   0
 #define SMT_TYPE   1
 #define CORE_TYPE  2
+#define DIE_TYPE   5
 
 #define LEAFB_SUBTYPE(ecx) (((ecx) >> 8) & 0xff)
 #define BITS_SHIFT_NEXT_LEVEL(eax) ((eax) & 0x1f)
#define LEVEL_MAX_SIBLINGS(ebx)	((ebx) & 0xffff)
 
-int detect_extended_topology_early(struct cpuinfo_x86 *c)
-{
 #ifdef CONFIG_SMP
+/*
+ * Check if given CPUID extended toplogy "leaf" is implemented
+ */
+static int check_extended_topology_leaf(int leaf)
+{
unsigned int eax, ebx, ecx, edx;
 
-   if (c->cpuid_level < 0xb)
+   cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
+
+   if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE))
return -1;
 
-   cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
+   return 0;
+}
+/*
+ * Return best CPUID Extended Topology Leaf supported
+ */
+static int detect_extended_topology_leaf(struct cpuinfo_x86 *c)
+{
+   if (c->cpuid_level >= 0x1f) {
+   if (check_extended_topology_leaf(0x1f) == 0)
+   return 0x1f;
+   }
 
-   /*
-* check if the cpuid leaf 0xb is actually implemented.
-*/
-   if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE))
+   if (c->cpuid_level >= 0xb) {
+   if (check_extended_topology_leaf(0xb) == 0)
+   return 0xb;
+   }
+
+   return -1;
+}
+#endif
+
+int detect_extended_topology_early(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_SMP
+   unsigned int eax, ebx, ecx, edx;
+   int leaf;
+
+   leaf = detect_extended_topology_leaf(c);
+   if (leaf < 0)
return -1;
 
set_cpu_cap(c, X86_FEATURE_XTOPOLOGY);
 
+   cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
/*
 * initial apic id, which also represents 32-bit extended x2apic id.
 */
@@ -52,7 +82,7 @@ int detect_extended_topology_early(struct cpuinfo_x

[PATCH 05/19] x86 topology: Define topology_logical_die_id()

2019-05-13 Thread Len Brown
From: Len Brown 

Define topology_logical_die_id() ala
existing topology_logical_package_id()

Tested-by: Zhang Rui 
Signed-off-by: Len Brown 
---
 arch/x86/include/asm/processor.h |  1 +
 arch/x86/include/asm/topology.h  |  5 
 arch/x86/kernel/cpu/common.c |  1 +
 arch/x86/kernel/smpboot.c| 45 
 4 files changed, 52 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 87d42c0c6ccc..603ce3fe578d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -118,6 +118,7 @@ struct cpuinfo_x86 {
/* Core id: */
u16 cpu_core_id;
u16 cpu_die_id;
+   u16 logical_die_id;
/* Index into per_cpu list: */
u16 cpu_index;
u32 microcode;
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 3777dbe9c0ff..9de16b4f6023 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)   (cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)  (cpu_data(cpu).phys_proc_id)
+#define topology_logical_die_id(cpu)   (cpu_data(cpu).logical_die_id)
 #define topology_die_id(cpu)   (cpu_data(cpu).cpu_die_id)
 #define topology_core_id(cpu)  (cpu_data(cpu).cpu_core_id)
 
@@ -131,13 +132,17 @@ static inline int topology_max_smt_threads(void)
 }
 
 int topology_update_package_map(unsigned int apicid, unsigned int cpu);
+int topology_update_die_map(unsigned int dieid, unsigned int cpu);
 int topology_phys_to_logical_pkg(unsigned int pkg);
+int topology_phys_to_logical_die(unsigned int die, unsigned int cpu);
 bool topology_is_primary_thread(unsigned int cpu);
 bool topology_smt_supported(void);
 #else
 #define topology_max_packages()(1)
 static inline int
 topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; 
}
+static inline int
+topology_update_die_map(unsigned int dieid, unsigned int cpu) { return 0; }
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
 static inline int topology_phys_to_logical_die(unsigned int die,
unsigned int cpu) { return 0; }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8739bdfe9bdf..48de7f069ac7 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1277,6 +1277,7 @@ static void validate_apic_and_package_id(struct 
cpuinfo_x86 *c)
   cpu, apicid, c->initial_apicid);
}
BUG_ON(topology_update_package_map(c->phys_proc_id, cpu));
+   BUG_ON(topology_update_die_map(c->cpu_die_id, cpu));
 #else
c->logical_proc_id = 0;
 #endif
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 40ffe23249c0..a6e01b6c2709 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,7 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
 static unsigned int logical_packages __read_mostly;
+static unsigned int logical_die __read_mostly;
 
 /* Maximum number of SMT threads on any online core */
 int __read_mostly __max_smt_threads = 1;
@@ -302,6 +303,26 @@ int topology_phys_to_logical_pkg(unsigned int phys_pkg)
return -1;
 }
 EXPORT_SYMBOL(topology_phys_to_logical_pkg);
+/**
+ * topology_phys_to_logical_die - Map a physical die id to logical
+ *
+ * Returns logical die id or -1 if not found
+ */
+int topology_phys_to_logical_die(unsigned int die_id, unsigned int cur_cpu)
+{
+   int cpu;
+   int proc_id = cpu_data(cur_cpu).phys_proc_id;
+
+   for_each_possible_cpu(cpu) {
struct cpuinfo_x86 *c = &cpu_data(cpu);
+
+   if (c->initialized && c->cpu_die_id == die_id &&
+   c->phys_proc_id == proc_id)
+   return c->logical_die_id;
+   }
+   return -1;
+}
+EXPORT_SYMBOL(topology_phys_to_logical_die);
 
 /**
  * topology_update_package_map - Update the physical to logical package map
@@ -326,6 +347,29 @@ int topology_update_package_map(unsigned int pkg, unsigned 
int cpu)
cpu_data(cpu).logical_proc_id = new;
return 0;
 }
+/**
+ * topology_update_die_map - Update the physical to logical die map
+ * @die:   The die id as retrieved via CPUID
+ * @cpu:   The cpu for which this is updated
+ */
+int topology_update_die_map(unsigned int die, unsigned int cpu)
+{
+   int new;
+
+   /* Already available somewhere? */
+   new = topology_phys_to_logical_die(die, cpu);
+   if (new >= 0)
+   goto found;
+
+   new = logical_die++;
+   if (new != die) {

[PATCH 03/19] cpu topology: Export die_id

2019-05-13 Thread Len Brown
From: Len Brown 

Export die_id in cpu topology, for the benefit of hardware that
has multiple-die/package.

Signed-off-by: Len Brown 
Cc: linux-...@vger.kernel.org
---
 Documentation/cputopology.txt | 15 ---
 drivers/base/topology.c   |  4 
 include/linux/topology.h  |  3 +++
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index cb61277e2308..2ff8a1e9a2db 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -12,6 +12,12 @@ physical_package_id:
socket number, but the actual value is architecture and platform
dependent.
 
+die_id:
+
+   the CPU die ID of cpuX. Typically it is the hardware platform's
+   identifier (rather than the kernel's).  The actual value is
+   architecture and platform dependent.
+
 core_id:
 
the CPU core ID of cpuX. Typically it is the hardware platform's
@@ -81,6 +87,7 @@ For an architecture to support this feature, it must define 
some of
 these macros in include/asm-XXX/topology.h::
 
#define topology_physical_package_id(cpu)
+   #define topology_die_id(cpu)
#define topology_core_id(cpu)
#define topology_book_id(cpu)
#define topology_drawer_id(cpu)
@@ -99,9 +106,11 @@ provides default definitions for any of the above macros 
that are
 not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
-2) topology_core_id: 0
-3) topology_sibling_cpumask: just the given CPU
-4) topology_core_cpumask: just the given CPU
+2) topology_die_id: -1
+3) topology_core_id: 0
+4) topology_sibling_cpumask: just the given CPU
+5) topology_core_cpumask: just the given CPU
+6) topology_die_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 5fd9f167ecc1..50352cf96f85 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -43,6 +43,9 @@ static ssize_t name##_list_show(struct device *dev,   
\
 define_id_show_func(physical_package_id);
 static DEVICE_ATTR_RO(physical_package_id);
 
+define_id_show_func(die_id);
+static DEVICE_ATTR_RO(die_id);
+
 define_id_show_func(core_id);
 static DEVICE_ATTR_RO(core_id);
 
@@ -72,6 +75,7 @@ static DEVICE_ATTR_RO(drawer_siblings_list);
 
 static struct attribute *default_attrs[] = {
&dev_attr_physical_package_id.attr,
+   &dev_attr_die_id.attr,
&dev_attr_core_id.attr,
&dev_attr_thread_siblings.attr,
&dev_attr_thread_siblings_list.attr,
diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e1ee4b..5cc8595dd0e4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -184,6 +184,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_physical_package_id
 #define topology_physical_package_id(cpu)  ((void)(cpu), -1)
 #endif
+#ifndef topology_die_id
+#define topology_die_id(cpu)   ((void)(cpu), -1)
+#endif
 #ifndef topology_core_id
 #define topology_core_id(cpu)  ((void)(cpu), 0)
 #endif
-- 
2.18.0-rc0



[PATCH 06/19] powercap/intel_rapl: Simplify rapl_find_package()

2019-05-13 Thread Len Brown
From: Zhang Rui 

Syntax only, no functional or semantic change.

Simplify how the code to discover a package is called.
Rename find_package_by_id() to rapl_find_package_domain()

Signed-off-by: Zhang Rui 
Signed-off-by: Len Brown 
Acked-by: Rafael J. Wysocki 
Cc: linux...@vger.kernel.org
---
 drivers/powercap/intel_rapl.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c
index 4347f15165f8..3c3c0c23180b 100644
--- a/drivers/powercap/intel_rapl.c
+++ b/drivers/powercap/intel_rapl.c
@@ -264,8 +264,9 @@ static struct powercap_control_type *control_type; /* 
PowerCap Controller */
 static struct rapl_domain *platform_rapl_domain; /* Platform (PSys) domain */
 
 /* caller to ensure CPU hotplug lock is held */
-static struct rapl_package *find_package_by_id(int id)
+static struct rapl_package *rapl_find_package_domain(int cpu)
 {
+   int id = topology_physical_package_id(cpu);
struct rapl_package *rp;
 
list_for_each_entry(rp, &rapl_packages, plist) {
@@ -1300,7 +1301,7 @@ static int __init rapl_register_psys(void)
rd->rpl[0].name = pl1_name;
rd->rpl[1].prim_id = PL2_ENABLE;
rd->rpl[1].name = pl2_name;
-   rd->rp = find_package_by_id(0);
+   rd->rp = rapl_find_package_domain(0);
 
power_zone = powercap_register_zone(&rd->power_zone, control_type,
"psys", NULL,
@@ -1456,8 +1457,9 @@ static void rapl_remove_package(struct rapl_package *rp)
 }
 
 /* called from CPU hotplug notifier, hotplug lock held */
-static struct rapl_package *rapl_add_package(int cpu, int pkgid)
+static struct rapl_package *rapl_add_package(int cpu)
 {
+   int id = topology_physical_package_id(cpu);
struct rapl_package *rp;
int ret;
 
@@ -1466,7 +1468,7 @@ static struct rapl_package *rapl_add_package(int cpu, int 
pkgid)
return ERR_PTR(-ENOMEM);
 
/* add the new package to the list */
-   rp->id = pkgid;
+   rp->id = id;
rp->lead_cpu = cpu;
 
/* check if the package contains valid domains */
@@ -1497,12 +1499,11 @@ static struct rapl_package *rapl_add_package(int cpu, 
int pkgid)
  */
 static int rapl_cpu_online(unsigned int cpu)
 {
-   int pkgid = topology_physical_package_id(cpu);
struct rapl_package *rp;
 
-   rp = find_package_by_id(pkgid);
+   rp = rapl_find_package_domain(cpu);
if (!rp) {
-   rp = rapl_add_package(cpu, pkgid);
+   rp = rapl_add_package(cpu);
if (IS_ERR(rp))
return PTR_ERR(rp);
}
@@ -1512,11 +1513,10 @@ static int rapl_cpu_online(unsigned int cpu)
 
 static int rapl_cpu_down_prep(unsigned int cpu)
 {
-   int pkgid = topology_physical_package_id(cpu);
struct rapl_package *rp;
int lead_cpu;
 
-   rp = find_package_by_id(pkgid);
+   rp = rapl_find_package_domain(cpu);
if (!rp)
return 0;
 
-- 
2.18.0-rc0



[PATCH 08/19] thermal/x86_pkg_temp_thermal: Support multi-die/package

2019-05-13 Thread Len Brown
From: Zhang Rui 

Package temperature sensors are actually implemented in hardware per-die.
Thus, the new multi-die/package systems sport multiple package thermal
zones for each package.

Update the x86_pkg_temp_thermal to be "multi-die-aware", so it can
expose multiple zones per package, instead of just one.

Signed-off-by: Zhang Rui 
Signed-off-by: Len Brown 
---
 drivers/thermal/intel/x86_pkg_temp_thermal.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/thermal/intel/x86_pkg_temp_thermal.c 
b/drivers/thermal/intel/x86_pkg_temp_thermal.c
index 1ef937d799e4..405b3858900a 100644
--- a/drivers/thermal/intel/x86_pkg_temp_thermal.c
+++ b/drivers/thermal/intel/x86_pkg_temp_thermal.c
@@ -122,7 +122,7 @@ static int pkg_temp_debugfs_init(void)
  */
 static struct pkg_device *pkg_temp_thermal_get_dev(unsigned int cpu)
 {
-   int pkgid = topology_logical_package_id(cpu);
+   int pkgid = topology_logical_die_id(cpu);
 
if (pkgid >= 0 && pkgid < max_packages)
return packages[pkgid];
@@ -353,7 +353,7 @@ static int pkg_thermal_notify(u64 msr_val)
 
 static int pkg_temp_thermal_device_add(unsigned int cpu)
 {
-   int pkgid = topology_logical_package_id(cpu);
+   int pkgid = topology_logical_die_id(cpu);
u32 tj_max, eax, ebx, ecx, edx;
struct pkg_device *pkgdev;
int thres_count, err;
@@ -449,7 +449,7 @@ static int pkg_thermal_cpu_offline(unsigned int cpu)
 * worker will see the package anymore.
 */
if (lastcpu) {
-   packages[topology_logical_package_id(cpu)] = NULL;
+   packages[topology_logical_die_id(cpu)] = NULL;
/* After this point nothing touches the MSR anymore. */
wrmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT,
  pkgdev->msr_pkg_therm_low, pkgdev->msr_pkg_therm_high);
@@ -515,7 +515,7 @@ static int __init pkg_temp_thermal_init(void)
if (!x86_match_cpu(pkg_temp_thermal_ids))
return -ENODEV;
 
-   max_packages = topology_max_packages();
+   max_packages = topology_max_packages() * topology_max_die_per_package();
packages = kcalloc(max_packages, sizeof(struct pkg_device *),
   GFP_KERNEL);
if (!packages)
-- 
2.18.0-rc0



[PATCH 02/19] x86 topology: Create topology_max_die_per_package()

2019-05-13 Thread Len Brown
From: Len Brown 

topology_max_packages() is available to size resources to
cover all packages in the system.

But now we have multi-die/package systems, and some
resources are per-die.

Create topology_max_die_per_package(), for detecting
multi-die/package systems, and sizing any per-die resources.

Signed-off-by: Len Brown 
---
 arch/x86/include/asm/processor.h |  1 -
 arch/x86/include/asm/topology.h  | 10 ++
 arch/x86/kernel/cpu/topology.c   |  5 -
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 00fc03a8da59..87d42c0c6ccc 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -106,7 +106,6 @@ struct cpuinfo_x86 {
unsigned long   loops_per_jiffy;
/* cpuid returned max cores value: */
u16 x86_max_cores;
-   u16 x86_max_dies;
u16 apicid;
u16 initial_apicid;
u16 x86_clflush_size;
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38a1c33..e0232f7042c3 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -115,6 +115,13 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()(__max_logical_packages)
 
+extern unsigned int __max_die_per_package;
+
+static inline int topology_max_die_per_package(void)
+{
+   return __max_die_per_package;
+}
+
 extern int __max_smt_threads;
 
 static inline int topology_max_smt_threads(void)
@@ -131,6 +138,9 @@ bool topology_smt_supported(void);
 static inline int
 topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; 
}
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
+static inline int topology_phys_to_logical_die(unsigned int die,
+   unsigned int cpu) { return 0; }
+static inline int topology_max_die_per_package(void) { return 1; }
 static inline int topology_max_smt_threads(void) { return 1; }
 static inline bool topology_is_primary_thread(unsigned int cpu) { return true; 
}
 static inline bool topology_smt_supported(void) { return false; }
diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index 4d17e699657d..ee48c3fc8a65 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -26,6 +26,9 @@
#define LEVEL_MAX_SIBLINGS(ebx)	((ebx) & 0xffff)
 
 #ifdef CONFIG_SMP
+unsigned int __max_die_per_package __read_mostly = 1;
+EXPORT_SYMBOL(__max_die_per_package);
+
 /*
 * Check if given CPUID extended topology "leaf" is implemented
  */
@@ -146,7 +149,7 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
 
c->x86_max_cores = (core_level_siblings / smp_num_siblings);
-   c->x86_max_dies = (die_level_siblings / core_level_siblings);
+   __max_die_per_package = (die_level_siblings / core_level_siblings);
 #endif
return 0;
 }
-- 
2.18.0-rc0



[PATCH 12/19] topology: Create core_cpus and die_cpus sysfs attributes

2019-05-13 Thread Len Brown
From: Len Brown 

Create CPU topology sysfs attributes: "core_cpus" and "core_cpus_list"

These attributes represent all of the logical CPUs that share the
same core.

These attributes are synonymous with the existing "thread_siblings" and
"thread_siblings_list" attributes, which will be deprecated.

Create CPU topology sysfs attributes: "die_cpus" and "die_cpus_list".
These attributes represent all of the logical CPUs that share the
same die.

Signed-off-by: Len Brown 
Suggested-by: Brice Goglin 
---
 Documentation/cputopology.txt   | 21 +++--
 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 22 ++
 arch/x86/xen/smp_pv.c   |  1 +
 drivers/base/topology.c | 12 
 include/linux/topology.h|  3 +++
 7 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index 48af5c290e20..b90dafcc8237 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -36,15 +36,15 @@ drawer_id:
identifier (rather than the kernel's).  The actual value is
architecture and platform dependent.
 
-thread_siblings:
+core_cpus:
 
-   internal kernel map of cpuX's hardware threads within the same
-   core as cpuX.
+   internal kernel map of CPUs within the same core.
+   (deprecated name: "thread_siblings")
 
-thread_siblings_list:
+core_cpus_list:
 
-   human-readable list of cpuX's hardware threads within the same
-   core as cpuX.
+   human-readable list of CPUs within the same core.
+   (deprecated name: "thread_siblings_list");
 
 package_cpus:
 
@@ -56,6 +56,14 @@ package_cpus_list:
human-readable list of CPUs sharing the same physical_package_id.
(deprecated name: "core_siblings_list")
 
+die_cpus:
+
+   internal kernel map of CPUs within the same die.
+
+die_cpus_list:
+
+   human-readable list of CPUs within the same die.
+
 book_siblings:
 
internal kernel map of cpuX's hardware threads within the same
@@ -93,6 +101,7 @@ these macros in include/asm-XXX/topology.h::
#define topology_drawer_id(cpu)
#define topology_sibling_cpumask(cpu)
#define topology_core_cpumask(cpu)
+   #define topology_die_cpumask(cpu)
#define topology_book_cpumask(cpu)
#define topology_drawer_cpumask(cpu)
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df207b2..b673a226ad6c 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -23,6 +23,7 @@ extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9de16b4f6023..4b14d2318251 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #define topology_core_id(cpu)  (cpu_data(cpu).cpu_core_id)
 
 #ifdef CONFIG_SMP
+#define topology_die_cpumask(cpu)  (per_cpu(cpu_die_map, cpu))
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index a6e01b6c2709..1a19a5171949 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -91,6 +91,10 @@ EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 
+/* representing HT, core, and die siblings of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
+EXPORT_PER_CPU_SYMBOL(cpu_die_map);
+
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 
 /* Per CPU bogomips and other parameters */
@@ -509,6 +513,15 @@ static bool match_pkg(struct cpuinfo_x86 *c, struct 
cpuinfo_x86 *o)
return false;
 }
 
+static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+   if ((c->phys_proc_id == o->phys_proc_id) &&
+   (c->cpu_die_id == o->cpu_die_id))
+   return true;
+   return false;
+}
+
+
 #if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC)
 static inline int x86_sched_itmt_flags(void)
 {
@@ -571,6 +584,7 @@ void set_cpu_sibling_map(int cpu)
cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
c

[PATCH 16/19] thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to zones from packages

2019-05-13 Thread Len Brown
From: Len Brown 

Syntax update only -- no logical or functional change.

In response to the new multi-die/package changes, update variable names
to use the more generic thermal "zone" terminology, instead of "package",
as the zones can refer to either packages or die.

Signed-off-by: Len Brown 
Cc: Zhang Rui 
---
 drivers/thermal/intel/x86_pkg_temp_thermal.c | 142 ++-
 1 file changed, 72 insertions(+), 70 deletions(-)

diff --git a/drivers/thermal/intel/x86_pkg_temp_thermal.c 
b/drivers/thermal/intel/x86_pkg_temp_thermal.c
index 405b3858900a..87e929ffb0cb 100644
--- a/drivers/thermal/intel/x86_pkg_temp_thermal.c
+++ b/drivers/thermal/intel/x86_pkg_temp_thermal.c
@@ -55,7 +55,7 @@ MODULE_PARM_DESC(notify_delay_ms,
 */
#define MAX_NUMBER_OF_TRIPS	2
 
-struct pkg_device {
+struct zone_device {
int cpu;
bool	work_scheduled;
u32 tj_max;
@@ -70,10 +70,10 @@ static struct thermal_zone_params pkg_temp_tz_params = {
.no_hwmon   = true,
 };
 
-/* Keep track of how many package pointers we allocated in init() */
-static int max_packages __read_mostly;
-/* Array of package pointers */
-static struct pkg_device **packages;
+/* Keep track of how many zone pointers we allocated in init() */
+static int max_id __read_mostly;
+/* Array of zone pointers */
+static struct zone_device **zones;
 /* Serializes interrupt notification, work and hotplug */
 static DEFINE_SPINLOCK(pkg_temp_lock);
 /* Protects zone operation in the work function against hotplug removal */
@@ -120,12 +120,12 @@ static int pkg_temp_debugfs_init(void)
  *
  * - Other callsites: Must hold pkg_temp_lock
  */
-static struct pkg_device *pkg_temp_thermal_get_dev(unsigned int cpu)
+static struct zone_device *pkg_temp_thermal_get_dev(unsigned int cpu)
 {
-   int pkgid = topology_logical_die_id(cpu);
+   int id = topology_logical_die_id(cpu);
 
-   if (pkgid >= 0 && pkgid < max_packages)
-   return packages[pkgid];
+   if (id >= 0 && id < max_id)
+   return zones[id];
return NULL;
 }
 
@@ -150,12 +150,13 @@ static int get_tj_max(int cpu, u32 *tj_max)
 
 static int sys_get_curr_temp(struct thermal_zone_device *tzd, int *temp)
 {
-   struct pkg_device *pkgdev = tzd->devdata;
+   struct zone_device *zonedev = tzd->devdata;
u32 eax, edx;
 
-   rdmsr_on_cpu(pkgdev->cpu, MSR_IA32_PACKAGE_THERM_STATUS, &eax, &edx);
+   rdmsr_on_cpu(zonedev->cpu, MSR_IA32_PACKAGE_THERM_STATUS,
+   &eax, &edx);
if (eax & 0x80000000) {
-   *temp = pkgdev->tj_max - ((eax >> 16) & 0x7f) * 1000;
+   *temp = zonedev->tj_max - ((eax >> 16) & 0x7f) * 1000;
pr_debug("sys_get_curr_temp %d\n", *temp);
return 0;
}
@@ -165,7 +166,7 @@ static int sys_get_curr_temp(struct thermal_zone_device 
*tzd, int *temp)
 static int sys_get_trip_temp(struct thermal_zone_device *tzd,
 int trip, int *temp)
 {
-   struct pkg_device *pkgdev = tzd->devdata;
+   struct zone_device *zonedev = tzd->devdata;
unsigned long thres_reg_value;
u32 mask, shift, eax, edx;
int ret;
@@ -181,14 +182,14 @@ static int sys_get_trip_temp(struct thermal_zone_device 
*tzd,
shift = THERM_SHIFT_THRESHOLD0;
}
 
-   ret = rdmsr_on_cpu(pkgdev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT,
+   ret = rdmsr_on_cpu(zonedev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT,
   &eax, &edx);
if (ret < 0)
return ret;
 
thres_reg_value = (eax & mask) >> shift;
if (thres_reg_value)
-   *temp = pkgdev->tj_max - thres_reg_value * 1000;
+   *temp = zonedev->tj_max - thres_reg_value * 1000;
else
*temp = 0;
pr_debug("sys_get_trip_temp %d\n", *temp);
@@ -199,14 +200,14 @@ static int sys_get_trip_temp(struct thermal_zone_device 
*tzd,
 static int
 sys_set_trip_temp(struct thermal_zone_device *tzd, int trip, int temp)
 {
-   struct pkg_device *pkgdev = tzd->devdata;
+   struct zone_device *zonedev = tzd->devdata;
u32 l, h, mask, shift, intr;
int ret;
 
-   if (trip >= MAX_NUMBER_OF_TRIPS || temp >= pkgdev->tj_max)
+   if (trip >= MAX_NUMBER_OF_TRIPS || temp >= zonedev->tj_max)
return -EINVAL;
 
-   ret = rdmsr_on_cpu(pkgdev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT,
+   ret = rdmsr_on_cpu(zonedev->cpu, MSR_IA32_PACKAGE_THERM_INTERRUPT,
   &eax, &edx);
if (ret < 0)
return ret;
@@ -228,11 +229,12 @@ sys_set_trip_temp(struct thermal_zone_device *tzd, int 
trip, int temp)
if (!temp) {
l

[PATCH 18/19] perf/x86/intel/uncore: Cosmetic renames in response to multi-die/pkg support

2019-05-13 Thread Len Brown
From: Kan Liang 

Syntax update only -- no logical or functional change.

In response to the new multi-die/package changes, update variable names
to use "die" terminology, instead of "pkg".

For previous platforms, which don't have multi-die, "die" is identical
to "pkg".

Signed-off-by: Kan Liang 
Signed-off-by: Len Brown 
---
 arch/x86/events/intel/uncore.c   | 74 ++--
 arch/x86/events/intel/uncore.h   |  4 +-
 arch/x86/events/intel/uncore_snbep.c |  4 +-
 3 files changed, 41 insertions(+), 41 deletions(-)

diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
index aeb5eae83750..f943e4d0e66c 100644
--- a/arch/x86/events/intel/uncore.c
+++ b/arch/x86/events/intel/uncore.c
@@ -14,7 +14,7 @@ struct pci_driver *uncore_pci_driver;
 DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
 struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_extra_dev *uncore_extra_pci_dev;
-static int max_packages;
+static int max_dies;
 
 /* mask of cpus that collect uncore events */
 static cpumask_t uncore_cpu_mask;
@@ -100,13 +100,13 @@ ssize_t uncore_event_show(struct kobject *kobj,
 
 struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int 
cpu)
 {
-   unsigned int pkgid = topology_logical_die_id(cpu);
+   unsigned int dieid = topology_logical_die_id(cpu);
 
/*
 * The unsigned check also catches the '-1' return value for non
 * existent mappings in the topology map.
 */
-   return pkgid < max_packages ? pmu->boxes[pkgid] : NULL;
+   return dieid < max_dies ? pmu->boxes[dieid] : NULL;
 }
 
 u64 uncore_msr_read_counter(struct intel_uncore_box *box, struct perf_event 
*event)
@@ -311,7 +311,7 @@ static struct intel_uncore_box *uncore_alloc_box(struct 
intel_uncore_type *type,
uncore_pmu_init_hrtimer(box);
box->cpu = -1;
box->pci_phys_id = -1;
-   box->pkgid = -1;
+   box->dieid = -1;
 
/* set default hrtimer timeout */
box->hrtimer_duration = UNCORE_PMU_HRTIMER_INTERVAL;
@@ -826,10 +826,10 @@ static void uncore_pmu_unregister(struct intel_uncore_pmu 
*pmu)
 
 static void uncore_free_boxes(struct intel_uncore_pmu *pmu)
 {
-   int pkg;
+   int die;
 
-   for (pkg = 0; pkg < max_packages; pkg++)
-   kfree(pmu->boxes[pkg]);
+   for (die = 0; die < max_dies; die++)
+   kfree(pmu->boxes[die]);
kfree(pmu->boxes);
 }
 
@@ -866,7 +866,7 @@ static int __init uncore_type_init(struct intel_uncore_type 
*type, bool setid)
if (!pmus)
return -ENOMEM;
 
-   size = max_packages * sizeof(struct intel_uncore_box *);
+   size = max_dies * sizeof(struct intel_uncore_box *);
 
for (i = 0; i < type->num_boxes; i++) {
pmus[i].func_id = setid ? i : -1;
@@ -936,21 +936,21 @@ static int uncore_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id
struct intel_uncore_type *type;
struct intel_uncore_pmu *pmu = NULL;
struct intel_uncore_box *box;
-   int phys_id, pkg, ret;
+   int phys_id, die, ret;
 
phys_id = uncore_pcibus_to_physid(pdev->bus);
if (phys_id < 0)
return -ENODEV;
 
-   pkg = (topology_max_die_per_package() > 1) ? phys_id :
+   die = (topology_max_die_per_package() > 1) ? phys_id :
topology_phys_to_logical_pkg(phys_id);
-   if (pkg < 0)
+   if (die < 0)
return -EINVAL;
 
if (UNCORE_PCI_DEV_TYPE(id->driver_data) == UNCORE_EXTRA_PCI_DEV) {
int idx = UNCORE_PCI_DEV_IDX(id->driver_data);
 
-   uncore_extra_pci_dev[pkg].dev[idx] = pdev;
+   uncore_extra_pci_dev[die].dev[idx] = pdev;
pci_set_drvdata(pdev, NULL);
return 0;
}
@@ -989,7 +989,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id
pmu = &type->pmus[UNCORE_PCI_DEV_IDX(id->driver_data)];
}
 
-   if (WARN_ON_ONCE(pmu->boxes[pkg] != NULL))
+   if (WARN_ON_ONCE(pmu->boxes[die] != NULL))
return -EINVAL;
 
box = uncore_alloc_box(type, NUMA_NO_NODE);
@@ -1003,13 +1003,13 @@ static int uncore_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id
 
atomic_inc(&box->refcnt);
box->pci_phys_id = phys_id;
-   box->pkgid = pkg;
+   box->dieid = die;
box->pci_dev = pdev;
box->pmu = pmu;
uncore_box_init(box);
pci_set_drvdata(pdev, box);
 
-   pmu->boxes[pkg] = box;
+   pmu->boxes[die] = box;
if (atomic_inc_return(&pmu->activeboxes) > 1)
return 0;
 
@@ -1017,7 +1017,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id
ret = un

[PATCH 17/19] hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages

2019-05-13 Thread Len Brown
From: Len Brown 

Syntax update only -- no logical or functional change.

In response to the new multi-die/package changes, update variable names
to use the more generic thermal "zone" terminology, instead of "package",
as the zones can refer to either packages or die.

Signed-off-by: Len Brown 
Cc: Zhang Rui 
---
 drivers/hwmon/coretemp.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c
index c64ce32d3214..4ebed90d81aa 100644
--- a/drivers/hwmon/coretemp.c
+++ b/drivers/hwmon/coretemp.c
@@ -109,10 +109,10 @@ struct platform_data {
struct device_attribute name_attr;
 };
 
-/* Keep track of how many package pointers we allocated in init() */
-static int max_packages __read_mostly;
-/* Array of package pointers. Serialized by cpu hotplug lock */
-static struct platform_device **pkg_devices;
+/* Keep track of how many zone pointers we allocated in init() */
+static int max_zones __read_mostly;
+/* Array of zone pointers. Serialized by cpu hotplug lock */
+static struct platform_device **zone_devices;
 
 static ssize_t show_label(struct device *dev,
struct device_attribute *devattr, char *buf)
@@ -435,10 +435,10 @@ static int chk_ucode_version(unsigned int cpu)
 
 static struct platform_device *coretemp_get_pdev(unsigned int cpu)
 {
-   int pkgid = topology_logical_die_id(cpu);
+   int id = topology_logical_die_id(cpu);
 
-   if (pkgid >= 0 && pkgid < max_packages)
-   return pkg_devices[pkgid];
+   if (id >= 0 && id < max_zones)
+   return zone_devices[id];
return NULL;
 }
 
@@ -544,7 +544,7 @@ static int coretemp_probe(struct platform_device *pdev)
struct device *dev = &pdev->dev;
struct platform_data *pdata;
 
-   /* Initialize the per-package data structures */
+   /* Initialize the per-zone data structures */
pdata = devm_kzalloc(dev, sizeof(struct platform_data), GFP_KERNEL);
if (!pdata)
return -ENOMEM;
@@ -579,13 +579,13 @@ static struct platform_driver coretemp_driver = {
 
 static struct platform_device *coretemp_device_add(unsigned int cpu)
 {
-   int err, pkgid = topology_logical_die_id(cpu);
+   int err, zoneid = topology_logical_die_id(cpu);
struct platform_device *pdev;
 
-   if (pkgid < 0)
+   if (zoneid < 0)
return ERR_PTR(-ENOMEM);
 
-   pdev = platform_device_alloc(DRVNAME, pkgid);
+   pdev = platform_device_alloc(DRVNAME, zoneid);
if (!pdev)
return ERR_PTR(-ENOMEM);
 
@@ -595,7 +595,7 @@ static struct platform_device *coretemp_device_add(unsigned 
int cpu)
return ERR_PTR(err);
}
 
-   pkg_devices[pkgid] = pdev;
+   zone_devices[zoneid] = pdev;
return pdev;
 }
 
@@ -703,7 +703,7 @@ static int coretemp_cpu_offline(unsigned int cpu)
 * the rest.
 */
if (cpumask_empty(&pd->cpumask)) {
-   pkg_devices[topology_logical_die_id(cpu)] = NULL;
+   zone_devices[topology_logical_die_id(cpu)] = NULL;
platform_device_unregister(pdev);
return 0;
}
@@ -741,10 +741,10 @@ static int __init coretemp_init(void)
if (!x86_match_cpu(coretemp_ids))
return -ENODEV;
 
-   max_packages = topology_max_packages() * topology_max_die_per_package();
-   pkg_devices = kcalloc(max_packages, sizeof(struct platform_device *),
+   max_zones = topology_max_packages() * topology_max_die_per_package();
+   zone_devices = kcalloc(max_zones, sizeof(struct platform_device *),
  GFP_KERNEL);
-   if (!pkg_devices)
+   if (!zone_devices)
return -ENOMEM;
 
err = platform_driver_register(&coretemp_driver);
@@ -760,7 +760,7 @@ static int __init coretemp_init(void)
 
 outdrv:
platform_driver_unregister(&coretemp_driver);
-   kfree(pkg_devices);
+   kfree(zone_devices);
return err;
 }
 module_init(coretemp_init)
@@ -769,7 +769,7 @@ static void __exit coretemp_exit(void)
 {
cpuhp_remove_state(coretemp_hp_online);
platform_driver_unregister(&coretemp_driver);
-   kfree(pkg_devices);
+   kfree(zone_devices);
 }
 module_exit(coretemp_exit)
 
-- 
2.18.0-rc0



[PATCH 13/19] perf/x86/intel/uncore: Support multi-die/package

2019-05-13 Thread Len Brown
From: Kan Liang 

Uncore becomes die-scope on Xeon Cascade Lake-AP. Uncore driver needs to
support die-scope uncore units.

Use topology_logical_die_id() to replace topology_logical_package_id().
For previous platforms, which don't have multi-die,
topology_logical_die_id() is identical to topology_logical_package_id().

In pci_probe/remove, the group id read from the PCI bus is the logical
die id on multi-die systems.

Use topology_die_cpumask() to replace topology_core_cpumask().
For previous platforms, which don't have multi-die,
topology_die_cpumask() is identical to topology_core_cpumask().

There is no functional change for previous platforms.

Signed-off-by: Kan Liang 
Cc: Peter Zijlstra 
Signed-off-by: Len Brown 
---
 arch/x86/events/intel/uncore.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
index fc40a1473058..aeb5eae83750 100644
--- a/arch/x86/events/intel/uncore.c
+++ b/arch/x86/events/intel/uncore.c
@@ -100,7 +100,7 @@ ssize_t uncore_event_show(struct kobject *kobj,
 
 struct intel_uncore_box *uncore_pmu_to_box(struct intel_uncore_pmu *pmu, int 
cpu)
 {
-   unsigned int pkgid = topology_logical_package_id(cpu);
+   unsigned int pkgid = topology_logical_die_id(cpu);
 
/*
 * The unsigned check also catches the '-1' return value for non
@@ -942,7 +942,8 @@ static int uncore_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id
if (phys_id < 0)
return -ENODEV;
 
-   pkg = topology_phys_to_logical_pkg(phys_id);
+   pkg = (topology_max_die_per_package() > 1) ? phys_id :
+   topology_phys_to_logical_pkg(phys_id);
if (pkg < 0)
return -EINVAL;
 
@@ -1033,7 +1034,8 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 
box = pci_get_drvdata(pdev);
if (!box) {
-   pkg = topology_phys_to_logical_pkg(phys_id);
+   pkg = (topology_max_die_per_package() > 1) ? phys_id :
+   topology_phys_to_logical_pkg(phys_id);
for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
if (uncore_extra_pci_dev[pkg].dev[i] == pdev) {
uncore_extra_pci_dev[pkg].dev[i] = NULL;
@@ -1110,7 +1112,7 @@ static void uncore_change_type_ctx(struct 
intel_uncore_type *type, int old_cpu,
struct intel_uncore_box *box;
int i, pkg;
 
-   pkg = topology_logical_package_id(old_cpu < 0 ? new_cpu : old_cpu);
+   pkg = topology_logical_die_id(old_cpu < 0 ? new_cpu : old_cpu);
for (i = 0; i < type->num_boxes; i++, pmu++) {
box = pmu->boxes[pkg];
if (!box)
@@ -1151,7 +1153,7 @@ static int uncore_event_cpu_offline(unsigned int cpu)
if (!cpumask_test_and_clear_cpu(cpu, &uncore_cpu_mask))
goto unref;
/* Find a new cpu to collect uncore events */
-   target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
+   target = cpumask_any_but(topology_die_cpumask(cpu), cpu);
 
/* Migrate uncore events to the new target */
if (target < nr_cpu_ids)
@@ -1164,7 +1166,7 @@ static int uncore_event_cpu_offline(unsigned int cpu)
 
 unref:
/* Clear the references */
-   pkg = topology_logical_package_id(cpu);
+   pkg = topology_logical_die_id(cpu);
for (; *types; types++) {
type = *types;
pmu = type->pmus;
@@ -1223,7 +1225,7 @@ static int uncore_event_cpu_online(unsigned int cpu)
struct intel_uncore_box *box;
int i, ret, pkg, target;
 
-   pkg = topology_logical_package_id(cpu);
+   pkg = topology_logical_die_id(cpu);
ret = allocate_boxes(types, pkg, cpu);
if (ret)
return ret;
@@ -1242,7 +1244,7 @@ static int uncore_event_cpu_online(unsigned int cpu)
 * Check if there is an online cpu in the package
 * which collects uncore events already.
 */
-   target = cpumask_any_and(&uncore_cpu_mask, topology_core_cpumask(cpu));
+   target = cpumask_any_and(&uncore_cpu_mask, topology_die_cpumask(cpu));
if (target < nr_cpu_ids)
return 0;
 
@@ -1417,7 +1419,7 @@ static int __init intel_uncore_init(void)
if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
return -ENODEV;
 
-   max_packages = topology_max_packages();
+   max_packages = topology_max_packages() * topology_max_die_per_package();
 
uncore_init = (struct intel_uncore_init_fun *)id->driver_data;
if (uncore_init->pci_init) {
-- 
2.18.0-rc0



[PATCH 15/19] perf/x86/intel/cstate: Support multi-die/package

2019-05-13 Thread Len Brown
From: Kan Liang 

Some cstate counters become die-scope on Xeon Cascade Lake-AP. The perf
cstate driver needs to support die-scope cstate counters.

Use topology_die_cpumask() to replace topology_core_cpumask().
For previous platforms, which don't have multi-die,
topology_die_cpumask() is identical to topology_core_cpumask().
There is no functional change for previous platforms.

Name the die-scope PMU "cstate_die".

Signed-off-by: Kan Liang 
Cc: Peter Zijlstra 
Signed-off-by: Len Brown 
---
 arch/x86/events/intel/cstate.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c
index 6072f92cb8ea..267d7f8e12ab 100644
--- a/arch/x86/events/intel/cstate.c
+++ b/arch/x86/events/intel/cstate.c
@@ -302,7 +302,7 @@ static int cstate_pmu_event_init(struct perf_event *event)
return -EINVAL;
event->hw.event_base = pkg_msr[cfg].msr;
cpu = cpumask_any_and(&cstate_pkg_cpu_mask,
- topology_core_cpumask(event->cpu));
+ topology_die_cpumask(event->cpu));
} else {
return -ENOENT;
}
@@ -385,7 +385,7 @@ static int cstate_cpu_exit(unsigned int cpu)
if (has_cstate_pkg &&
cpumask_test_and_clear_cpu(cpu, &cstate_pkg_cpu_mask)) {
 
-   target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
+   target = cpumask_any_but(topology_die_cpumask(cpu), cpu);
/* Migrate events if there is a valid target */
if (target < nr_cpu_ids) {
cpumask_set_cpu(target, &cstate_pkg_cpu_mask);
@@ -414,7 +414,7 @@ static int cstate_cpu_init(unsigned int cpu)
 * in the package cpu mask as the designated reader.
 */
target = cpumask_any_and(&cstate_pkg_cpu_mask,
-topology_core_cpumask(cpu));
+topology_die_cpumask(cpu));
if (has_cstate_pkg && target >= nr_cpu_ids)
cpumask_set_cpu(cpu, &cstate_pkg_cpu_mask);
 
@@ -663,7 +663,13 @@ static int __init cstate_init(void)
}
 
if (has_cstate_pkg) {
-   err = perf_pmu_register(&cstate_pkg_pmu, cstate_pkg_pmu.name, -1);
+   if (topology_max_die_per_package() > 1) {
+   err = perf_pmu_register(&cstate_pkg_pmu,
+   "cstate_die", -1);
+   } else {
+   err = perf_pmu_register(&cstate_pkg_pmu,
+   cstate_pkg_pmu.name, -1);
+   }
if (err) {
has_cstate_pkg = false;
pr_info("Failed to register cstate pkg pmu\n");
-- 
2.18.0-rc0



[PATCH 09/19] powercap/intel_rapl: Update RAPL domain name and debug messages

2019-05-13 Thread Len Brown
From: Zhang Rui 

The RAPL domain "name" attribute contains "Package-N",
which is ambiguous on multi-die per-package systems.

Update the name to "package-X-die-Y" on those systems.

No change on systems without multi-die/package.

Update driver debug messages.
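
For reference, here is a minimal sketch of how such a zone name can be
composed. (Illustrative only -- not the driver's literal code; the
rapl_init_domain_name() helper is hypothetical, while rp->name and
PACKAGE_DOMAIN_NAME_LENGTH come from the patch below.)

	#include <linux/kernel.h>
	#include <linux/topology.h>

	/* Hypothetical helper: compose the zone name for @rp led by @cpu */
	static void rapl_init_domain_name(struct rapl_package *rp, int cpu)
	{
		if (topology_max_die_per_package() > 1)
			snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH,
				 "package-%d-die-%d",
				 topology_physical_package_id(cpu),
				 topology_die_id(cpu));
		else
			snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH,
				 "package-%d",
				 topology_physical_package_id(cpu));
	}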

Signed-off-by: Zhang Rui 
Signed-off-by: Len Brown 
Acked-by: Rafael J. Wysocki 
Cc: linux...@vger.kernel.org
---
 drivers/powercap/intel_rapl.c | 57 ---
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c
index 9202dbcef96d..ad78c1d08260 100644
--- a/drivers/powercap/intel_rapl.c
+++ b/drivers/powercap/intel_rapl.c
@@ -178,12 +178,15 @@ struct rapl_domain {
 #define power_zone_to_rapl_domain(_zone) \
container_of(_zone, struct rapl_domain, power_zone)
 
+/* maximum rapl package domain name: package-%d-die-%d */
+#define PACKAGE_DOMAIN_NAME_LENGTH 30
 
-/* Each physical package contains multiple domains, these are the common
+
+/* Each rapl package contains multiple domains, these are the common
  * data across RAPL domains within a package.
  */
 struct rapl_package {
-   unsigned int id; /* physical package/socket id */
+   unsigned int id; /* logical die id; equals the physical id on 1-die systems */
unsigned int nr_domains;
unsigned long domain_map; /* bit map of active domains */
unsigned int power_unit;
@@ -198,6 +201,7 @@ struct rapl_package {
int lead_cpu; /* one active cpu per package for access */
/* Track active cpus */
struct cpumask cpumask;
+   char name[PACKAGE_DOMAIN_NAME_LENGTH];
 };
 
 struct rapl_defaults {
@@ -926,8 +930,8 @@ static int rapl_check_unit_core(struct rapl_package *rp, 
int cpu)
value = (msr_val & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET;
rp->time_unit = 100 / (1 << value);
 
-   pr_debug("Core CPU package %d energy=%dpJ, time=%dus, power=%duW\n",
-   rp->id, rp->energy_unit, rp->time_unit, rp->power_unit);
+   pr_debug("Core CPU %s energy=%dpJ, time=%dus, power=%duW\n",
+   rp->name, rp->energy_unit, rp->time_unit, rp->power_unit);
 
return 0;
 }
@@ -951,8 +955,8 @@ static int rapl_check_unit_atom(struct rapl_package *rp, 
int cpu)
value = (msr_val & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET;
rp->time_unit = 100 / (1 << value);
 
-   pr_debug("Atom package %d energy=%dpJ, time=%dus, power=%duW\n",
-   rp->id, rp->energy_unit, rp->time_unit, rp->power_unit);
+   pr_debug("Atom %s energy=%dpJ, time=%dus, power=%duW\n",
+   rp->name, rp->energy_unit, rp->time_unit, rp->power_unit);
 
return 0;
 }
@@ -1181,7 +1185,7 @@ static void rapl_update_domain_data(struct rapl_package 
*rp)
u64 val;
 
for (dmn = 0; dmn < rp->nr_domains; dmn++) {
-   pr_debug("update package %d domain %s data\n", rp->id,
+   pr_debug("update %s domain %s data\n", rp->name,
 rp->domains[dmn].name);
/* exclude non-raw primitives */
for (prim = 0; prim < NR_RAW_PRIMITIVES; prim++) {
@@ -1206,7 +1210,6 @@ static void rapl_unregister_powercap(void)
 static int rapl_package_register_powercap(struct rapl_package *rp)
 {
struct rapl_domain *rd;
-   char dev_name[17]; /* max domain name = 7 + 1 + 8 for int + 1 for null*/
struct powercap_zone *power_zone = NULL;
int nr_pl, ret;
 
@@ -1217,20 +1220,16 @@ static int rapl_package_register_powercap(struct 
rapl_package *rp)
for (rd = rp->domains; rd < rp->domains + rp->nr_domains; rd++) {
if (rd->id == RAPL_DOMAIN_PACKAGE) {
nr_pl = find_nr_power_limit(rd);
-   pr_debug("register socket %d package domain %s\n",
-   rp->id, rd->name);
-   memset(dev_name, 0, sizeof(dev_name));
-   snprintf(dev_name, sizeof(dev_name), "%s-%d",
-   rd->name, rp->id);
+   pr_debug("register package domain %s\n", rp->name);
power_zone = powercap_register_zone(&rd->power_zone,
control_type,
-   dev_name, NULL,
+   rp->name, NULL,
&zone_ops[rd->id],
nr_pl,
&constraint_ops);
if (IS_ERR(power_zone)) {
-   pr_debug("fai

[PATCH 19/19] perf/x86/intel/rapl: Cosmetic rename internal variables in response to multi-die/pkg support

2019-05-13 Thread Len Brown
From: Kan Liang 

Syntax update only -- no logical or functional change.

In response to the new multi-die/package changes, update variable names
to use "die" terminology, instead of "pkg".

For previous platforms, which don't have multi-die, "die" is identical
to "pkg".

Signed-off-by: Kan Liang 
Signed-off-by: Len Brown 
---
 arch/x86/events/intel/rapl.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/intel/rapl.c
index 6f5331271563..3992b0e65a55 100644
--- a/arch/x86/events/intel/rapl.c
+++ b/arch/x86/events/intel/rapl.c
@@ -148,7 +148,7 @@ struct rapl_pmu {
 
 struct rapl_pmus {
struct pmu  pmu;
-   unsigned intmaxpkg;
+   unsigned intmaxdie;
struct rapl_pmu *pmus[];
 };
 
@@ -161,13 +161,13 @@ static u64 rapl_timer_ms;
 
 static inline struct rapl_pmu *cpu_to_rapl_pmu(unsigned int cpu)
 {
-   unsigned int pkgid = topology_logical_die_id(cpu);
+   unsigned int dieid = topology_logical_die_id(cpu);
 
/*
 * The unsigned check also catches the '-1' return value for non
 * existent mappings in the topology map.
 */
-   return pkgid < rapl_pmus->maxpkg ? rapl_pmus->pmus[pkgid] : NULL;
+   return dieid < rapl_pmus->maxdie ? rapl_pmus->pmus[dieid] : NULL;
 }
 
 static inline u64 rapl_read_counter(struct perf_event *event)
@@ -668,22 +668,22 @@ static void cleanup_rapl_pmus(void)
 {
int i;
 
-   for (i = 0; i < rapl_pmus->maxpkg; i++)
+   for (i = 0; i < rapl_pmus->maxdie; i++)
kfree(rapl_pmus->pmus[i]);
kfree(rapl_pmus);
 }
 
 static int __init init_rapl_pmus(void)
 {
-   int maxpkg = topology_max_packages() * topology_max_die_per_package();
+   int maxdie = topology_max_packages() * topology_max_die_per_package();
size_t size;
 
-   size = sizeof(*rapl_pmus) + maxpkg * sizeof(struct rapl_pmu *);
+   size = sizeof(*rapl_pmus) + maxdie * sizeof(struct rapl_pmu *);
rapl_pmus = kzalloc(size, GFP_KERNEL);
if (!rapl_pmus)
return -ENOMEM;
 
-   rapl_pmus->maxpkg   = maxpkg;
+   rapl_pmus->maxdie   = maxdie;
rapl_pmus->pmu.attr_groups  = rapl_attr_groups;
rapl_pmus->pmu.task_ctx_nr  = perf_invalid_context;
rapl_pmus->pmu.event_init   = rapl_pmu_event_init;
-- 
2.18.0-rc0



[PATCH 14/19] perf/x86/intel/rapl: Support multi-die/package

2019-05-13 Thread Len Brown
From: Kan Liang 

RAPL becomes die-scope on Xeon Cascade Lake-AP. The perf RAPL driver
needs to support die-scope RAPL domains.

Use topology_logical_die_id() to replace topology_logical_package_id().
For previous platforms, which don't have multi-die,
topology_logical_die_id() is identical to topology_logical_package_id().

Use topology_die_cpumask() to replace topology_core_cpumask().
For previous platforms, which don't have multi-die,
topology_die_cpumask() is identical to topology_core_cpumask().

There is no functional change for previous platforms.

Signed-off-by: Kan Liang 
Cc: Peter Zijlstra 
Signed-off-by: Len Brown 
---
 arch/x86/events/intel/rapl.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/intel/rapl.c
index 37ebf6fc5415..6f5331271563 100644
--- a/arch/x86/events/intel/rapl.c
+++ b/arch/x86/events/intel/rapl.c
@@ -161,7 +161,7 @@ static u64 rapl_timer_ms;
 
 static inline struct rapl_pmu *cpu_to_rapl_pmu(unsigned int cpu)
 {
-   unsigned int pkgid = topology_logical_package_id(cpu);
+   unsigned int pkgid = topology_logical_die_id(cpu);
 
/*
 * The unsigned check also catches the '-1' return value for non
@@ -571,7 +571,7 @@ static int rapl_cpu_offline(unsigned int cpu)
 
pmu->cpu = -1;
/* Find a new cpu to collect rapl events */
-   target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
+   target = cpumask_any_but(topology_die_cpumask(cpu), cpu);
 
/* Migrate rapl events to the new target */
if (target < nr_cpu_ids) {
@@ -598,14 +598,14 @@ static int rapl_cpu_online(unsigned int cpu)
pmu->timer_interval = ms_to_ktime(rapl_timer_ms);
rapl_hrtimer_init(pmu);
 
-   rapl_pmus->pmus[topology_logical_package_id(cpu)] = pmu;
+   rapl_pmus->pmus[topology_logical_die_id(cpu)] = pmu;
}
 
/*
 * Check if there is an online cpu in the package which collects rapl
 * events already.
 */
-   target = cpumask_any_and(&rapl_cpu_mask, topology_core_cpumask(cpu));
+   target = cpumask_any_and(&rapl_cpu_mask, topology_die_cpumask(cpu));
if (target < nr_cpu_ids)
return 0;
 
@@ -675,7 +675,7 @@ static void cleanup_rapl_pmus(void)
 
 static int __init init_rapl_pmus(void)
 {
-   int maxpkg = topology_max_packages();
+   int maxpkg = topology_max_packages() * topology_max_die_per_package();
size_t size;
 
size = sizeof(*rapl_pmus) + maxpkg * sizeof(struct rapl_pmu *);
-- 
2.18.0-rc0



[PATCH 0/19] v6 multi-die/package topology support

2019-05-13 Thread Len Brown


This patch series does 4 things.

1. Parse the new CPUID.1F leaf to discover multi-die/package topology

2. Export multi-die topology inside the kernel

3. Update 4 places (coretemp, pkgtemp, rapl, perf) that need to know
   the difference between die and package-scope MSRs.

4. Export multi-die topology to user-space via sysfs

These changes should have no impact on cache topology,
NUMA topology, the Linux scheduler, or system performance.

These topology changes primarily impact parts of the kernel
and some applications that care about package MSR scope.
Also, some software is licensed per package, and other tools,
such as benchmark-reporting software, sometimes care about packages.

Changes since v5:

The patch numbers have decremented by 3, since the first 3 patches
in the original series are now upstream.

The last two "Cosmetic rename" patches have been replaced with
those by Kan, who did a more thorough variable rename than I had
originally proposed.


[PATCH 18/19] perf/x86/intel/uncore: Cosmetic renames in response to 
multi-die/pkg support
[PATCH 19/19] perf/x86/intel/rapl: Cosmetic rename internal variables 
in response to multi-die/pkg support

I'm not aware of any outstanding feedback on this series,
functional or cosmetic.

thanks,
Len Brown, Intel Opensource Technology Center

--
The following changes since commit a13f0655503a4a89df67fdc7cac6a7810795d4b3:

  Merge tag 'iommu-updates-v5.2' of 
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu (2019-05-13 
09:23:18 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git x86

for you to fetch changes up to 0ddb97e121397d37933233da303556141814fa47:

  perf/x86/intel/rapl: Cosmetic rename internal variables in response to 
multi-die/pkg support (2019-05-13 13:41:50 -0400)


Kan Liang (5):
  perf/x86/intel/uncore: Support multi-die/package
  perf/x86/intel/rapl: Support multi-die/package
  perf/x86/intel/cstate: Support multi-die/package
  perf/x86/intel/uncore: Cosmetic renames in response to multi-die/pkg 
support
  perf/x86/intel/rapl: Cosmetic rename internal variables in response to 
multi-die/pkg support

Len Brown (9):
  x86 topology: Add CPUID.1F multi-die/package support
  x86 topology: Create topology_max_die_per_package()
  cpu topology: Export die_id
  x86 topology: Define topology_die_id()
  x86 topology: Define topology_logical_die_id()
  topology: Create package_cpus sysfs attribute
  topology: Create core_cpus and die_cpus sysfs attributes
  thermal/x86_pkg_temp_thermal: Cosmetic: Rename internal variables to 
zones from packages
  hwmon/coretemp: Cosmetic: Rename internal variables to zones from packages

Zhang Rui (5):
  powercap/intel_rapl: Simplify rapl_find_package()
  powercap/intel_rapl: Support multi-die/package
  thermal/x86_pkg_temp_thermal: Support multi-die/package
  powercap/intel_rapl: Update RAPL domain name and debug messages
  hwmon/coretemp: Support multi-die/package

 Documentation/cputopology.txt|  48 ++---
 Documentation/x86/topology.rst   |   4 +
 arch/x86/events/intel/cstate.c   |  14 ++-
 arch/x86/events/intel/rapl.c |  20 ++--
 arch/x86/events/intel/uncore.c   |  80 +++
 arch/x86/events/intel/uncore.h   |   4 +-
 arch/x86/events/intel/uncore_snbep.c |   4 +-
 arch/x86/include/asm/processor.h |   4 +-
 arch/x86/include/asm/smp.h   |   1 +
 arch/x86/include/asm/topology.h  |  17 
 arch/x86/kernel/cpu/common.c |   1 +
 arch/x86/kernel/cpu/topology.c   |  88 +
 arch/x86/kernel/smpboot.c|  69 +
 arch/x86/xen/smp_pv.c|   1 +
 drivers/base/topology.c  |  22 +
 drivers/hwmon/coretemp.c |  36 +++
 drivers/powercap/intel_rapl.c|  75 +++---
 drivers/thermal/intel/x86_pkg_temp_thermal.c | 142 ++-
 include/linux/topology.h |   6 ++
 19 files changed, 422 insertions(+), 214 deletions(-)




[PATCH 11/19] topology: Create package_cpus sysfs attribute

2019-05-13 Thread Len Brown
From: Len Brown 

The existing sysfs cpu/topology/core_siblings (and core_siblings_list)
attributes are documented, implemented, and used by programs to
represent the set of logical CPUs sharing the same package.

This makes sense if the next topology level above a core
is always a package.  But on systems where there is a die
topology level between a core and a package, the name
and its definition become inconsistent.

So without changing its function, add a name for this map
that describes what it actually is -- package CPUs --
the set of CPUs that share the same package.

This new name will be immune to changes in topology, since
it describes threads at the current level, not siblings
at a contained level.
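
As a quick illustration, user-space can consume the new attribute like
this. (A sketch, not part of the patch; the sysfs path follows the
Documentation/cputopology.txt layout.)

	#include <stdio.h>

	int main(void)
	{
		char buf[256];
		FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/package_cpus_list", "r");

		if (!f) {
			perror("package_cpus_list");
			return 1;
		}
		if (fgets(buf, sizeof(buf), f))
			printf("CPUs in cpu0's package: %s", buf);
		fclose(f);
		return 0;
	}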

Signed-off-by: Len Brown 
Suggested-by: Brice Goglin 
---
 Documentation/cputopology.txt | 12 ++--
 drivers/base/topology.c   |  6 ++
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index 2ff8a1e9a2db..48af5c290e20 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -46,15 +46,15 @@ thread_siblings_list:
human-readable list of cpuX's hardware threads within the same
core as cpuX.
 
-core_siblings:
+package_cpus:
 
-   internal kernel map of cpuX's hardware threads within the same
-   physical_package_id.
+   internal kernel map of the CPUs sharing the same physical_package_id.
+   (deprecated name: "core_siblings")
 
-core_siblings_list:
+package_cpus_list:
 
-   human-readable list of cpuX's hardware threads within the same
-   physical_package_id.
+   human-readable list of CPUs sharing the same physical_package_id.
+   (deprecated name: "core_siblings_list")
 
 book_siblings:
 
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 50352cf96f85..dc3c19b482f3 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -57,6 +57,10 @@ define_siblings_show_func(core_siblings, core_cpumask);
 static DEVICE_ATTR_RO(core_siblings);
 static DEVICE_ATTR_RO(core_siblings_list);
 
+define_siblings_show_func(package_cpus, core_cpumask);
+static DEVICE_ATTR_RO(package_cpus);
+static DEVICE_ATTR_RO(package_cpus_list);
+
 #ifdef CONFIG_SCHED_BOOK
 define_id_show_func(book_id);
 static DEVICE_ATTR_RO(book_id);
@@ -81,6 +85,8 @@ static struct attribute *default_attrs[] = {
&dev_attr_thread_siblings_list.attr,
&dev_attr_core_siblings.attr,
&dev_attr_core_siblings_list.attr,
+   &dev_attr_package_cpus.attr,
+   &dev_attr_package_cpus_list.attr,
 #ifdef CONFIG_SCHED_BOOK
&dev_attr_book_id.attr,
&dev_attr_book_siblings.attr,
-- 
2.18.0-rc0



[PATCH 04/19] x86 topology: Define topology_die_id()

2019-05-13 Thread Len Brown
From: Len Brown 

topology_die_id(cpu) is a simple macro for use inside the kernel
to get the die_id associated with the given cpu.
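
For example, a driver could log a CPU's position in the topology by
combining the new macro with the existing accessors. (Illustrative
sketch only; report_cpu_position() is a hypothetical helper, not part
of this patch.)

	#include <linux/printk.h>
	#include <linux/topology.h>

	/* Hypothetical: print package/die/core coordinates of @cpu */
	static void report_cpu_position(int cpu)
	{
		pr_info("cpu%d: package %d, die %d, core %d\n", cpu,
			topology_physical_package_id(cpu),
			topology_die_id(cpu),
			topology_core_id(cpu));
	}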

Signed-off-by: Len Brown 
---
 arch/x86/include/asm/topology.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index e0232f7042c3..3777dbe9c0ff 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)   (cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)  (cpu_data(cpu).phys_proc_id)
+#define topology_die_id(cpu)   (cpu_data(cpu).cpu_die_id)
 #define topology_core_id(cpu)  (cpu_data(cpu).cpu_core_id)
 
 #ifdef CONFIG_SMP
-- 
2.18.0-rc0



[PATCH 07/19] powercap/intel_rapl: Support multi-die/package

2019-05-13 Thread Len Brown
From: Zhang Rui 

RAPL "package" domains are actually implemented in hardware per-die.
Thus, the new multi-die/package systems have multiple domains
within each physical package.

Update the intel_rapl driver to be "die aware" -- exporting multiple
domains within a single package, when present.
No change on single die/package systems.

Signed-off-by: Zhang Rui 
Signed-off-by: Len Brown 
Acked-by: Rafael J. Wysocki 
Cc: linux...@vger.kernel.org
---
 drivers/powercap/intel_rapl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c
index 3c3c0c23180b..9202dbcef96d 100644
--- a/drivers/powercap/intel_rapl.c
+++ b/drivers/powercap/intel_rapl.c
@@ -266,7 +266,7 @@ static struct rapl_domain *platform_rapl_domain; /* 
Platform (PSys) domain */
 /* caller to ensure CPU hotplug lock is held */
 static struct rapl_package *rapl_find_package_domain(int cpu)
 {
-   int id = topology_physical_package_id(cpu);
+   int id = topology_logical_die_id(cpu);
struct rapl_package *rp;
 
list_for_each_entry(rp, &rapl_packages, plist) {
@@ -1459,7 +1459,7 @@ static void rapl_remove_package(struct rapl_package *rp)
 /* called from CPU hotplug notifier, hotplug lock held */
 static struct rapl_package *rapl_add_package(int cpu)
 {
-   int id = topology_physical_package_id(cpu);
+   int id = topology_logical_die_id(cpu);
struct rapl_package *rp;
int ret;
 
-- 
2.18.0-rc0



Re: [PATCH 21/22] perf/x86/intel/uncore: renames in response to multi-die/pkg support

2019-05-13 Thread Len Brown
On Thu, May 9, 2019 at 11:02 AM Liang, Kan  wrote:

>
> I think the "box" terminology in perf uncore has different meaning. It
> stands for an uncore PMU unit on a socket/die.
> I think it may be better use "die" to replace the "pkg".
> How about the patch as below?

Also fine with me.
And I've replaced my rename patch with yours here too.

thanks,
Len Brown, Intel Open Source Technology Center


Re: [PATCH 22/22] perf/x86/intel/rapl: rename internal variables in response to multi-die/pkg support

2019-05-13 Thread Len Brown
On Thu, May 9, 2019 at 11:04 AM Liang, Kan  wrote:

> The perf rapl "pmu" in the code is cross the pkg/die. We only register
> one rapl pmu for whole system for now.
> I think it may be better use "die" to replace the "pkg" as well.
> How about the patch as below?

Fine with me!
I've replaced my patch with yours in the series.

thanks,
Len Brown, Intel Open Source Technology Center


Re: [PATCH 16/22] perf/x86/intel/uncore: Support multi-die/package

2019-05-08 Thread Len Brown
On Tue, May 7, 2019 at 8:22 AM Peter Zijlstra  wrote:
>
> On Mon, May 06, 2019 at 05:26:11PM -0400, Len Brown wrote:
> > @@ -1223,7 +1225,7 @@ static int uncore_event_cpu_online(unsigned int cpu)
> >   struct intel_uncore_box *box;
> >   int i, ret, pkg, target;
> >
> > - pkg = topology_logical_package_id(cpu);
> > + pkg = topology_logical_die_id(cpu);
> >   ret = allocate_boxes(types, pkg, cpu);
> >   if (ret)
> >   return ret;
>
> 'pkg' != 'die'

To keep the patch with this functional change minimal and obvious,
the rename of this variable was segregated into patch 22, which
contains renames only.

Len Brown, Intel Open Source Technology Center


Re: [PATCH 16/22] perf/x86/intel/uncore: Support multi-die/package

2019-05-08 Thread Len Brown
On Tue, May 7, 2019 at 8:21 AM Peter Zijlstra  wrote:
>
> On Mon, May 06, 2019 at 05:26:11PM -0400, Len Brown wrote:
> > @@ -1411,7 +1413,7 @@ static int __init intel_uncore_init(void)
> >   if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
> >   return -ENODEV;
> >
> > - max_packages = topology_max_packages();
> > + max_packages = topology_max_packages() * 
> > topology_max_die_per_package();
>
> That's max_dies..

To keep the patch with this functional change minimal and obvious,
the rename of this variable was segregated into patch 22, which
contains renames only.

Len Brown, Intel Open Source Technology Center


Re: [PATCH 10/22] powercap/intel_rapl: Support multi-die/package

2019-05-08 Thread Len Brown
On Tue, May 7, 2019 at 8:15 AM Peter Zijlstra  wrote:
>
> On Mon, May 06, 2019 at 05:26:05PM -0400, Len Brown wrote:
> > From: Zhang Rui 
> >
> > RAPL "package" domains are actually implemented in hardware per-die.
> > Thus, the new multi-die/package systems have multiple domains
> > within each physical package.
> >
> > Update the intel_rapl driver to be "die aware" -- exporting multiple
> > domains within a single package, when present.
> > No change on single die/package systems.
> >
> > Signed-off-by: Zhang Rui 
> > Signed-off-by: Len Brown 
> > Acked-by: Rafael J. Wysocki 
> > Cc: linux...@vger.kernel.org
> > ---
> >  drivers/powercap/intel_rapl.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c
> > index 3c3c0c23180b..9202dbcef96d 100644
> > --- a/drivers/powercap/intel_rapl.c
> > +++ b/drivers/powercap/intel_rapl.c
> > @@ -266,7 +266,7 @@ static struct rapl_domain *platform_rapl_domain; /* 
> > Platform (PSys) domain */
> >  /* caller to ensure CPU hotplug lock is held */
> >  static struct rapl_package *rapl_find_package_domain(int cpu)
> >  {
> > - int id = topology_physical_package_id(cpu);
> > + int id = topology_logical_die_id(cpu);
> >   struct rapl_package *rp;

> Both functions are still misnomers. rapl_find_package_domain() does in
> fact now do rapl_find_die_domain(), right? Same for rapl_add_package()

A "RAPL Package Domain" (rapl_package, above) is a known proper noun --
it is a known documented capability.

When there could be just 1 die in a package, the name of this capability
also always matched the scope of a physical package.

Now that some products have two die in the same package, there
can be two of these inside a package, and they are associated with
each die.

There are no plans to re-name the Package RAPL Domain capability
in the hardware documentation.

Similarly, there are no plans to re-name any of the other "PACKAGE"
scoped MSRs to have "DIE" in their name instead.  The ship with
those names sailed long ago.

I think the code above reflects its function, and that somebody maintaining
it will be clear on this.  That is important, because in the future, there will
be a concept of PACKAGE scoped MSRs that span multiple DIE...

cheers,
Len Brown, Intel Open Source Technology Center


[PATCH 04/22] x86 topology: Add CPUID.1F multi-die/package support

2019-05-06 Thread Len Brown
From: Len Brown 

Some new systems have multiple software-visible die within each package.

Update Linux parsing of the Intel CPUID "Extended Topology Leaf"
to handle either CPUID.B, or the new CPUID.1F.

Add cpuinfo_x86.die_id and cpuinfo_x86.max_dies to store the result.

die_id will be non-zero only for multi-die/package systems.
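
The enumeration can also be inspected from user-space. Here is a
hedged sketch using GCC's <cpuid.h>; the field masks mirror the
LEAFB_SUBTYPE/BITS_SHIFT_NEXT_LEVEL/LEVEL_MAX_SIBLINGS macros in the
patch below.

	#include <cpuid.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx, sub;

		for (sub = 0; ; sub++) {
			if (!__get_cpuid_count(0x1f, sub, &eax, &ebx, &ecx, &edx))
				break;			/* leaf 0x1f not supported */
			if (((ecx >> 8) & 0xff) == 0)	/* INVALID_TYPE: done */
				break;
			printf("subleaf %u: level-type %u, shift %u, count %u\n",
			       sub, (ecx >> 8) & 0xff, eax & 0x1f, ebx & 0xffff);
		}
		return 0;
	}

Level-type 1 is SMT, 2 is Core, and 5 is the new Die level.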

Signed-off-by: Len Brown 
Cc: linux-...@vger.kernel.org
---
 Documentation/x86/topology.txt   |  4 ++
 arch/x86/include/asm/processor.h |  4 +-
 arch/x86/kernel/cpu/topology.c   | 85 +---
 arch/x86/kernel/smpboot.c|  2 +
 4 files changed, 75 insertions(+), 20 deletions(-)

diff --git a/Documentation/x86/topology.txt b/Documentation/x86/topology.txt
index 06b3cdbc4048..8107b6cfc9ea 100644
--- a/Documentation/x86/topology.txt
+++ b/Documentation/x86/topology.txt
@@ -46,6 +46,10 @@ The topology of a system is described in the units of:
 
 The number of cores in a package. This information is retrieved via CPUID.
 
+  - cpuinfo_x86.x86_max_dies:
+
+The number of dies in a package. This information is retrieved via CPUID.
+
   - cpuinfo_x86.phys_proc_id:
 
 The physical ID of the package. This information is retrieved via CPUID
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2bb3a648fc12..2507edc30cc2 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -105,7 +105,8 @@ struct cpuinfo_x86 {
int x86_power;
unsigned long   loops_per_jiffy;
/* cpuid returned max cores value: */
-   u16  x86_max_cores;
+   u16 x86_max_cores;
+   u16 x86_max_dies;
u16 apicid;
u16 initial_apicid;
u16 x86_clflush_size;
@@ -117,6 +118,7 @@ struct cpuinfo_x86 {
u16 logical_proc_id;
/* Core id: */
u16 cpu_core_id;
+   u16 cpu_die_id;
/* Index into per_cpu list: */
u16 cpu_index;
u32 microcode;
diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index 8f6c784141d1..4d17e699657d 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -15,33 +15,63 @@
 /* leaf 0xb SMT level */
 #define SMT_LEVEL  0
 
-/* leaf 0xb sub-leaf types */
+/* extended topology sub-leaf types */
 #define INVALID_TYPE   0
 #define SMT_TYPE   1
 #define CORE_TYPE  2
+#define DIE_TYPE   5
 
 #define LEAFB_SUBTYPE(ecx) (((ecx) >> 8) & 0xff)
 #define BITS_SHIFT_NEXT_LEVEL(eax) ((eax) & 0x1f)
#define LEVEL_MAX_SIBLINGS(ebx)((ebx) & 0xffff)
 
-int detect_extended_topology_early(struct cpuinfo_x86 *c)
-{
 #ifdef CONFIG_SMP
+/*
+ * Check if given CPUID extended topology "leaf" is implemented
+ */
+static int check_extended_topology_leaf(int leaf)
+{
unsigned int eax, ebx, ecx, edx;
 
-   if (c->cpuid_level < 0xb)
+   cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
+
+   if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE))
return -1;
 
-   cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
+   return 0;
+}
+/*
+ * Return best CPUID Extended Toplogy Leaf supported
+ */
+static int detect_extended_topology_leaf(struct cpuinfo_x86 *c)
+{
+   if (c->cpuid_level >= 0x1f) {
+   if (check_extended_topology_leaf(0x1f) == 0)
+   return 0x1f;
+   }
 
-   /*
-* check if the cpuid leaf 0xb is actually implemented.
-*/
-   if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE))
+   if (c->cpuid_level >= 0xb) {
+   if (check_extended_topology_leaf(0xb) == 0)
+   return 0xb;
+   }
+
+   return -1;
+}
+#endif
+
+int detect_extended_topology_early(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_SMP
+   unsigned int eax, ebx, ecx, edx;
+   int leaf;
+
+   leaf = detect_extended_topology_leaf(c);
+   if (leaf < 0)
return -1;
 
set_cpu_cap(c, X86_FEATURE_XTOPOLOGY);
 
+   cpuid_count(leaf, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
/*
 * initial apic id, which also represents 32-bit extended x2apic id.
 */
@@ -52,7 +82,7 @@ int detect_extended_topology_early(struct cpuinfo_x86 *c)
 }
 
 /*
- * Check for extended topology enumeration cpuid leaf 0xb and if it
+ * Check for extended topology enumeration cpuid leaf, and if it
  * exists, use it for populating initial_apicid and cpu topology
  * detection.
  */
@@ -60,22 +90,28 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
 {
 #ifdef CONFIG_SMP
unsigned int eax, ebx, ecx, edx, sub_index;
-   unsigned int ht_mask_width, core_plus_mask_width;
+   unsigned int ht_mask_width, core_plus_mask_width,

[PATCH 05/22] x86 topology: Create topology_max_die_per_package()

2019-05-06 Thread Len Brown
From: Len Brown 

topology_max_packages() is available to size resources to
cover all packages in the system.

But now we have multi-die/package systems, and some
resources are per-die.

Create topology_max_die_per_package(), for detecting
multi-die/package systems, and sizing any per-die resources.
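
The intended usage pattern looks like this. (A sketch, assuming a
hypothetical per-die driver; it mirrors what the later driver patches
in this series do.)

	#include <linux/init.h>
	#include <linux/slab.h>
	#include <linux/topology.h>

	struct my_die_state;			/* hypothetical per-die data */
	static struct my_die_state **die_state;	/* one pointer per die */
	static int max_dies;

	static int __init my_driver_init(void)
	{
		max_dies = topology_max_packages() *
			   topology_max_die_per_package();
		die_state = kcalloc(max_dies, sizeof(*die_state), GFP_KERNEL);
		if (!die_state)
			return -ENOMEM;
		return 0;
	}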

Signed-off-by: Len Brown 
---
 arch/x86/include/asm/processor.h |  1 -
 arch/x86/include/asm/topology.h  | 10 ++
 arch/x86/kernel/cpu/topology.c   |  5 -
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2507edc30cc2..5f45488b1a9d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -106,7 +106,6 @@ struct cpuinfo_x86 {
unsigned long   loops_per_jiffy;
/* cpuid returned max cores value: */
u16 x86_max_cores;
-   u16 x86_max_dies;
u16 apicid;
u16 initial_apicid;
u16 x86_clflush_size;
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38a1c33..e0232f7042c3 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -115,6 +115,13 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()(__max_logical_packages)
 
+extern unsigned int __max_die_per_package;
+
+static inline int topology_max_die_per_package(void)
+{
+   return __max_die_per_package;
+}
+
 extern int __max_smt_threads;
 
 static inline int topology_max_smt_threads(void)
@@ -131,6 +138,9 @@ bool topology_smt_supported(void);
 static inline int
 topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; 
}
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
+static inline int topology_phys_to_logical_die(unsigned int die,
+   unsigned int cpu) { return 0; }
+static inline int topology_max_die_per_package(void) { return 1; }
 static inline int topology_max_smt_threads(void) { return 1; }
 static inline bool topology_is_primary_thread(unsigned int cpu) { return true; 
}
 static inline bool topology_smt_supported(void) { return false; }
diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index 4d17e699657d..ee48c3fc8a65 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -26,6 +26,9 @@
#define LEVEL_MAX_SIBLINGS(ebx)((ebx) & 0xffff)
 
 #ifdef CONFIG_SMP
+unsigned int __max_die_per_package __read_mostly = 1;
+EXPORT_SYMBOL(__max_die_per_package);
+
 /*
 * Check if given CPUID extended topology "leaf" is implemented
  */
@@ -146,7 +149,7 @@ int detect_extended_topology(struct cpuinfo_x86 *c)
c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
 
c->x86_max_cores = (core_level_siblings / smp_num_siblings);
-   c->x86_max_dies = (die_level_siblings / core_level_siblings);
+   __max_die_per_package = (die_level_siblings / core_level_siblings);
 #endif
return 0;
 }
-- 
2.18.0-rc0



[PATCH 08/22] x86 topology: Define topology_logical_die_id()

2019-05-06 Thread Len Brown
From: Len Brown 

Define topology_logical_die_id() ala
existing topology_logical_package_id()

Tested-by: Zhang Rui 
Signed-off-by: Len Brown 
---
 arch/x86/include/asm/processor.h |  1 +
 arch/x86/include/asm/topology.h  |  5 
 arch/x86/kernel/cpu/common.c |  1 +
 arch/x86/kernel/smpboot.c| 45 
 4 files changed, 52 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 5f45488b1a9d..def963d0b805 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -118,6 +118,7 @@ struct cpuinfo_x86 {
/* Core id: */
u16 cpu_core_id;
u16 cpu_die_id;
+   u16 logical_die_id;
/* Index into per_cpu list: */
u16 cpu_index;
u32 microcode;
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 3777dbe9c0ff..9de16b4f6023 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)   (cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)  (cpu_data(cpu).phys_proc_id)
+#define topology_logical_die_id(cpu)   (cpu_data(cpu).logical_die_id)
 #define topology_die_id(cpu)   (cpu_data(cpu).cpu_die_id)
 #define topology_core_id(cpu)  (cpu_data(cpu).cpu_core_id)
 
@@ -131,13 +132,17 @@ static inline int topology_max_smt_threads(void)
 }
 
 int topology_update_package_map(unsigned int apicid, unsigned int cpu);
+int topology_update_die_map(unsigned int dieid, unsigned int cpu);
 int topology_phys_to_logical_pkg(unsigned int pkg);
+int topology_phys_to_logical_die(unsigned int die, unsigned int cpu);
 bool topology_is_primary_thread(unsigned int cpu);
 bool topology_smt_supported(void);
 #else
 #define topology_max_packages()(1)
 static inline int
 topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; 
}
+static inline int
+topology_update_die_map(unsigned int dieid, unsigned int cpu) { return 0; }
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
 static inline int topology_phys_to_logical_die(unsigned int die,
unsigned int cpu) { return 0; }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cb28e98a0659..24f96c9771af 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1285,6 +1285,7 @@ static void validate_apic_and_package_id(struct 
cpuinfo_x86 *c)
   cpu, apicid, c->initial_apicid);
}
BUG_ON(topology_update_package_map(c->phys_proc_id, cpu));
+   BUG_ON(topology_update_die_map(c->cpu_die_id, cpu));
 #else
c->logical_proc_id = 0;
 #endif
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 05f9cfdddffd..a114375e14f7 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,7 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);
 unsigned int __max_logical_packages __read_mostly;
 EXPORT_SYMBOL(__max_logical_packages);
 static unsigned int logical_packages __read_mostly;
+static unsigned int logical_die __read_mostly;
 
 /* Maximum number of SMT threads on any online core */
 int __read_mostly __max_smt_threads = 1;
@@ -302,6 +303,26 @@ int topology_phys_to_logical_pkg(unsigned int phys_pkg)
return -1;
 }
 EXPORT_SYMBOL(topology_phys_to_logical_pkg);
+/**
+ * topology_phys_to_logical_die - Map a physical die id to logical
+ *
+ * Returns logical die id or -1 if not found
+ */
+int topology_phys_to_logical_die(unsigned int die_id, unsigned int cur_cpu)
+{
+   int cpu;
+   int proc_id = cpu_data(cur_cpu).phys_proc_id;
+
+   for_each_possible_cpu(cpu) {
struct cpuinfo_x86 *c = &cpu_data(cpu);
+
+   if (c->initialized && c->cpu_die_id == die_id &&
+   c->phys_proc_id == proc_id)
+   return c->logical_die_id;
+   }
+   return -1;
+}
+EXPORT_SYMBOL(topology_phys_to_logical_die);
 
 /**
  * topology_update_package_map - Update the physical to logical package map
@@ -326,6 +347,29 @@ int topology_update_package_map(unsigned int pkg, unsigned 
int cpu)
cpu_data(cpu).logical_proc_id = new;
return 0;
 }
+/**
+ * topology_update_die_map - Update the physical to logical die map
+ * @die:   The die id as retrieved via CPUID
+ * @cpu:   The cpu for which this is updated
+ */
+int topology_update_die_map(unsigned int die, unsigned int cpu)
+{
+   int new;
+
+   /* Already available somewhere? */
+   new = topology_phys_to_logical_die(die, cpu);
+   if (new >= 0)
+   goto found;
+
+   new = logical_die++;
+   if (new != die) {

[PATCH 15/22] topology: Create core_cpus and die_cpus sysfs attributes

2019-05-06 Thread Len Brown
From: Len Brown 

Create CPU topology sysfs attributes: "core_cpus" and "core_cpus_list"

These attributes represent all of the logical CPUs that share the
same core.

These attributes are synonymous with the existing "thread_siblings" and
"thread_siblings_list" attributes, which will be deprecated.

Create CPU topology sysfs attributes: "die_cpus" and "die_cpus_list".
These attributes represent all of the logical CPUs that share the
same die.
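
A small user-space sketch that compares the two lists for cpu0 -- on a
single-die part they should be identical. (Paths per
Documentation/cputopology.txt; not part of the patch.)

	#include <stdio.h>
	#include <string.h>

	static int read_line(const char *path, char *buf, size_t len)
	{
		FILE *f = fopen(path, "r");

		if (!f)
			return -1;
		if (!fgets(buf, len, f)) {
			fclose(f);
			return -1;
		}
		fclose(f);
		return 0;
	}

	int main(void)
	{
		char pkg[256], die[256];

		if (read_line("/sys/devices/system/cpu/cpu0/topology/package_cpus_list",
			      pkg, sizeof(pkg)) ||
		    read_line("/sys/devices/system/cpu/cpu0/topology/die_cpus_list",
			      die, sizeof(die)))
			return 1;
		printf("package: %s", pkg);
		printf("die:     %s", die);
		printf("multi-die package: %s\n", strcmp(pkg, die) ? "yes" : "no");
		return 0;
	}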

Signed-off-by: Len Brown 
Suggested-by: Brice Goglin 
---
 Documentation/cputopology.txt   | 21 +++--
 arch/x86/include/asm/smp.h  |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c   | 22 ++
 arch/x86/xen/smp_pv.c   |  1 +
 drivers/base/topology.c | 12 
 include/linux/topology.h|  3 +++
 7 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index 48af5c290e20..b90dafcc8237 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -36,15 +36,15 @@ drawer_id:
identifier (rather than the kernel's).  The actual value is
architecture and platform dependent.
 
-thread_siblings:
+core_cpus:
 
-   internal kernel map of cpuX's hardware threads within the same
-   core as cpuX.
+   internal kernel map of CPUs within the same core.
+   (deprecated name: "thread_siblings")
 
-thread_siblings_list:
+core_cpus_list:
 
-   human-readable list of cpuX's hardware threads within the same
-   core as cpuX.
+   human-readable list of CPUs within the same core.
+   (deprecated name: "thread_siblings_list");
 
 package_cpus:
 
@@ -56,6 +56,14 @@ package_cpus_list:
human-readable list of CPUs sharing the same physical_package_id.
(deprecated name: "core_siblings_list")
 
+die_cpus:
+
+   internal kernel map of CPUs within the same die.
+
+die_cpus_list:
+
+   human-readable list of CPUs within the same die.
+
 book_siblings:
 
internal kernel map of cpuX's hardware threads within the same
@@ -93,6 +101,7 @@ these macros in include/asm-XXX/topology.h::
#define topology_drawer_id(cpu)
#define topology_sibling_cpumask(cpu)
#define topology_core_cpumask(cpu)
+   #define topology_die_cpumask(cpu)
#define topology_book_cpumask(cpu)
#define topology_drawer_cpumask(cpu)
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 2e95b6c1bca3..39266d193597 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -23,6 +23,7 @@ extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9de16b4f6023..4b14d2318251 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #define topology_core_id(cpu)  (cpu_data(cpu).cpu_core_id)
 
 #ifdef CONFIG_SMP
+#define topology_die_cpumask(cpu)  (per_cpu(cpu_die_map, cpu))
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index a114375e14f7..48a671443266 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -91,6 +91,10 @@ EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 
+/* representing HT, core, and die siblings of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
+EXPORT_PER_CPU_SYMBOL(cpu_die_map);
+
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 
 /* Per CPU bogomips and other parameters */
@@ -509,6 +513,15 @@ static bool match_pkg(struct cpuinfo_x86 *c, struct 
cpuinfo_x86 *o)
return false;
 }
 
+static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+   if ((c->phys_proc_id == o->phys_proc_id) &&
+   (c->cpu_die_id == o->cpu_die_id))
+   return true;
+   return false;
+}
+
+
 #if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC)
 static inline int x86_sched_itmt_flags(void)
 {
@@ -571,6 +584,7 @@ void set_cpu_sibling_map(int cpu)
cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
c

[PATCH 03/22] x86 smpboot: Rename match_die() to match_pkg()

2019-05-06 Thread Len Brown
From: Len Brown 

Syntax only, no functional or semantic change.

This routine matches packages, not die, so name it thus.

Signed-off-by: Len Brown 
---
 arch/x86/kernel/smpboot.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ce1a67b70168..3f8bbfb26c18 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -455,7 +455,7 @@ static bool match_llc(struct cpuinfo_x86 *c, struct 
cpuinfo_x86 *o)
  * multicore group inside a NUMA node.  If this happens, we will
  * discard the MC level of the topology later.
  */
-static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+static bool match_pkg(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
if (c->phys_proc_id == o->phys_proc_id)
return true;
@@ -546,7 +546,7 @@ void set_cpu_sibling_map(int cpu)
for_each_cpu(i, cpu_sibling_setup_mask) {
o = &cpu_data(i);
 
-   if ((i == cpu) || (has_mp && match_die(c, o))) {
+   if ((i == cpu) || (has_mp && match_pkg(c, o))) {
link_mask(topology_core_cpumask, cpu, i);
 
/*
@@ -570,7 +570,7 @@ void set_cpu_sibling_map(int cpu)
} else if (i != cpu && !c->booted_cores)
c->booted_cores = cpu_data(i).booted_cores;
}
-   if (match_die(c, o) && !topology_same_node(c, o))
+   if (match_pkg(c, o) && !topology_same_node(c, o))
x86_has_numa_in_package = true;
}
 
-- 
2.18.0-rc0



[PATCH 09/22] powercap/intel_rapl: Simplify rapl_find_package()

2019-05-06 Thread Len Brown
From: Zhang Rui 

Syntax only, no functional or semantic change.

Simplify how the code to discover a package is called.
Rename find_package_by_id() to rapl_find_package_domain()

Signed-off-by: Zhang Rui 
Signed-off-by: Len Brown 
Acked-by: Rafael J. Wysocki 
Cc: linux...@vger.kernel.org
---
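
Illustration only, not part of the patch -- the change in calling
convention amounts to:

	/* before: every caller resolved the package id itself */
	rp = find_package_by_id(topology_physical_package_id(cpu));

	/* after: callers pass the CPU; the helper resolves the id */
	rp = rapl_find_package_domain(cpu);
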
 drivers/powercap/intel_rapl.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c
index 4347f15165f8..3c3c0c23180b 100644
--- a/drivers/powercap/intel_rapl.c
+++ b/drivers/powercap/intel_rapl.c
@@ -264,8 +264,9 @@ static struct powercap_control_type *control_type; /* PowerCap Controller */
 static struct rapl_domain *platform_rapl_domain; /* Platform (PSys) domain */
 
 /* caller to ensure CPU hotplug lock is held */
-static struct rapl_package *find_package_by_id(int id)
+static struct rapl_package *rapl_find_package_domain(int cpu)
 {
+   int id = topology_physical_package_id(cpu);
struct rapl_package *rp;
 
list_for_each_entry(rp, &rapl_packages, plist) {
@@ -1300,7 +1301,7 @@ static int __init rapl_register_psys(void)
rd->rpl[0].name = pl1_name;
rd->rpl[1].prim_id = PL2_ENABLE;
rd->rpl[1].name = pl2_name;
-   rd->rp = find_package_by_id(0);
+   rd->rp = rapl_find_package_domain(0);
 
power_zone = powercap_register_zone(&rd->power_zone, control_type,
"psys", NULL,
@@ -1456,8 +1457,9 @@ static void rapl_remove_package(struct rapl_package *rp)
 }
 
 /* called from CPU hotplug notifier, hotplug lock held */
-static struct rapl_package *rapl_add_package(int cpu, int pkgid)
+static struct rapl_package *rapl_add_package(int cpu)
 {
+   int id = topology_physical_package_id(cpu);
struct rapl_package *rp;
int ret;
 
@@ -1466,7 +1468,7 @@ static struct rapl_package *rapl_add_package(int cpu, int pkgid)
return ERR_PTR(-ENOMEM);
 
/* add the new package to the list */
-   rp->id = pkgid;
+   rp->id = id;
rp->lead_cpu = cpu;
 
/* check if the package contains valid domains */
@@ -1497,12 +1499,11 @@ static struct rapl_package *rapl_add_package(int cpu, int pkgid)
  */
 static int rapl_cpu_online(unsigned int cpu)
 {
-   int pkgid = topology_physical_package_id(cpu);
struct rapl_package *rp;
 
-   rp = find_package_by_id(pkgid);
+   rp = rapl_find_package_domain(cpu);
if (!rp) {
-   rp = rapl_add_package(cpu, pkgid);
+   rp = rapl_add_package(cpu);
if (IS_ERR(rp))
return PTR_ERR(rp);
}
@@ -1512,11 +1513,10 @@ static int rapl_cpu_online(unsigned int cpu)
 
 static int rapl_cpu_down_prep(unsigned int cpu)
 {
-   int pkgid = topology_physical_package_id(cpu);
struct rapl_package *rp;
int lead_cpu;
 
-   rp = find_package_by_id(pkgid);
+   rp = rapl_find_package_domain(cpu);
if (!rp)
return 0;
 
-- 
2.18.0-rc0



[PATCH 11/22] thermal/x86_pkg_temp_thermal: Support multi-die/package

2019-05-06 Thread Len Brown
From: Zhang Rui 

Package temperature sensors are actually implemented in hardware per-die.
Thus, the new multi-die/package systems sport multiple package thermal
zones for each package.

Update the x86_pkg_temp_thermal driver to be "multi-die-aware", so it can
expose multiple zones per package, instead of just one.

Signed-off-by: Zhang Rui 
Signed-off-by: Len Brown 
---
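
Illustration only, not part of the patch -- with per-die zones, the
bookkeeping amounts to the sketch below; a hypothetical 2-package,
2-die-per-package system gets 4 thermal zones:

	/* worst case: every package carries the maximum die count */
	max_packages = topology_max_packages() *
		       topology_max_die_per_package();	/* e.g. 2 * 2 = 4 */

	/* a CPU maps to the zone for its die, not for its package */
	pkgdev = packages[topology_logical_die_id(cpu)];
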
 drivers/thermal/intel/x86_pkg_temp_thermal.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/thermal/intel/x86_pkg_temp_thermal.c b/drivers/thermal/intel/x86_pkg_temp_thermal.c
index 1ef937d799e4..405b3858900a 100644
--- a/drivers/thermal/intel/x86_pkg_temp_thermal.c
+++ b/drivers/thermal/intel/x86_pkg_temp_thermal.c
@@ -122,7 +122,7 @@ static int pkg_temp_debugfs_init(void)
  */
 static struct pkg_device *pkg_temp_thermal_get_dev(unsigned int cpu)
 {
-   int pkgid = topology_logical_package_id(cpu);
+   int pkgid = topology_logical_die_id(cpu);
 
if (pkgid >= 0 && pkgid < max_packages)
return packages[pkgid];
@@ -353,7 +353,7 @@ static int pkg_thermal_notify(u64 msr_val)
 
 static int pkg_temp_thermal_device_add(unsigned int cpu)
 {
-   int pkgid = topology_logical_package_id(cpu);
+   int pkgid = topology_logical_die_id(cpu);
u32 tj_max, eax, ebx, ecx, edx;
struct pkg_device *pkgdev;
int thres_count, err;
@@ -449,7 +449,7 @@ static int pkg_thermal_cpu_offline(unsigned int cpu)
 * worker will see the package anymore.
 */
if (lastcpu) {
-   packages[topology_logical_package_id(cpu)] = NULL;
+   packages[topology_logical_die_id(cpu)] = NULL;
/* After this point nothing touches the MSR anymore. */
wrmsr(MSR_IA32_PACKAGE_THERM_INTERRUPT,
  pkgdev->msr_pkg_therm_low, pkgdev->msr_pkg_therm_high);
@@ -515,7 +515,7 @@ static int __init pkg_temp_thermal_init(void)
if (!x86_match_cpu(pkg_temp_thermal_ids))
return -ENODEV;
 
-   max_packages = topology_max_packages();
+   max_packages = topology_max_packages() * topology_max_die_per_package();
packages = kcalloc(max_packages, sizeof(struct pkg_device *),
   GFP_KERNEL);
if (!packages)
-- 
2.18.0-rc0



[PATCH 20/22] hwmon/coretemp: rename internal variables to zones from packages

2019-05-06 Thread Len Brown
From: Len Brown 

Syntax update only -- no logical or functional change.

In response to the new multi-die/package changes, update variable names
to use the more generic thermal "zone" terminology instead of "package",
as a zone can refer to either a package or a die.

Signed-off-by: Len Brown 
Cc: Zhang Rui 
---
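
Illustration only, not part of the patch -- after this rename the
platform device id reads as a zone (logical die) id, so a hypothetical
2-package, 2-die-per-package system registers coretemp.0 through
coretemp.3:

	int zoneid = topology_logical_die_id(cpu);	/* zone == logical die */

	pdev = platform_device_alloc(DRVNAME, zoneid);	/* "coretemp.<zoneid>" */
	zone_devices[zoneid] = pdev;
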
 drivers/hwmon/coretemp.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/hwmon/coretemp.c b/drivers/hwmon/coretemp.c
index c64ce32d3214..4ebed90d81aa 100644
--- a/drivers/hwmon/coretemp.c
+++ b/drivers/hwmon/coretemp.c
@@ -109,10 +109,10 @@ struct platform_data {
struct device_attribute name_attr;
 };
 
-/* Keep track of how many package pointers we allocated in init() */
-static int max_packages __read_mostly;
-/* Array of package pointers. Serialized by cpu hotplug lock */
-static struct platform_device **pkg_devices;
+/* Keep track of how many zone pointers we allocated in init() */
+static int max_zones __read_mostly;
+/* Array of zone pointers. Serialized by cpu hotplug lock */
+static struct platform_device **zone_devices;
 
 static ssize_t show_label(struct device *dev,
struct device_attribute *devattr, char *buf)
@@ -435,10 +435,10 @@ static int chk_ucode_version(unsigned int cpu)
 
 static struct platform_device *coretemp_get_pdev(unsigned int cpu)
 {
-   int pkgid = topology_logical_die_id(cpu);
+   int id = topology_logical_die_id(cpu);
 
-   if (pkgid >= 0 && pkgid < max_packages)
-   return pkg_devices[pkgid];
+   if (id >= 0 && id < max_zones)
+   return zone_devices[id];
return NULL;
 }
 
@@ -544,7 +544,7 @@ static int coretemp_probe(struct platform_device *pdev)
struct device *dev = &pdev->dev;
struct platform_data *pdata;
 
-   /* Initialize the per-package data structures */
+   /* Initialize the per-zone data structures */
pdata = devm_kzalloc(dev, sizeof(struct platform_data), GFP_KERNEL);
if (!pdata)
return -ENOMEM;
@@ -579,13 +579,13 @@ static struct platform_driver coretemp_driver = {
 
 static struct platform_device *coretemp_device_add(unsigned int cpu)
 {
-   int err, pkgid = topology_logical_die_id(cpu);
+   int err, zoneid = topology_logical_die_id(cpu);
struct platform_device *pdev;
 
-   if (pkgid < 0)
+   if (zoneid < 0)
return ERR_PTR(-ENOMEM);
 
-   pdev = platform_device_alloc(DRVNAME, pkgid);
+   pdev = platform_device_alloc(DRVNAME, zoneid);
if (!pdev)
return ERR_PTR(-ENOMEM);
 
@@ -595,7 +595,7 @@ static struct platform_device *coretemp_device_add(unsigned int cpu)
return ERR_PTR(err);
}
 
-   pkg_devices[pkgid] = pdev;
+   zone_devices[zoneid] = pdev;
return pdev;
 }
 
@@ -703,7 +703,7 @@ static int coretemp_cpu_offline(unsigned int cpu)
 * the rest.
 */
if (cpumask_empty(&pd->cpumask)) {
-   pkg_devices[topology_logical_die_id(cpu)] = NULL;
+   zone_devices[topology_logical_die_id(cpu)] = NULL;
platform_device_unregister(pdev);
return 0;
}
@@ -741,10 +741,10 @@ static int __init coretemp_init(void)
if (!x86_match_cpu(coretemp_ids))
return -ENODEV;
 
-   max_packages = topology_max_packages() * topology_max_die_per_package();
-   pkg_devices = kcalloc(max_packages, sizeof(struct platform_device *),
+   max_zones = topology_max_packages() * topology_max_die_per_package();
+   zone_devices = kcalloc(max_zones, sizeof(struct platform_device *),
  GFP_KERNEL);
-   if (!pkg_devices)
+   if (!zone_devices)
return -ENOMEM;
 
err = platform_driver_register(&coretemp_driver);
@@ -760,7 +760,7 @@ static int __init coretemp_init(void)
 
 outdrv:
platform_driver_unregister(&coretemp_driver);
-   kfree(pkg_devices);
+   kfree(zone_devices);
return err;
 }
 module_init(coretemp_init)
@@ -769,7 +769,7 @@ static void __exit coretemp_exit(void)
 {
cpuhp_remove_state(coretemp_hp_online);
platform_driver_unregister(&coretemp_driver);
-   kfree(pkg_devices);
+   kfree(zone_devices);
 }
 module_exit(coretemp_exit)
 
-- 
2.18.0-rc0



[PATCH 18/22] perf/x86/intel/cstate: Support multi-die/package

2019-05-06 Thread Len Brown
From: Kan Liang 

Some cstate counters become die-scope on Xeon Cascade Lake-AP. The perf
cstate driver needs to support die-scope cstate counters.

Use topology_die_cpumask() to replace topology_core_cpumask().
On previous platforms, which don't have multi-die,
topology_die_cpumask() is identical to topology_core_cpumask().
There is no functional change for previous platforms.

Name the die-scope PMU "cstate_die".

Signed-off-by: Kan Liang 
Cc: Peter Zijlstra 
Signed-off-by: Len Brown 
---
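
Illustration only, not part of the patch -- the registration policy
being added is equivalent to:

	/* keep the legacy name on single-die parts, else "cstate_die" */
	const char *name = topology_max_die_per_package() > 1 ?
				"cstate_die" : cstate_pkg_pmu.name;

	err = perf_pmu_register(&cstate_pkg_pmu, name, -1);
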
 arch/x86/events/intel/cstate.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c
index 94a4b7fc75d0..52c5fea29457 100644
--- a/arch/x86/events/intel/cstate.c
+++ b/arch/x86/events/intel/cstate.c
@@ -302,7 +302,7 @@ static int cstate_pmu_event_init(struct perf_event *event)
return -EINVAL;
event->hw.event_base = pkg_msr[cfg].msr;
cpu = cpumask_any_and(&cstate_pkg_cpu_mask,
- topology_core_cpumask(event->cpu));
+ topology_die_cpumask(event->cpu));
} else {
return -ENOENT;
}
@@ -385,7 +385,7 @@ static int cstate_cpu_exit(unsigned int cpu)
if (has_cstate_pkg &&
cpumask_test_and_clear_cpu(cpu, &cstate_pkg_cpu_mask)) {
 
-   target = cpumask_any_but(topology_core_cpumask(cpu), cpu);
+   target = cpumask_any_but(topology_die_cpumask(cpu), cpu);
/* Migrate events if there is a valid target */
if (target < nr_cpu_ids) {
cpumask_set_cpu(target, &cstate_pkg_cpu_mask);
@@ -414,7 +414,7 @@ static int cstate_cpu_init(unsigned int cpu)
 * in the package cpu mask as the designated reader.
 */
target = cpumask_any_and(&cstate_pkg_cpu_mask,
-topology_core_cpumask(cpu));
+topology_die_cpumask(cpu));
if (has_cstate_pkg && target >= nr_cpu_ids)
cpumask_set_cpu(cpu, &cstate_pkg_cpu_mask);
 
@@ -661,7 +661,13 @@ static int __init cstate_init(void)
}
 
if (has_cstate_pkg) {
-   err = perf_pmu_register(&cstate_pkg_pmu, cstate_pkg_pmu.name, -1);
+   if (topology_max_die_per_package() > 1) {
+   err = perf_pmu_register(&cstate_pkg_pmu,
+   "cstate_die", -1);
+   } else {
+   err = perf_pmu_register(&cstate_pkg_pmu,
+   cstate_pkg_pmu.name, -1);
+   }
if (err) {
has_cstate_pkg = false;
pr_info("Failed to register cstate pkg pmu\n");
-- 
2.18.0-rc0



[PATCH 12/22] powercap/intel_rapl: update rapl domain name and debug messages

2019-05-06 Thread Len Brown
From: Zhang Rui 

The RAPL domain "name" attribute contains "Package-N",
which is ambiguous on multi-die per-package systems.

Update the name to "package-X-die-Y" on those systems.

No change on systems without multi-die/package.

Update driver debug messages.

Signed-off-by: Zhang Rui 
Signed-off-by: Len Brown 
Acked-by: Rafael J. Wysocki 
Cc: linux...@vger.kernel.org
---
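
Illustration only -- the naming scheme amounts to the sketch below. The
patch's actual formatting code is in a hunk beyond this excerpt, so the
exact helpers used here are an assumption:

	/* hypothetical sketch, not the patch's literal code */
	if (topology_max_die_per_package() > 1)
		snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH,
			 "package-%d-die-%d",
			 topology_physical_package_id(cpu),
			 topology_die_id(cpu));
	else
		snprintf(rp->name, PACKAGE_DOMAIN_NAME_LENGTH,
			 "package-%d", topology_physical_package_id(cpu));
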
 drivers/powercap/intel_rapl.c | 57 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/drivers/powercap/intel_rapl.c b/drivers/powercap/intel_rapl.c
index 9202dbcef96d..ad78c1d08260 100644
--- a/drivers/powercap/intel_rapl.c
+++ b/drivers/powercap/intel_rapl.c
@@ -178,12 +178,15 @@ struct rapl_domain {
 #define power_zone_to_rapl_domain(_zone) \
container_of(_zone, struct rapl_domain, power_zone)
 
+/* maximum rapl package domain name: package-%d-die-%d */
+#define PACKAGE_DOMAIN_NAME_LENGTH 30
 
-/* Each physical package contains multiple domains, these are the common
+
+/* Each rapl package contains multiple domains, these are the common
  * data across RAPL domains within a package.
  */
 struct rapl_package {
-   unsigned int id; /* physical package/socket id */
+   unsigned int id; /* logical die id, equals physical id on 1-die systems */
unsigned int nr_domains;
unsigned long domain_map; /* bit map of active domains */
unsigned int power_unit;
@@ -198,6 +201,7 @@ struct rapl_package {
int lead_cpu; /* one active cpu per package for access */
/* Track active cpus */
struct cpumask cpumask;
+   char name[PACKAGE_DOMAIN_NAME_LENGTH];
 };
 
 struct rapl_defaults {
@@ -926,8 +930,8 @@ static int rapl_check_unit_core(struct rapl_package *rp, int cpu)
value = (msr_val & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET;
rp->time_unit = 100 / (1 << value);
 
-   pr_debug("Core CPU package %d energy=%dpJ, time=%dus, power=%duW\n",
-   rp->id, rp->energy_unit, rp->time_unit, rp->power_unit);
+   pr_debug("Core CPU %s energy=%dpJ, time=%dus, power=%duW\n",
+   rp->name, rp->energy_unit, rp->time_unit, rp->power_unit);
 
return 0;
 }
@@ -951,8 +955,8 @@ static int rapl_check_unit_atom(struct rapl_package *rp, int cpu)
value = (msr_val & TIME_UNIT_MASK) >> TIME_UNIT_OFFSET;
rp->time_unit = 100 / (1 << value);
 
-   pr_debug("Atom package %d energy=%dpJ, time=%dus, power=%duW\n",
-   rp->id, rp->energy_unit, rp->time_unit, rp->power_unit);
+   pr_debug("Atom %s energy=%dpJ, time=%dus, power=%duW\n",
+   rp->name, rp->energy_unit, rp->time_unit, rp->power_unit);
 
return 0;
 }
@@ -1181,7 +1185,7 @@ static void rapl_update_domain_data(struct rapl_package *rp)
u64 val;
 
for (dmn = 0; dmn < rp->nr_domains; dmn++) {
-   pr_debug("update package %d domain %s data\n", rp->id,
+   pr_debug("update %s domain %s data\n", rp->name,
 rp->domains[dmn].name);
/* exclude non-raw primitives */
for (prim = 0; prim < NR_RAW_PRIMITIVES; prim++) {
@@ -1206,7 +1210,6 @@ static void rapl_unregister_powercap(void)
 static int rapl_package_register_powercap(struct rapl_package *rp)
 {
struct rapl_domain *rd;
-   char dev_name[17]; /* max domain name = 7 + 1 + 8 for int + 1 for null*/
struct powercap_zone *power_zone = NULL;
int nr_pl, ret;
 
@@ -1217,20 +1220,16 @@ static int rapl_package_register_powercap(struct rapl_package *rp)
for (rd = rp->domains; rd < rp->domains + rp->nr_domains; rd++) {
if (rd->id == RAPL_DOMAIN_PACKAGE) {
nr_pl = find_nr_power_limit(rd);
-   pr_debug("register socket %d package domain %s\n",
-   rp->id, rd->name);
-   memset(dev_name, 0, sizeof(dev_name));
-   snprintf(dev_name, sizeof(dev_name), "%s-%d",
-   rd->name, rp->id);
+   pr_debug("register package domain %s\n", rp->name);
power_zone = powercap_register_zone(&rd->power_zone,
control_type,
-   dev_name, NULL,
+   rp->name, NULL,
&zone_ops[rd->id],
nr_pl,
&constraint_ops);
if (IS_ERR(power_zone)) {
-   pr_debug("fai
