Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-03 Thread Paolo Bonzini


On 03/11/2015 15:35, Xiao Guangrong wrote:
>
> -if ((cr0 ^ old_cr0) & X86_CR0_CD)
> +if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED) &&
> +(cr0 ^ old_cr0) & X86_CR0_CD)
>   kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
>
>   return 0;
>>> (Honestly I just imitated fb279950ba here; I'm not making any better
>>> argument for this diff. But, independently, I wonder why this hunk
>>> didn't have the noncoherent DMA check either, originally.)
>>
>> Great job.  I look forward to the testing results.
>>
>> It should also have the noncoherent DMA check, in fact, though that's
>> just an optimization and it would have masked the bug on your system.
> 
> Hmm... but kvm_zap_gfn_range even other shadow page zapping operations
> are really usual behaviour in host - it depends on how we handle memory
> overcommit/memory-layout-change on host.
> 
> I doubt it is really a right way to go. kvm_zap_gfn_range() is just a time
> delay factor to trigger OVMF boot issue but there must have other factors
> have such delay too, for example, more vcpus in OVMF, high overload on
> host, etc.

But it's pointless if the quirk is enabled.  Also, bringing up APs will
cause heavy contention on mmu_lock as Laszlo pointed out.

Paolo
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-03 Thread Laszlo Ersek
On 11/03/15 15:58, Janusz Mocek wrote:
> W dniu 03.11.2015 o 14:31, Laszlo Ersek pisze:
>> On 11/03/15 13:34, Paolo Bonzini wrote:
>>>
>>> On 03/11/2015 02:14, Laszlo Ersek wrote:
 Anyway, with the following host kernel change, the AP startup problem
 goes away (tested on top of v4.3):

>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 9a9a198..4f978ad 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -622,7 +622,8 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long 
>> cr0)
>>  if ((cr0 ^ old_cr0) & update_bits)
>>  kvm_mmu_reset_context(vcpu);
>>
>> -if ((cr0 ^ old_cr0) & X86_CR0_CD)
>> +if (!kvm_check_has_quirk(vcpu->kvm, 
>> KVM_X86_QUIRK_CD_NW_CLEARED) &&
>> +(cr0 ^ old_cr0) & X86_CR0_CD)
>>  kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
>>
>>  return 0;
 (Honestly I just imitated fb279950ba here; I'm not making any better
 argument for this diff. But, independently, I wonder why this hunk
 didn't have the noncoherent DMA check either, originally.)
>>> Great job.  I look forward to the testing results.
>>>
>>> It should also have the noncoherent DMA check, in fact, though that's
>>> just an optimization and it would have masked the bug on your system.
>> Thank you! I'll wait for Janusz's results, and if stuff works, I'll post
>> the patch with the update you're suggesting (zap only if the guest has
>> noncoherent DMA and lacks the quirk).
>>
>> Laszlo
>>
>>> Thanks,
>>>
>>> Paolo
>>>
 Janusz, could you rebuild your host kernel with this patch, and share
 the results? (I'm also attaching the same as a formatted patch, so you
 can apply it with "git am" easily.)
> Looks like its working correctly now
> 

Awesome, thank you for testing it. I'll put together a patch.

Laszlo
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-03 Thread Andrew Fish

> On Nov 3, 2015, at 11:42 AM, Jordan Justen  wrote:
> 
> On 2015-11-03 05:45:52, Paolo Bonzini wrote:
>> 
>> 
>> On 03/11/2015 14:25, Laszlo Ersek wrote:
>>>  - Agreement between Paolo, Jordan and Mike about implementing
>>>broadcast SMIs. I am willing to code up whatever design is
>>>agreed upon. Can everyone involved please prioritize this
>>>discussion a little?
> 
> Obviously we'll have to follow what QEMU decides, so I'm not sure how
> much we can actually influence this.
> 
> Aside from the desire to better emulate chipset/platform designs of
> the actual hardware, I do have a related question.
> 

The SMI IPI is sent from a processor in SMM, so like all other hardware 
resources used in SMM the SMM handler needs to restore any state it changes on 
exit. The old PCI 0xCF8/0xCFC IO ports are the old school example of this need 
to restore state to be transparent to the OS. 

Funny old school story but this kind of issue cropped up with the IPMI in the 
1990's  as the hardware interface was an indexed IO registers, and the 
communications was a message (multiple bytes of data). It was not always 
possible to home the IPMI state machine and some times the SMM handler would 
bork the OS IPMI driver state. The final resolution was adding a 2nd set of 
registers dedicated to SMM so the IPMI device could probably track state in all 
cases.  


> If we use the local apic to initiate IPIs to other processors, what
> impact might that have on the state of the local apic if the OS is
> also trying to use it? For example, what if the OS is in the middle of
> trying to send an IPI?
> 
> I think real platforms get to nicely avoid this by having the chipset
> assert SMI to the hardware pins of all processors in the system. It
> looks like our code has a provision for having one processor send the
> others into SMM, but I wonder under which scenarios this is used, and
> how they deal with the local apic state in that case.
> 

I've seen the SMI IPI in code, but I think it is mostly used to deal with the 
case that the processor is sitting in an INIT and the broadcast SMI was 
ignored. 

Thanks,

Andrew Fish

>> Actually, I was *de*prioritizing it because things _more or less_ work
>> without it, especially if the timeout is temporarily reduced.  We
>> probably agree that timeouts are evil in a virt setting, but even
>> without this issue you can commit the ~70 patches that bring us to 99%
>> of the way.
>> 
>> Once the slate is clean and everybody is focused on the few remaining
>> problems, we can tackle them---including broadcast SMIs.
>> 
>> In fact, the rest of your email lists some more pressing problems than
>> broadcast SMIs, which I'm quoting below to further remind other readers. :)
> 
> Since it involves multiple projects, could it take longer to
> coordinate a change?
> 
> -Jordan
> 
>> 
>> Paolo
>> 
>>>  - Solving the MP-related ExitBootServices() handler bug in CpuDxe.
>>>Patches have been on the list for almost a week. While I've been
>>>breaking my back testing and reviewing patches for others (not
>>>just on edk2-devel, mind you), nobody has batted an eye about
>>>that series.
>>> 
>>>[edk2] [PATCH v2 0/4] UefiCpuPkg/CpuDxe: Fix ExitBootServices()
>>>  callback in the presence of SMIs
>>>http://thread.gmane.org/gmane.comp.bios.edk2.devel/3518
>>> 
>>>Please review it, so that I can include it at the front of my
>>>upcoming v4 SMM series.
>>> 
>>>  - Thirty (30) patches from the SMM series still need reviews. Once
>>>all of the above is covered, I'll update the OvmfPkg/README
>>>patch about the status, and post version 4. I wouldn't mind if
>>>we could commit the series still in 2015, but I'm getting less
>>>and less hopeful.
> ___
> edk2-devel mailing list
> edk2-devel@lists.01.org
> https://lists.01.org/mailman/listinfo/edk2-devel

___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-03 Thread Paolo Bonzini


On 03/11/2015 20:42, Jordan Justen wrote:
> On 2015-11-03 05:45:52, Paolo Bonzini wrote:
>>
>>
>> On 03/11/2015 14:25, Laszlo Ersek wrote:
>>>   - Agreement between Paolo, Jordan and Mike about implementing
>>> broadcast SMIs. I am willing to code up whatever design is
>>> agreed upon. Can everyone involved please prioritize this
>>> discussion a little?
> 
> Obviously we'll have to follow what QEMU decides, so I'm not sure how
> much we can actually influence this.
> 
> Aside from the desire to better emulate chipset/platform designs of
> the actual hardware, I do have a related question.
> 
> If we use the local apic to initiate IPIs to other processors, what
> impact might that have on the state of the local apic if the OS is
> also trying to use it? For example, what if the OS is in the middle of
> trying to send an IPI?

A sequence like

   write to ICR
   out to 0xb2
   write to ICR2

is invalid.

On the other hand, the state of the _destination_ APICs is unaffected by
any IPIs that are delivered between writes to ICR and ICR2.

So it's okay to use the APIC from SMM, as long as it's only done on the
processor that received the synchronous SMI.

> Since it involves multiple projects, could it take longer to
> coordinate a change?

Perhaps, but on the other hand it's just the final 1%.  It's worthless
if we do not have the rest.

Paolo
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-03 Thread Laszlo Ersek
On 11/03/15 14:25, Laszlo Ersek wrote:

> (6) In closing:
> 
> - The LZMA decompression slowdown has been quirked in the upstream
>   kernel. A KVM ioctl() has been introduced that allows QEMU
>   userspace to *disable* any quirks, but as far as I can see in
>   QEMU, this is not being done automatically, nor is it exposed to
>   the user. Therefore we can conclude that this issue has been
>   solved for the mid term at least.

I consider this "done" for now.

> 
> - I have proved that the AP startup slowdown is a separate issue,
>   one that is caused (or "exposed") by the "other" part of host
>   kernel commit b18d5431acc7 -- the part that has lived on to v4.3
>   un-quirked.
> 
>   I'll await Janusz's test results, and then I'll submit my proposed
>   fix (= extending the quirk) to the KVM mailing list. With that we
>   should consider this issue also resolved for OVMF, for the mid
>   term.

Janusz reported back (thanks for that) with positive results (thanks for
*that*), and I have since posted the KVM patch. (As I indicated there,
that patch is not related to the SMM work.)

> - Rather than auditing the MP services implementation in
>   UefiCpuPkg/CpuDxe *right now*, can we *PLEASE* focus on the SMM
>   series?
> 
>   I've been working on it (with some intermittent delays due to
>   external pressure) since *April*. We are so close now. This is
>   what remains to be done:
> 
>   - Agreement between Paolo, Jordan and Mike about implementing
> broadcast SMIs. I am willing to code up whatever design is
> agreed upon. Can everyone involved please prioritize this
> discussion a little?

We can postpone this question for now. I've withdrawn my proof of
concept QEMU patch on qemu-devel, and I shelved the OVMF-side patch too.
This has been made possible by Paolo's idea to return to the relaxed
mode BSP/AP sync mode, *and* to mitigate the perf hit when an AP enters
SMM first / only, and calls in the BSP, by shortening the relevant timeout.

I just wrote a patch for the second part, and retested this approach; it
works well.

>   - Solving the MP-related ExitBootServices() handler bug in CpuDxe.
> Patches have been on the list for almost a week. While I've been
> breaking my back testing and reviewing patches for others (not
> just on edk2-devel, mind you), nobody has batted an eye about
> that series.
> 
> [edk2] [PATCH v2 0/4] UefiCpuPkg/CpuDxe: Fix ExitBootServices()
>   callback in the presence of SMIs
> http://thread.gmane.org/gmane.comp.bios.edk2.devel/3518
> 
> Please review it, so that I can include it at the front of my
> upcoming v4 SMM series.

As I indicated in that thread a few hours (?) ago, I'm withdrawing this
series as well, for now. The series aims to fix a bug in the
ExitBootServices() handler that is only triggered by broadcast SMIs, and
by returning to the relaxed mode / unicast SMI approach (see above),
this bug ceases to block the SMM work. Postponed.

> 
>   - Thirty (30) patches from the SMM series still need reviews. Once
> all of the above is covered, I'll update the OvmfPkg/README
> patch about the status, and post version 4. I wouldn't mind if
> we could commit the series still in 2015, but I'm getting less
> and less hopeful.

I'm in the process of formatting and double checking version 4. I'll
post it later tonight.

The main things to keep in mind about v4 are:

- KVM + X64 + MP + S3 works (the most important use case);
  where S3 resume requires the Ia32X64 build

- The series has no pending dependencies. All prereq patches are
  present in upstream edk2, QEMU, and (queued) for KVM. The
  cross-component diffstat is edk2/OvmfPkg/ only.

Dependent on review feedback, I'll propose v4 for merging.

Thanks
Laszlo
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-03 Thread Jordan Justen
On 2015-11-03 05:45:52, Paolo Bonzini wrote:
> 
> 
> On 03/11/2015 14:25, Laszlo Ersek wrote:
> >   - Agreement between Paolo, Jordan and Mike about implementing
> > broadcast SMIs. I am willing to code up whatever design is
> > agreed upon. Can everyone involved please prioritize this
> > discussion a little?

Obviously we'll have to follow what QEMU decides, so I'm not sure how
much we can actually influence this.

Aside from the desire to better emulate chipset/platform designs of
the actual hardware, I do have a related question.

If we use the local apic to initiate IPIs to other processors, what
impact might that have on the state of the local apic if the OS is
also trying to use it? For example, what if the OS is in the middle of
trying to send an IPI?

I think real platforms get to nicely avoid this by having the chipset
assert SMI to the hardware pins of all processors in the system. It
looks like our code has a provision for having one processor send the
others into SMM, but I wonder under which scenarios this is used, and
how they deal with the local apic state in that case.

> Actually, I was *de*prioritizing it because things _more or less_ work
> without it, especially if the timeout is temporarily reduced.  We
> probably agree that timeouts are evil in a virt setting, but even
> without this issue you can commit the ~70 patches that bring us to 99%
> of the way.
> 
> Once the slate is clean and everybody is focused on the few remaining
> problems, we can tackle them---including broadcast SMIs.
> 
> In fact, the rest of your email lists some more pressing problems than
> broadcast SMIs, which I'm quoting below to further remind other readers. :)

Since it involves multiple projects, could it take longer to
coordinate a change?

-Jordan

> 
> Paolo
> 
> >   - Solving the MP-related ExitBootServices() handler bug in CpuDxe.
> > Patches have been on the list for almost a week. While I've been
> > breaking my back testing and reviewing patches for others (not
> > just on edk2-devel, mind you), nobody has batted an eye about
> > that series.
> > 
> > [edk2] [PATCH v2 0/4] UefiCpuPkg/CpuDxe: Fix ExitBootServices()
> >   callback in the presence of SMIs
> > http://thread.gmane.org/gmane.comp.bios.edk2.devel/3518
> > 
> > Please review it, so that I can include it at the front of my
> > upcoming v4 SMM series.
> > 
> >   - Thirty (30) patches from the SMM series still need reviews. Once
> > all of the above is covered, I'll update the OvmfPkg/README
> > patch about the status, and post version 4. I wouldn't mind if
> > we could commit the series still in 2015, but I'm getting less
> > and less hopeful.
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-03 Thread Paolo Bonzini


On 03/11/2015 14:25, Laszlo Ersek wrote:
>   - Agreement between Paolo, Jordan and Mike about implementing
> broadcast SMIs. I am willing to code up whatever design is
> agreed upon. Can everyone involved please prioritize this
> discussion a little?

Actually, I was *de*prioritizing it because things _more or less_ work
without it, especially if the timeout is temporarily reduced.  We
probably agree that timeouts are evil in a virt setting, but even
without this issue you can commit the ~70 patches that bring us to 99%
of the way.

Once the slate is clean and everybody is focused on the few remaining
problems, we can tackle them---including broadcast SMIs.

In fact, the rest of your email lists some more pressing problems than
broadcast SMIs, which I'm quoting below to further remind other readers. :)

Paolo

>   - Solving the MP-related ExitBootServices() handler bug in CpuDxe.
> Patches have been on the list for almost a week. While I've been
> breaking my back testing and reviewing patches for others (not
> just on edk2-devel, mind you), nobody has batted an eye about
> that series.
> 
> [edk2] [PATCH v2 0/4] UefiCpuPkg/CpuDxe: Fix ExitBootServices()
>   callback in the presence of SMIs
> http://thread.gmane.org/gmane.comp.bios.edk2.devel/3518
> 
> Please review it, so that I can include it at the front of my
> upcoming v4 SMM series.
> 
>   - Thirty (30) patches from the SMM series still need reviews. Once
> all of the above is covered, I'll update the OvmfPkg/README
> patch about the status, and post version 4. I wouldn't mind if
> we could commit the series still in 2015, but I'm getting less
> and less hopeful.
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-03 Thread Laszlo Ersek
On 11/03/15 14:45, Paolo Bonzini wrote:
> 
> 
> On 03/11/2015 14:25, Laszlo Ersek wrote:
>>   - Agreement between Paolo, Jordan and Mike about implementing
>> broadcast SMIs. I am willing to code up whatever design is
>> agreed upon. Can everyone involved please prioritize this
>> discussion a little?
> 
> Actually, I was *de*prioritizing it because things _more or less_ work
> without it, especially if the timeout is temporarily reduced.  We
> probably agree that timeouts are evil in a virt setting, but even
> without this issue you can commit the ~70 patches that bring us to 99%
> of the way.
> 
> Once the slate is clean and everybody is focused on the few remaining
> problems, we can tackle them---including broadcast SMIs.

Works for me. Keeping your patch "OvmfPkg: use relaxed AP SMM
synchronization mode" in the build (and not broadcasting SMIs in either
component) certainly results in a functional binary. The perf penalty
could be fine-tuned by decimating the timeout. (I don't recall if I
*planned* to divide the current timeout by ten, or actually *tested* it.
In any case, I recall thinking that 1/10th of the otherwise seen lag
would be pretty tolerable.)

So, this works for me definitely.

Thank you
Laszlo

> 
> In fact, the rest of your email lists some more pressing problems than
> broadcast SMIs, which I'm quoting below to further remind other readers. :)
> 
> Paolo
> 
>>   - Solving the MP-related ExitBootServices() handler bug in CpuDxe.
>> Patches have been on the list for almost a week. While I've been
>> breaking my back testing and reviewing patches for others (not
>> just on edk2-devel, mind you), nobody has batted an eye about
>> that series.
>>
>> [edk2] [PATCH v2 0/4] UefiCpuPkg/CpuDxe: Fix ExitBootServices()
>>   callback in the presence of SMIs
>> http://thread.gmane.org/gmane.comp.bios.edk2.devel/3518
>>
>> Please review it, so that I can include it at the front of my
>> upcoming v4 SMM series.
>>
>>   - Thirty (30) patches from the SMM series still need reviews. Once
>> all of the above is covered, I'll update the OvmfPkg/README
>> patch about the status, and post version 4. I wouldn't mind if
>> we could commit the series still in 2015, but I'm getting less
>> and less hopeful.

___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-03 Thread Janusz Mocek
W dniu 03.11.2015 o 14:31, Laszlo Ersek pisze:
> On 11/03/15 13:34, Paolo Bonzini wrote:
>>
>> On 03/11/2015 02:14, Laszlo Ersek wrote:
>>> Anyway, with the following host kernel change, the AP startup problem
>>> goes away (tested on top of v4.3):
>>>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a9a198..4f978ad 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -622,7 +622,8 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long 
> cr0)
>   if ((cr0 ^ old_cr0) & update_bits)
>   kvm_mmu_reset_context(vcpu);
>
> - if ((cr0 ^ old_cr0) & X86_CR0_CD)
> + if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED) &&
> + (cr0 ^ old_cr0) & X86_CR0_CD)
>   kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
>
>   return 0;
>>> (Honestly I just imitated fb279950ba here; I'm not making any better
>>> argument for this diff. But, independently, I wonder why this hunk
>>> didn't have the noncoherent DMA check either, originally.)
>> Great job.  I look forward to the testing results.
>>
>> It should also have the noncoherent DMA check, in fact, though that's
>> just an optimization and it would have masked the bug on your system.
> Thank you! I'll wait for Janusz's results, and if stuff works, I'll post
> the patch with the update you're suggesting (zap only if the guest has
> noncoherent DMA and lacks the quirk).
>
> Laszlo
>
>> Thanks,
>>
>> Paolo
>>
>>> Janusz, could you rebuild your host kernel with this patch, and share
>>> the results? (I'm also attaching the same as a formatted patch, so you
>>> can apply it with "git am" easily.)
Looks like its working correctly now

___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-02 Thread Laszlo Ersek
On 10/31/15 18:50, Laszlo Ersek wrote:

> I'm very sorry, but I don't think I can spend time on this, unless
> someone gives me ssh and/or console access to a host that readily
> reproduces the bug, with the latest kvm/master, qemu, and ekd2
> builds.

We just got lucky, the problem reproduces on my oldie workstation (HP Z400, (R) 
Xeon(R) CPU W3550 @ 3.07GHz, family/model/stepping/microcode = 6/26/5/0x19).

OVMF works fine with the most recent Fedora 21 host kernel, 
"kernel-4.1.10-100.fc21.x86_64", and it breaks on the crispy new upstream Linux 
4.3 release.

The OVMF log contains a number of failed ASSERT()s, printed by the APs, in a 
way that is very characteristic of multi-threading bugs:

--*--

Detect CPU count: 2
AAEERRTT  
//hhoomem/el/alcaocso/ss/rscr/cu/pusptsrteraema/me/dekd2k-2g-igti-ts-vsnv/nM/dMedPekPgk/gL/iLbirbarrayr/yB/aBsaesSeySnycnhcrhornoiznatiizoantLiiobn/LSiybn/cShyrnocnhirzoantiizoantGicocn.Gcc(c1.9c8()1:9
 8L)o:c kLVoaclkuVea l=u=e  (=(=U I(N(TUNI)N T2N))  |2|)  L|o|c kLVoaclkuVea 
l=u=e  (=(=U I(N(TUNI)N T1N))^M 
1)
ASSERT 
/home/lacos/src/upstream/edk2-git-svn/MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(198):
 LockValue == ((UINTN) 2) || LockValue == ((UINTN) 1)

--*--

(Note that the above is for a 8 VCPU guest. Cf. "Detect CPU count: 2".)


Maybe important, maybe not: my workstation's IOMMU does *not* have snoop 
control:

[3.043297] DMAR: dmar0: reg_base_addr fed9 ver 1:0 cap c90780106f0462 
ecap f02076

Where 0xf02076 & 0x80 == 0.

Whereas my laptop *does* have Snoop Control:

[0.030486] DMAR: dmar0: reg_base_addr fed9 ver 1:0 cap c020660462 
ecap f0101a
[0.030491] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008020660462 
ecap f010da

The first IOMMU doesn't have Snoop Control, but its scope only extends to the 
integrated Intel Graphics Device. The second IOMMU, which covers the SD card 
reader I used for testing (and all other PCI devices), *does* have Snoop 
Control. 0xf010da & 0x80 = 0x80.

However, SC support should not matter, due to kernel commit 
fb279950ba02e3210a16b11ecfa8871f3ee0ca49, so I think we might be looking an 
independent issue. (Possibly in CpuDxe, only exposed by the new host kernel.)

I'll try to track this down.

Thanks
Laszlo
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-02 Thread Jordan Justen
On 2015-11-02 17:14:12, Laszlo Ersek wrote:
> On 11/02/15 23:49, Laszlo Ersek wrote:
> > On 11/02/15 23:29, Jordan Justen wrote:
> >> On 2015-11-02 12:16:12, Laszlo Ersek wrote:
> >>> On 11/02/15 19:53, Alex Williamson wrote:
>  On Mon, 2015-11-02 at 19:31 +0100, Laszlo Ersek wrote:
> > On 10/31/15 18:50, Laszlo Ersek wrote:
> >
> >> I'm very sorry, but I don't think I can spend time on this,
> >> unless someone gives me ssh and/or console access to a host that
> >> readily reproduces the bug, with the latest kvm/master, qemu, and
> >> ekd2 builds.
> >
> > We just got lucky, the problem reproduces on my oldie workstation
> > (HP Z400, (R) Xeon(R) CPU W3550 @ 3.07GHz,
> > family/model/stepping/microcode = 6/26/5/0x19).
> >
> > OVMF works fine with the most recent Fedora 21 host kernel,
> > "kernel-4.1.10-100.fc21.x86_64", and it breaks on the crispy new
> > upstream Linux 4.3 release.
> >
> > The OVMF log contains a number of failed ASSERT()s, printed by the
> > APs, in a way that is very characteristic of multi-threading bugs:
> >
> > --*--
> >
> > Detect CPU count: 2
> > AAEERRTT  
> > //hhoomem/el/alcaocso/ss/rscr/cu/pusptsrteraema/me/dekd2k-2g-igti-ts-vsnv/nM/dMedPekPgk/gL/iLbirbarrayr/yB/aBsaesSeySnycnhcrhornoiznatiizoantLiiobn/LSiybn/cShyrnocnhirzoantiizoantGicocn.Gcc(c1.9c8()1:9
> >  8L)o:c kLVoaclkuVea l=u=e  (=(=U I(N(TUNI)N T2N))  |2|)  L|o|c 
> > kLVoaclkuVea l=u=e  (=(=U I(N(TUNI)N T1N))^M
> > 1)
> > ASSERT 
> > /home/lacos/src/upstream/edk2-git-svn/MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(198):
> >  LockValue == ((UINTN) 2) || LockValue == ((UINTN) 1)
> >
> > --*--
> >
> > (Note that the above is for a 8 VCPU guest. Cf. "Detect CPU count:
> > 2".)
> >
> >
> > Maybe important, maybe not: my workstation's IOMMU does *not* have
> > snoop control:
> >
> > [3.043297] DMAR: dmar0: reg_base_addr fed9 ver 1:0 cap 
> > c90780106f0462 ecap f02076
> >
> > Where 0xf02076 & 0x80 == 0.
> >
> > Whereas my laptop *does* have Snoop Control:
> >
> > [0.030486] DMAR: dmar0: reg_base_addr fed9 ver 1:0 cap 
> > c020660462 ecap f0101a
> > [0.030491] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap 
> > d2008020660462 ecap f010da
> >
> > The first IOMMU doesn't have Snoop Control, but its scope only
> > extends to the integrated Intel Graphics Device. The second IOMMU,
> > which covers the SD card reader I used for testing (and all other
> > PCI devices), *does* have Snoop Control. 0xf010da & 0x80 = 0x80.
> >
> > However, SC support should not matter, due to kernel commit
> > fb279950ba02e3210a16b11ecfa8871f3ee0ca49, so I think we might be
> > looking an independent issue. (Possibly in CpuDxe, only exposed by
> > the new host
> > kernel.)
> 
>  Snoop Control is what I immediately jumped to as well, but I've
>  never seen the problem on my desktop, which reports the following
>  ecaps: f0101a & f0105a.  Thanks,
> >>>
> >>> This issue is actually very easy to reproduce on my workstation. It
> >>> has 1 socket, 4 cores, 2 threads/core. If I start a guest with 32
> >>> VCPUs (i.e., heavily over-subscribing the host), then the issue hits
> >>> immediately. Device assignment is not necessary.
> >>>
> >>> But, it also reproduces fairly regularly with just 8 VCPUs (even
> >>> with VCPUs pinned to separate logical processors on the host).
> >>>
> >>> Judging purely from the KVM trace captured during such a 32-VCPU
> >>> startup, I'm suspecting a general thread scheduling (or signaling)
> >>> change sometime in the v4.1..v4.2 interval. It takes *extremely*
> >>> long after the second SIPI (which should wake the AP threads) until
> >>> the first "kvm_entry" message shows up for the last nonzero VCPU.
> >>>
> >>> Below is an excerpt. The first line shows the 2nd SIPI being
> >>> delivered to the last VCPU. The lines after that show the first time
> >>> that each nonzero VCPU reports in with "kvm_entry".
> >>>
> >>>  qemu-system-x86-19114 [005]  6668.931183: kvm_apic_accept_irq:  apicid 
> >>> 1f vec 159 (SIPI|edge)
> >>> ...
> >>>  qemu-system-x86-19136  [006]  6668.931953:  kvm_entry:  vcpu  22
> >>>  qemu-system-x86-19133  [003]  6668.931953:  kvm_entry:  vcpu  19
> >>>  qemu-system-x86-19138  [007]  6668.931953:  kvm_entry:  vcpu  24
> >>>  qemu-system-x86-19118  [004]  6668.934967:  kvm_entry:  vcpu  4
> >>>  qemu-system-x86-19119  [006]  6668.941502:  kvm_entry:  vcpu  5
> >>>  qemu-system-x86-19121  [004]  6668.944957:  kvm_entry:  vcpu  7
> >>>  qemu-system-x86-19124  [001]  6668.944957:  kvm_entry:  vcpu  10
> >>>  qemu-system-x86-19137  [000]  6668.948561:  kvm_entry:  vcpu  23
> >>>  qemu-system-x86-19140  [002]  6668.948561:  kvm_entry:  vcpu  26
> >>>  qemu-system-x86-19131  [003]  6668.948561:  kvm_entry:  vcpu  

Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-02 Thread Laszlo Ersek
On 11/02/15 23:29, Jordan Justen wrote:
> On 2015-11-02 12:16:12, Laszlo Ersek wrote:
>> On 11/02/15 19:53, Alex Williamson wrote:
>>> On Mon, 2015-11-02 at 19:31 +0100, Laszlo Ersek wrote:
 On 10/31/15 18:50, Laszlo Ersek wrote:

> I'm very sorry, but I don't think I can spend time on this, unless
> someone gives me ssh and/or console access to a host that readily
> reproduces the bug, with the latest kvm/master, qemu, and ekd2
> builds.

 We just got lucky, the problem reproduces on my oldie workstation (HP 
 Z400, (R) Xeon(R) CPU W3550 @ 3.07GHz, family/model/stepping/microcode = 
 6/26/5/0x19).

 OVMF works fine with the most recent Fedora 21 host kernel, 
 "kernel-4.1.10-100.fc21.x86_64", and it breaks on the crispy new upstream 
 Linux 4.3 release.

 The OVMF log contains a number of failed ASSERT()s, printed by the APs, in 
 a way that is very characteristic of multi-threading bugs:

 --*--

 Detect CPU count: 2
 AAEERRTT  
 //hhoomem/el/alcaocso/ss/rscr/cu/pusptsrteraema/me/dekd2k-2g-igti-ts-vsnv/nM/dMedPekPgk/gL/iLbirbarrayr/yB/aBsaesSeySnycnhcrhornoiznatiizoantLiiobn/LSiybn/cShyrnocnhirzoantiizoantGicocn.Gcc(c1.9c8()1:9
  8L)o:c kLVoaclkuVea l=u=e  (=(=U I(N(TUNI)N T2N))  |2|)  L|o|c 
 kLVoaclkuVea l=u=e  (=(=U I(N(TUNI)N T1N))^M 
 1)
 ASSERT 
 /home/lacos/src/upstream/edk2-git-svn/MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(198):
  LockValue == ((UINTN) 2) || LockValue == ((UINTN) 1)

 --*--

 (Note that the above is for a 8 VCPU guest. Cf. "Detect CPU count: 2".)


 Maybe important, maybe not: my workstation's IOMMU does *not* have snoop 
 control:

 [3.043297] DMAR: dmar0: reg_base_addr fed9 ver 1:0 cap 
 c90780106f0462 ecap f02076

 Where 0xf02076 & 0x80 == 0.

 Whereas my laptop *does* have Snoop Control:

 [0.030486] DMAR: dmar0: reg_base_addr fed9 ver 1:0 cap 
 c020660462 ecap f0101a
 [0.030491] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap 
 d2008020660462 ecap f010da

 The first IOMMU doesn't have Snoop Control, but its scope only extends to 
 the integrated Intel Graphics Device. The second IOMMU, which covers the 
 SD card reader I used for testing (and all other PCI devices), *does* have 
 Snoop Control. 0xf010da & 0x80 = 0x80.

 However, SC support should not matter, due to kernel commit 
 fb279950ba02e3210a16b11ecfa8871f3ee0ca49, so I think we might be looking 
 an independent issue. (Possibly in CpuDxe, only exposed by the new host 
 kernel.)
>>>
>>> Snoop Control is what I immediately jumped to as well, but I've never
>>> seen the problem on my desktop, which reports the following ecaps:
>>> f0101a & f0105a.  Thanks,
>>
>> This issue is actually very easy to reproduce on my workstation. It
>> has 1 socket, 4 cores, 2 threads/core. If I start a guest with 32
>> VCPUs (i.e., heavily over-subscribing the host), then the issue hits
>> immediately. Device assignment is not necessary.
>>
>> But, it also reproduces fairly regularly with just 8 VCPUs (even
>> with VCPUs pinned to separate logical processors on the host).
>>
>> Judging purely from the KVM trace captured during such a 32-VCPU
>> startup, I'm suspecting a general thread scheduling (or signaling)
>> change sometime in the v4.1..v4.2 interval. It takes *extremely*
>> long after the second SIPI (which should wake the AP threads) until
>> the first "kvm_entry" message shows up for the last nonzero VCPU.
>>
>> Below is an excerpt. The first line shows the 2nd SIPI being
>> delivered to the last VCPU. The lines after that show the first time
>> that each nonzero VCPU reports in with "kvm_entry".
>>
>>  qemu-system-x86-19114 [005]  6668.931183: kvm_apic_accept_irq:  apicid 1f 
>> vec 159 (SIPI|edge)
>> ...
>>  qemu-system-x86-19136  [006]  6668.931953:  kvm_entry:  vcpu  22
>>  qemu-system-x86-19133  [003]  6668.931953:  kvm_entry:  vcpu  19
>>  qemu-system-x86-19138  [007]  6668.931953:  kvm_entry:  vcpu  24
>>  qemu-system-x86-19118  [004]  6668.934967:  kvm_entry:  vcpu  4
>>  qemu-system-x86-19119  [006]  6668.941502:  kvm_entry:  vcpu  5
>>  qemu-system-x86-19121  [004]  6668.944957:  kvm_entry:  vcpu  7
>>  qemu-system-x86-19124  [001]  6668.944957:  kvm_entry:  vcpu  10
>>  qemu-system-x86-19137  [000]  6668.948561:  kvm_entry:  vcpu  23
>>  qemu-system-x86-19140  [002]  6668.948561:  kvm_entry:  vcpu  26
>>  qemu-system-x86-19131  [003]  6668.948561:  kvm_entry:  vcpu  17
>>  qemu-system-x86-19122  [007]  6668.948561:  kvm_entry:  vcpu  8
>>  qemu-system-x86-19120  [001]  6668.964832:  kvm_entry:  vcpu  6
>>  qemu-system-x86-19142  [002]  6668.964962:  kvm_entry:  vcpu  28
>>  qemu-system-x86-19125  [000]  6668.964964:  kvm_entry:  vcpu  11
>>  qemu-system-x86-19139  [007]  6668.964965:  kvm_entry:  vcpu  25
>>  qemu-system-x86-19123  [002]  

Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-11-02 Thread Laszlo Ersek
On 11/02/15 23:49, Laszlo Ersek wrote:
> On 11/02/15 23:29, Jordan Justen wrote:
>> On 2015-11-02 12:16:12, Laszlo Ersek wrote:
>>> On 11/02/15 19:53, Alex Williamson wrote:
 On Mon, 2015-11-02 at 19:31 +0100, Laszlo Ersek wrote:
> On 10/31/15 18:50, Laszlo Ersek wrote:
>
>> I'm very sorry, but I don't think I can spend time on this,
>> unless someone gives me ssh and/or console access to a host that
>> readily reproduces the bug, with the latest kvm/master, qemu, and
>> ekd2 builds.
>
> We just got lucky, the problem reproduces on my oldie workstation
> (HP Z400, (R) Xeon(R) CPU W3550 @ 3.07GHz,
> family/model/stepping/microcode = 6/26/5/0x19).
>
> OVMF works fine with the most recent Fedora 21 host kernel,
> "kernel-4.1.10-100.fc21.x86_64", and it breaks on the crispy new
> upstream Linux 4.3 release.
>
> The OVMF log contains a number of failed ASSERT()s, printed by the
> APs, in a way that is very characteristic of multi-threading bugs:
>
> --*--
>
> Detect CPU count: 2
> AAEERRTT  
> //hhoomem/el/alcaocso/ss/rscr/cu/pusptsrteraema/me/dekd2k-2g-igti-ts-vsnv/nM/dMedPekPgk/gL/iLbirbarrayr/yB/aBsaesSeySnycnhcrhornoiznatiizoantLiiobn/LSiybn/cShyrnocnhirzoantiizoantGicocn.Gcc(c1.9c8()1:9
>  8L)o:c kLVoaclkuVea l=u=e  (=(=U I(N(TUNI)N T2N))  |2|)  L|o|c 
> kLVoaclkuVea l=u=e  (=(=U I(N(TUNI)N T1N))^M
> 1)
> ASSERT 
> /home/lacos/src/upstream/edk2-git-svn/MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(198):
>  LockValue == ((UINTN) 2) || LockValue == ((UINTN) 1)
>
> --*--
>
> (Note that the above is for a 8 VCPU guest. Cf. "Detect CPU count:
> 2".)
>
>
> Maybe important, maybe not: my workstation's IOMMU does *not* have
> snoop control:
>
> [3.043297] DMAR: dmar0: reg_base_addr fed9 ver 1:0 cap 
> c90780106f0462 ecap f02076
>
> Where 0xf02076 & 0x80 == 0.
>
> Whereas my laptop *does* have Snoop Control:
>
> [0.030486] DMAR: dmar0: reg_base_addr fed9 ver 1:0 cap 
> c020660462 ecap f0101a
> [0.030491] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap 
> d2008020660462 ecap f010da
>
> The first IOMMU doesn't have Snoop Control, but its scope only
> extends to the integrated Intel Graphics Device. The second IOMMU,
> which covers the SD card reader I used for testing (and all other
> PCI devices), *does* have Snoop Control. 0xf010da & 0x80 = 0x80.
>
> However, SC support should not matter, due to kernel commit
> fb279950ba02e3210a16b11ecfa8871f3ee0ca49, so I think we might be
> looking an independent issue. (Possibly in CpuDxe, only exposed by
> the new host
> kernel.)

 Snoop Control is what I immediately jumped to as well, but I've
 never seen the problem on my desktop, which reports the following
 ecaps: f0101a & f0105a.  Thanks,
>>>
>>> This issue is actually very easy to reproduce on my workstation. It
>>> has 1 socket, 4 cores, 2 threads/core. If I start a guest with 32
>>> VCPUs (i.e., heavily over-subscribing the host), then the issue hits
>>> immediately. Device assignment is not necessary.
>>>
>>> But, it also reproduces fairly regularly with just 8 VCPUs (even
>>> with VCPUs pinned to separate logical processors on the host).
>>>
>>> Judging purely from the KVM trace captured during such a 32-VCPU
>>> startup, I'm suspecting a general thread scheduling (or signaling)
>>> change sometime in the v4.1..v4.2 interval. It takes *extremely*
>>> long after the second SIPI (which should wake the AP threads) until
>>> the first "kvm_entry" message shows up for the last nonzero VCPU.
>>>
>>> Below is an excerpt. The first line shows the 2nd SIPI being
>>> delivered to the last VCPU. The lines after that show the first time
>>> that each nonzero VCPU reports in with "kvm_entry".
>>>
>>>  qemu-system-x86-19114 [005]  6668.931183: kvm_apic_accept_irq:  apicid 1f 
>>> vec 159 (SIPI|edge)
>>> ...
>>>  qemu-system-x86-19136  [006]  6668.931953:  kvm_entry:  vcpu  22
>>>  qemu-system-x86-19133  [003]  6668.931953:  kvm_entry:  vcpu  19
>>>  qemu-system-x86-19138  [007]  6668.931953:  kvm_entry:  vcpu  24
>>>  qemu-system-x86-19118  [004]  6668.934967:  kvm_entry:  vcpu  4
>>>  qemu-system-x86-19119  [006]  6668.941502:  kvm_entry:  vcpu  5
>>>  qemu-system-x86-19121  [004]  6668.944957:  kvm_entry:  vcpu  7
>>>  qemu-system-x86-19124  [001]  6668.944957:  kvm_entry:  vcpu  10
>>>  qemu-system-x86-19137  [000]  6668.948561:  kvm_entry:  vcpu  23
>>>  qemu-system-x86-19140  [002]  6668.948561:  kvm_entry:  vcpu  26
>>>  qemu-system-x86-19131  [003]  6668.948561:  kvm_entry:  vcpu  17
>>>  qemu-system-x86-19122  [007]  6668.948561:  kvm_entry:  vcpu  8
>>>  qemu-system-x86-19120  [001]  6668.964832:  kvm_entry:  vcpu  6
>>>  qemu-system-x86-19142  [002]  6668.964962:  kvm_entry:  vcpu  28
>>>  qemu-system-x86-19125  [000]  

Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-10-31 Thread Laszlo Ersek
On 10/30/15 14:39, Laszlo Ersek wrote:
> On 10/30/15 14:04, Janusz Mocek wrote:
>> W dniu 30.10.2015 o 13:26, Laszlo Ersek pisze:
>>> CC'ing Xiao and Alex again.
>>>
>>> On 10/29/15 19:39, Jordan Justen wrote:
 On 2015-10-29 04:45:37, Laszlo Ersek wrote:
> On 10/29/15 02:32, Jordan Justen wrote:
>> +ASSERT (MaxProcessors > 0);
>> +PcdSet32 (PcdCpuMaxLogicalProcessorNumber, MaxProcessors);
> I think that when this branch is active, then
> PcdCpuApInitTimeOutInMicroSeconds should *also* be set, namely to
> MAX_UINT32 (~71 minutes, the closest we can get to "infinity"). When
> this hint is available from QEMU, then we should practically disable
> the timeout option in CpuDxe's AP counting.
 I think this is a good idea, but I don't think 71 minutes is useful.
 Perhaps 30 seconds? This seems more than adequate for hundreds of
 processors to startup. Or perhaps some timeout based on the number of
 processors?

 Janusz and I were discussing
 https://github.com/tianocore/edk2/issues/21 on irc. We increased the
 timeout to 10 seconds, and with only 8 processors it was still timing
 out.

 Obviously we are somehow failing to start the processors correctly, or
 QEMU/KVM is doing something wrong.

 Have you been able to reproduce this issue? It seems like we need to
 set the timeout to 71 minutes, and then debug QEMU/KVM to see what
 state the APs are in...

 Unfortunately I haven't yet been able to reproduce the bug on my
 system. :(
>>> I've been staring at the following things for a few tens of minutes now:
>>>
>>> (1) Kernel commit b18d5431acc7. Note that the commit changes the return
>>> value of the vmx_get_mt_mask() function *exactly* in the following
>>> case:
>>>
>>>   kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
>>>   (kvm_read_cr0(vcpu) & X86_CR0_CD)
>>>
>>> The first sub-condition is satisfied by GPU passthrough / device
>>> assignment, I think; the second part depends on the VCPU having
>>> turned on (or having *left* on) CR0.CD.
>>>
>>> (2) Consult the vmx_vcpu_reset() function in "arch/x86/kvm/vmx.c"
>>> (current upstream). You will find:
>>>
>>> cr0 = X86_CR0_NW | X86_CR0_CD | X86_CR0_ET;
>>> vmx_set_cr0(vcpu, cr0); /* enter rmode */
>>>
>>> Meaning a VCPU will start with CD and NW set, in real mode, after
>>> re-set.
>>>
>>> This setting dates back to the birth of KVM:
>>>
>>>   commit 6aa8b732ca01c3d7a54e93f4d701b8aabbe60fb7
>>>   Author: Avi Kivity 
>>>   Date: Sun Dec 10 02:21:36 2006 -0800
>>>
>>>   [PATCH] kvm: userspace interface
>>>
>>> Search that commit for "0x6010" (the second hit, although the
>>> comment that contains the first hit is quite telling as well).
>>>
>>> (3) Consult the Intel SDM, Table 11-5. "Cache Operating Modes".
>>>
>>> The (CD, NW) == (1, 1) setting in CR0 is documented as:
>>> - "Memory coherency is not maintained."
>>> - "(P6 family and Pentium processors.) State of the processor after
>>>   a power up or reset. "
>>> - [in footnote 2] "The Pentium 4 and more recent processor families
>>>   do not support this mode; setting the CD and NW bits to 1 selects
>>>   the no-fill cache mode."
>>>
>>> In other words, the settings implemented by vmx_vcpu_reset()
>>> actually invoke the behavior of the "no-fill cache mode" (which is
>>> (CD, NW) == (1, 0)) for all practical purposes.
>>>
>>> (4) Same reference.
>>>
>>> The (CD, NW) == (1, 0) setting in CR0 is documented as:
>>> - "No-fill Cache Mode. Memory coherency is maintained."
>>> - "(Pentium 4 and later processor families.) State of processor
>>>   after a power up or reset. "
>>>
>>> (5) The AsmEnableCache() function in
>>> "MdePkg/Library/BaseLib/Ia32/EnableCache.c". It clears both CD and
>>> NW in CR0.
>>>
>>> (6) This setting ((CD, NW) == (0, 0))is documented in the Intel SDM as:
>>> - "Normal Cache Mode. Highest performance cache operation."
>>>
>>> (7) The AsmEnableCache() function is invoked by MtrrLib
>>> [UefiCpuPkg/Library/MtrrLib/MtrrLib.c] after any and all MTRR
>>> changes. Consider:
>>>
>>> PostMtrrChange() | MtrrSetAllMtrrs()
>>>   PostMtrrChangeEnableCache()
>>> AsmEnableCache()
>>>
>>> Where MtrrSetAllMtrrs() is a public function of the library; plus
>>> PostMtrrChange() is invoked by all of the following public
>>> functions:
>>>
>>> - MtrrSetMemoryAttribute()
>>> - MtrrSetVariableMtrr()
>>> - MtrrSetFixedMtrr()
>>>
>>> (8) Because we call MtrrLib in PlatformPei first, there are two
>>> consequences:
>>>
>>> (a) The boot VCPU has CR0.CD *set* in all parts of OVMF that run
>>> earlier than that.
>>>
>>> This caused a widely reported boot perf regression in SEC (the
>>> LZMA decompression). Ultimately another MTRR change in KVM was
>>> 

Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-10-30 Thread Laszlo Ersek
On 10/30/15 14:04, Janusz Mocek wrote:
> W dniu 30.10.2015 o 13:26, Laszlo Ersek pisze:
>> CC'ing Xiao and Alex again.
>>
>> On 10/29/15 19:39, Jordan Justen wrote:
>>> On 2015-10-29 04:45:37, Laszlo Ersek wrote:
 On 10/29/15 02:32, Jordan Justen wrote:
> +ASSERT (MaxProcessors > 0);
> +PcdSet32 (PcdCpuMaxLogicalProcessorNumber, MaxProcessors);
 I think that when this branch is active, then
 PcdCpuApInitTimeOutInMicroSeconds should *also* be set, namely to
 MAX_UINT32 (~71 minutes, the closest we can get to "infinity"). When
 this hint is available from QEMU, then we should practically disable
 the timeout option in CpuDxe's AP counting.
>>> I think this is a good idea, but I don't think 71 minutes is useful.
>>> Perhaps 30 seconds? This seems more than adequate for hundreds of
>>> processors to startup. Or perhaps some timeout based on the number of
>>> processors?
>>>
>>> Janusz and I were discussing
>>> https://github.com/tianocore/edk2/issues/21 on irc. We increased the
>>> timeout to 10 seconds, and with only 8 processors it was still timing
>>> out.
>>>
>>> Obviously we are somehow failing to start the processors correctly, or
>>> QEMU/KVM is doing something wrong.
>>>
>>> Have you been able to reproduce this issue? It seems like we need to
>>> set the timeout to 71 minutes, and then debug QEMU/KVM to see what
>>> state the APs are in...
>>>
>>> Unfortunately I haven't yet been able to reproduce the bug on my
>>> system. :(
>> I've been staring at the following things for a few tens of minutes now:
>>
>> (1) Kernel commit b18d5431acc7. Note that the commit changes the return
>> value of the vmx_get_mt_mask() function *exactly* in the following
>> case:
>>
>>   kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
>>   (kvm_read_cr0(vcpu) & X86_CR0_CD)
>>
>> The first sub-condition is satisfied by GPU passthrough / device
>> assignment, I think; the second part depends on the VCPU having
>> turned on (or having *left* on) CR0.CD.
>>
>> (2) Consult the vmx_vcpu_reset() function in "arch/x86/kvm/vmx.c"
>> (current upstream). You will find:
>>
>>  cr0 = X86_CR0_NW | X86_CR0_CD | X86_CR0_ET;
>>  vmx_set_cr0(vcpu, cr0); /* enter rmode */
>>
>> Meaning a VCPU will start with CD and NW set, in real mode, after
>> re-set.
>>
>> This setting dates back to the birth of KVM:
>>
>>   commit 6aa8b732ca01c3d7a54e93f4d701b8aabbe60fb7
>>   Author: Avi Kivity 
>>   Date: Sun Dec 10 02:21:36 2006 -0800
>>
>>   [PATCH] kvm: userspace interface
>>
>> Search that commit for "0x6010" (the second hit, although the
>> comment that contains the first hit is quite telling as well).
>>
>> (3) Consult the Intel SDM, Table 11-5. "Cache Operating Modes".
>>
>> The (CD, NW) == (1, 1) setting in CR0 is documented as:
>> - "Memory coherency is not maintained."
>> - "(P6 family and Pentium processors.) State of the processor after
>>   a power up or reset. "
>> - [in footnote 2] "The Pentium 4 and more recent processor families
>>   do not support this mode; setting the CD and NW bits to 1 selects
>>   the no-fill cache mode."
>>
>> In other words, the settings implemented by vmx_vcpu_reset()
>> actually invoke the behavior of the "no-fill cache mode" (which is
>> (CD, NW) == (1, 0)) for all practical purposes.
>>
>> (4) Same reference.
>>
>> The (CD, NW) == (1, 0) setting in CR0 is documented as:
>> - "No-fill Cache Mode. Memory coherency is maintained."
>> - "(Pentium 4 and later processor families.) State of processor
>>   after a power up or reset. "
>>
>> (5) The AsmEnableCache() function in
>> "MdePkg/Library/BaseLib/Ia32/EnableCache.c". It clears both CD and
>> NW in CR0.
>>
>> (6) This setting ((CD, NW) == (0, 0))is documented in the Intel SDM as:
>> - "Normal Cache Mode. Highest performance cache operation."
>>
>> (7) The AsmEnableCache() function is invoked by MtrrLib
>> [UefiCpuPkg/Library/MtrrLib/MtrrLib.c] after any and all MTRR
>> changes. Consider:
>>
>> PostMtrrChange() | MtrrSetAllMtrrs()
>>   PostMtrrChangeEnableCache()
>> AsmEnableCache()
>>
>> Where MtrrSetAllMtrrs() is a public function of the library; plus
>> PostMtrrChange() is invoked by all of the following public
>> functions:
>>
>> - MtrrSetMemoryAttribute()
>> - MtrrSetVariableMtrr()
>> - MtrrSetFixedMtrr()
>>
>> (8) Because we call MtrrLib in PlatformPei first, there are two
>> consequences:
>>
>> (a) The boot VCPU has CR0.CD *set* in all parts of OVMF that run
>> earlier than that.
>>
>> This caused a widely reported boot perf regression in SEC (the
>> LZMA decompression). Ultimately another MTRR change in KVM was
>> reverted, so (as far as I know) this symptom has not been seen
>> recently. (In any case, we should probably fix this sometime...)
>>
>> (b) The 

Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-10-30 Thread Laszlo Ersek
On 10/30/15 13:26, Laszlo Ersek wrote:

>> diff --git a/UefiCpuPkg/CpuDxe/CpuMp.c b/UefiCpuPkg/CpuDxe/CpuMp.c
>> index 3f56faa..e7f5b41 100644
>> --- a/UefiCpuPkg/CpuDxe/CpuMp.c
>> +++ b/UefiCpuPkg/CpuDxe/CpuMp.c
>> @@ -1451,6 +1451,8 @@ ApEntryPointInC (
>>VOID*   TopOfApStack;
>>UINTN   ProcessorNumber;
>>
>> +  AsmEnableCache ();
>> +
>>if (!mAPsAlreadyInitFinished) {
>>  FillInProcessorInformation (FALSE, mMpSystemData.NumberOfProcessors);
>>  TopOfApStack  = (UINT8*)mApStackStart + gApStackSize;
> 
> This should clear CR0.CD, and "undo" kernel commit b18d5431acc7 for
> the AP (by falsifying the second subcondition seen in (1)).
> 
> Janusz, can you please test this one-liner (with no other out-of-tree
> patch applied)?

If it doesn't help (much), we can also try to do this in assembly, in
earlier parts of the startup code. Because, AsmApEntryPoint() in

UefiCpuPkg/CpuDxe/Ia32/MpAsm.asm
UefiCpuPkg/CpuDxe/Ia32/MpAsm.nasm
UefiCpuPkg/CpuDxe/X64/MpAsm.asm
UefiCpuPkg/CpuDxe/X64/MpAsm.nasm

already runs spinlock-like code.

Thanks
Laszlo
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-10-30 Thread Laszlo Ersek
CC'ing Xiao and Alex again.

On 10/29/15 19:39, Jordan Justen wrote:
> On 2015-10-29 04:45:37, Laszlo Ersek wrote:
>> On 10/29/15 02:32, Jordan Justen wrote:
>>> +ASSERT (MaxProcessors > 0);
>>> +PcdSet32 (PcdCpuMaxLogicalProcessorNumber, MaxProcessors);
>>
>> I think that when this branch is active, then
>> PcdCpuApInitTimeOutInMicroSeconds should *also* be set, namely to
>> MAX_UINT32 (~71 minutes, the closest we can get to "infinity"). When
>> this hint is available from QEMU, then we should practically disable
>> the timeout option in CpuDxe's AP counting.
>
> I think this is a good idea, but I don't think 71 minutes is useful.
> Perhaps 30 seconds? This seems more than adequate for hundreds of
> processors to startup. Or perhaps some timeout based on the number of
> processors?
>
> Janusz and I were discussing
> https://github.com/tianocore/edk2/issues/21 on irc. We increased the
> timeout to 10 seconds, and with only 8 processors it was still timing
> out.
>
> Obviously we are somehow failing to start the processors correctly, or
> QEMU/KVM is doing something wrong.
>
> Have you been able to reproduce this issue? It seems like we need to
> set the timeout to 71 minutes, and then debug QEMU/KVM to see what
> state the APs are in...
>
> Unfortunately I haven't yet been able to reproduce the bug on my
> system. :(

I've been staring at the following things for a few tens of minutes now:

(1) Kernel commit b18d5431acc7. Note that the commit changes the return
value of the vmx_get_mt_mask() function *exactly* in the following
case:

  kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
  (kvm_read_cr0(vcpu) & X86_CR0_CD)

The first sub-condition is satisfied by GPU passthrough / device
assignment, I think; the second part depends on the VCPU having
turned on (or having *left* on) CR0.CD.

(2) Consult the vmx_vcpu_reset() function in "arch/x86/kvm/vmx.c"
(current upstream). You will find:

cr0 = X86_CR0_NW | X86_CR0_CD | X86_CR0_ET;
vmx_set_cr0(vcpu, cr0); /* enter rmode */

Meaning a VCPU will start with CD and NW set, in real mode, after
re-set.

This setting dates back to the birth of KVM:

  commit 6aa8b732ca01c3d7a54e93f4d701b8aabbe60fb7
  Author: Avi Kivity 
  Date: Sun Dec 10 02:21:36 2006 -0800

  [PATCH] kvm: userspace interface

Search that commit for "0x6010" (the second hit, although the
comment that contains the first hit is quite telling as well).

(3) Consult the Intel SDM, Table 11-5. "Cache Operating Modes".

The (CD, NW) == (1, 1) setting in CR0 is documented as:
- "Memory coherency is not maintained."
- "(P6 family and Pentium processors.) State of the processor after
  a power up or reset. "
- [in footnote 2] "The Pentium 4 and more recent processor families
  do not support this mode; setting the CD and NW bits to 1 selects
  the no-fill cache mode."

In other words, the settings implemented by vmx_vcpu_reset()
actually invoke the behavior of the "no-fill cache mode" (which is
(CD, NW) == (1, 0)) for all practical purposes.

(4) Same reference.

The (CD, NW) == (1, 0) setting in CR0 is documented as:
- "No-fill Cache Mode. Memory coherency is maintained."
- "(Pentium 4 and later processor families.) State of processor
  after a power up or reset. "

(5) The AsmEnableCache() function in
"MdePkg/Library/BaseLib/Ia32/EnableCache.c". It clears both CD and
NW in CR0.

(6) This setting ((CD, NW) == (0, 0))is documented in the Intel SDM as:
- "Normal Cache Mode. Highest performance cache operation."

(7) The AsmEnableCache() function is invoked by MtrrLib
[UefiCpuPkg/Library/MtrrLib/MtrrLib.c] after any and all MTRR
changes. Consider:

PostMtrrChange() | MtrrSetAllMtrrs()
  PostMtrrChangeEnableCache()
AsmEnableCache()

Where MtrrSetAllMtrrs() is a public function of the library; plus
PostMtrrChange() is invoked by all of the following public
functions:

- MtrrSetMemoryAttribute()
- MtrrSetVariableMtrr()
- MtrrSetFixedMtrr()

(8) Because we call MtrrLib in PlatformPei first, there are two
consequences:

(a) The boot VCPU has CR0.CD *set* in all parts of OVMF that run
earlier than that.

This caused a widely reported boot perf regression in SEC (the
LZMA decompression). Ultimately another MTRR change in KVM was
reverted, so (as far as I know) this symptom has not been seen
recently. (In any case, we should probably fix this sometime...)

(b) The other consequence is that the boot VCPU's CR0.CD is clear in
the rest of OVMF. Which is what makes its speed acceptable, I
guess (as long as no APs are started up).

(9) Our AP startup code massages CR0, but only for mode switches. CR0.CD
and CR0.NW are never touched.

Now, I guess this could be easily added to the assembly encoded as a
C array 

Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-10-29 Thread Jordan Justen
On 2015-10-29 04:45:37, Laszlo Ersek wrote:
> On 10/29/15 02:32, Jordan Justen wrote:
> > +ASSERT (MaxProcessors > 0);
> > +PcdSet32 (PcdCpuMaxLogicalProcessorNumber, MaxProcessors);
> 
> I think that when this branch is active, then
> PcdCpuApInitTimeOutInMicroSeconds should *also* be set, namely to
> MAX_UINT32 (~71 minutes, the closest we can get to "infinity"). When
> this hint is available from QEMU, then we should practically disable the
> timeout option in CpuDxe's AP counting.

I think this is a good idea, but I don't think 71 minutes is useful.
Perhaps 30 seconds? This seems more than adequate for hundreds of
processors to startup. Or perhaps some timeout based on the number of
processors?

Janusz and I were discussing
https://github.com/tianocore/edk2/issues/21 on irc. We increased the
timeout to 10 seconds, and with only 8 processors it was still timing
out.

Obviously we are somehow failing to start the processors correctly, or
QEMU/KVM is doing something wrong.

Have you been able to reproduce this issue? It seems like we need to
set the timeout to 71 minutes, and then debug QEMU/KVM to see what
state the APs are in...

Unfortunately I haven't yet been able to reproduce the bug on my
system. :(

-Jordan
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-10-29 Thread Laszlo Ersek
On 10/29/15 19:39, Jordan Justen wrote:
> On 2015-10-29 04:45:37, Laszlo Ersek wrote:
>> On 10/29/15 02:32, Jordan Justen wrote:
>>> +ASSERT (MaxProcessors > 0);
>>> +PcdSet32 (PcdCpuMaxLogicalProcessorNumber, MaxProcessors);
>>
>> I think that when this branch is active, then
>> PcdCpuApInitTimeOutInMicroSeconds should *also* be set, namely to
>> MAX_UINT32 (~71 minutes, the closest we can get to "infinity"). When
>> this hint is available from QEMU, then we should practically disable the
>> timeout option in CpuDxe's AP counting.
> 
> I think this is a good idea, but I don't think 71 minutes is useful.
> Perhaps 30 seconds? This seems more than adequate for hundreds of
> processors to startup. Or perhaps some timeout based on the number of
> processors?

No, my suggestion with the 71 minutes didn't aim at a "useful" timeout.
Instead, when QEMU provides the number of VCPUs via fw_cfg, I'd like to
take the timeout *completely* out of the picture. Wait until the
advertised number of VCPUs come up, period. If they don't all appear,
then hang forever. Well, at least for 71 minutes, which is the same for
interactive users.

If 30 seconds elapse and we boot with 1 or 2 VCPUs missing, then things
will break hard. I don't actually *expect* this to occur against a 30
second timeout, but 30 seconds still sends the wrong message to the
programmer and the user. It looks like a real, reasonable timeout. While
in this case, the loop should never exit on a timeout, and 0x
communicates that.

> Janusz and I were discussing
> https://github.com/tianocore/edk2/issues/21 on irc. We increased the
> timeout to 10 seconds, and with only 8 processors it was still timing
> out.

Ugh.

> Obviously we are somehow failing to start the processors correctly, or
> QEMU/KVM is doing something wrong.

I think the actual issue we're fighting here is described in
. Due to the
kernel commit named there, and due to a physical device being assigned
to the guest, guest memory becomes uncacheable for each AP, until the AP
clears CR0.CD. I guess... And that should slow it down extremely.

> Have you been able to reproduce this issue?

I think I have, although I didn't try. :) My current host kernel is
based on v4.3-rc3 (upon which kvm/master is based, upon which I have a
fix), and the commit in question (b18d5431acc7) is part of v4.2-rc1.

If you have a host kernel at least as fresh as v4.2-rc1 (and I do, see
above), then you run into the issue automatically. For which reason I've
been carrying my patch referenced above in my development branches --
I've been focusing on the SMM issues, and solving (or working around)
the MP startup problem is a prerequisite for that.

So, yes, saw it, worked around it immediately, forgot about it. :)

> It seems like we need to
> set the timeout to 71 minutes, and then debug QEMU/KVM to see what
> state the APs are in...

I'm a bit overloaded to tackle this right now, but...

> Unfortunately I haven't yet been able to reproduce the bug on my
> system. :(

if you install a host kernel at least as recent as v4.2-rc1, then the
bug should pop up at once.

Thanks
Laszlo

> 
> -Jordan
> 

___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


Re: [edk2] [PATCH 6/6] OvmfPkg/PlatformPei: Set PcdCpuMaxLogicalProcessorNumber using QEMU fw_cfg

2015-10-29 Thread Laszlo Ersek
On 10/29/15 22:13, Laszlo Ersek wrote:
> On 10/29/15 19:39, Jordan Justen wrote:
>> On 2015-10-29 04:45:37, Laszlo Ersek wrote:
>>> On 10/29/15 02:32, Jordan Justen wrote:
 +ASSERT (MaxProcessors > 0);
 +PcdSet32 (PcdCpuMaxLogicalProcessorNumber, MaxProcessors);
>>>
>>> I think that when this branch is active, then
>>> PcdCpuApInitTimeOutInMicroSeconds should *also* be set, namely to
>>> MAX_UINT32 (~71 minutes, the closest we can get to "infinity"). When
>>> this hint is available from QEMU, then we should practically disable the
>>> timeout option in CpuDxe's AP counting.
>>
>> I think this is a good idea, but I don't think 71 minutes is useful.
>> Perhaps 30 seconds? This seems more than adequate for hundreds of
>> processors to startup. Or perhaps some timeout based on the number of
>> processors?
> 
> No, my suggestion with the 71 minutes didn't aim at a "useful" timeout.
> Instead, when QEMU provides the number of VCPUs via fw_cfg, I'd like to
> take the timeout *completely* out of the picture. Wait until the
> advertised number of VCPUs come up, period. If they don't all appear,
> then hang forever. Well, at least for 71 minutes, which is the same for
> interactive users.
> 
> If 30 seconds elapse and we boot with 1 or 2 VCPUs missing, then things
> will break hard. I don't actually *expect* this to occur against a 30
> second timeout, but 30 seconds still sends the wrong message to the
> programmer and the user. It looks like a real, reasonable timeout. While
> in this case, the loop should never exit on a timeout, and 0x
> communicates that.
> 
>> Janusz and I were discussing
>> https://github.com/tianocore/edk2/issues/21 on irc. We increased the
>> timeout to 10 seconds, and with only 8 processors it was still timing
>> out.
> 
> Ugh.
> 
>> Obviously we are somehow failing to start the processors correctly, or
>> QEMU/KVM is doing something wrong.
> 
> I think the actual issue we're fighting here is described in
> . Due to the
> kernel commit named there, and due to a physical device being assigned
> to the guest, guest memory becomes uncacheable for each AP, until the AP
> clears CR0.CD. I guess... And that should slow it down extremely.
> 
>> Have you been able to reproduce this issue?
> 
> I think I have, although I didn't try. :) My current host kernel is
> based on v4.3-rc3 (upon which kvm/master is based, upon which I have a
> fix), and the commit in question (b18d5431acc7) is part of v4.2-rc1.
> 
> If you have a host kernel at least as fresh as v4.2-rc1 (and I do, see
> above), then you run into the issue automatically. For which reason I've
> been carrying my patch referenced above in my development branches --
> I've been focusing on the SMM issues, and solving (or working around)
> the MP startup problem is a prerequisite for that.
> 
> So, yes, saw it, worked around it immediately, forgot about it. :)
> 
>> It seems like we need to
>> set the timeout to 71 minutes, and then debug QEMU/KVM to see what
>> state the APs are in...
> 
> I'm a bit overloaded to tackle this right now, but...
> 
>> Unfortunately I haven't yet been able to reproduce the bug on my
>> system. :(
> 
> if you install a host kernel at least as recent as v4.2-rc1, then the
> bug should pop up at once.

Sorry, forgot about a tiny circumstance: you'll also have to assign a
physical device (like a GPU) to your guest. Now *that* is a whole story
per se. See .

In fact, I'm now thinking that I might not have reproduced the issue at
all! (And perhaps I just applied the workaround patch as precaution.) My
workstation does have an assigned GPU, but the kernel on it is older
than v4.2-rc1 -- no issues there. And on my laptop (which has the recent
kernel) no devices can be reasonably assigned to guests. (It has VT-d,
and even the IOMMU groups look good, but the PCI devices are not
"discrete" enough from a guest driver perspective.)

Apologies, I can't help with this right now...

Thanks
Laszlo
___
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel