Re: [Qemu-devel] [PATCH] qga: ignore non present cpus when handling qmp_guest_get_vcpus()

2018-08-31 Thread Laszlo Ersek
On 08/31/18 10:00, Igor Mammedov wrote:
> On Thu, 30 Aug 2018 17:51:13 +0200
> Laszlo Ersek  wrote:
> 
>> +Drew
>>
>> On 08/30/18 14:08, Igor Mammedov wrote:
>>> If a VM has VCPUs plugged sparsely (for example a VM started with
>>> 3 VCPUs (cpu0, cpu1 and cpu2) and then cpu1 was hot-unplugged so
>>> only cpu0 and cpu2 are present), QGA will raise an error
>>>   error: internal error: unable to execute QEMU agent command 
>>> 'guest-get-vcpus':
>>>   open("/sys/devices/system/cpu/cpu1/"): No such file or directory
>>> when
>>>   virsh vcpucount FOO --guest
>>> is executed.
>>> Fix it by ignoring non present CPUs when fetching CPUs status from sysfs.
>>>
>>> Signed-off-by: Igor Mammedov 
>>> ---
>>>  qga/commands-posix.c | 4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/qga/commands-posix.c b/qga/commands-posix.c
>>> index 37e8a2d..2929872 100644
>>> --- a/qga/commands-posix.c
>>> +++ b/qga/commands-posix.c
>>> @@ -2044,7 +2044,9 @@ static void transfer_vcpu(GuestLogicalProcessor 
>>> *vcpu, bool sys2vcpu,
>>>vcpu->logical_id);
>>>  dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
>>>  if (dirfd == -1) {
>>> -error_setg_errno(errp, errno, "open(\"%s\")", dirpath);
>>> +if (!(sys2vcpu && errno == ENOENT)) {
>>> +error_setg_errno(errp, errno, "open(\"%s\")", dirpath);
>>> +}
>>>  } else {
>>>  static const char fn[] = "online";
>>>  int fd;
>>>   
>>
>> Originally these guest agent commands (both getting and setting) were
>> meant to be used in the absence of real VCPU hot[un]plug, as a fallback
>> / place-holder.
>>
>> If the latter (= real VCPU hot(un)plug) works, then these guest agent
>> commands shouldn't be used at all.
> Technically there's no reason for "get" not to work in the sparse use case,
> hence the patch.
> 
>> Drew, do I remember correctly? ... The related RHBZ is
>> <https://bugzilla.redhat.com/show_bug.cgi?id=924684>. (It's a private
>> one, and I'm not at liberty to open it up, so my apologies to non-RH folks.)
>>
>> Anyway, given that "set" should be a subset of the "get" return value
>> (as documented in the command schema), if we fix up "get" to work with
>> sparse topologies, then "set" should work at once.
>>
>> However... as far as I understand, this change will allow
>> qmp_guest_get_vcpus() to produce a GuestLogicalProcessor object for the
>> missing (hot-unplugged) VCPU, with the following contents:
>> - @logical-id: populated by the loop,
>> - @online: false (from g_malloc0()),
>> - @can-offline: present (from the loop), and false (from g_malloc0()).
>>
>> The smaller problem with this might be that "online==false &&
>> can-offline==false" is nonsensical and has never been returned before. I
>> don't know how higher level apps will react.
>>
>> The larger problem might be that a higher level app could simply copy
>> this output structure into the next "set" call unchanged, and then that
>> "set" call will fail.
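
The state-combination concern above can be illustrated with a hypothetical helper (not QGA code; the name and reduced types are assumptions for this sketch): of the four (online, can-offline) pairs a "get" result could carry, only online==false && can_offline==false is nonsensical, describing an offline CPU that supposedly cannot be offlined.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sanity check for a GuestLogicalProcessor's state pair.
 * Every combination is plausible except "offline but cannot be
 * offlined", which no previous "get" call ever returned. */
static bool vcpu_state_plausible(bool online, bool can_offline)
{
    return online || can_offline;
}
```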
> It seems that libvirt survives such outrageous output.
> 
>> I wonder if, instead of this patch, we should rework
>> qmp_guest_get_vcpus(), to silently skip processors for which this
>> dirpath ENOENT condition arises (i.e., return a shorter list of
>> GuestLogicalProcessor objects).
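
The skip-and-shorten idea above can be sketched as follows. This is a simplified model, not the real QGA code: presence is simulated with an array instead of probing /sys/devices/system/cpu/cpu%d/, and the list node type is reduced to what the sketch needs.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified model of the proposed rework: build the result list only
 * for present vCPUs, producing a shorter list instead of an error.
 * A missing sysfs directory (ENOENT) is modeled as present[i] == 0. */
typedef struct VcpuEntry {
    long logical_id;
    struct VcpuEntry *next;
} VcpuEntry;

static VcpuEntry *list_present_vcpus(const int *present, int max_cpus)
{
    VcpuEntry *head = NULL;
    VcpuEntry **link = &head;

    for (int i = 0; i < max_cpus; i++) {
        if (!present[i]) {
            continue;           /* dirpath ENOENT: skip, emit no entry */
        }
        VcpuEntry *e = calloc(1, sizeof *e);
        e->logical_id = i;
        *link = e;
        link = &e->next;
    }
    return head;
}

static int list_len(const VcpuEntry *e)
{
    int n = 0;
    for (; e; e = e->next) {
        n++;
    }
    return n;
}
```

With cpu1 hot-unplugged, the caller would receive entries for cpu0 and cpu2 only, and a subsequent "set" built from that output cannot trip over the absent CPU.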
> 
>> But, again, I wouldn't mix this guest agent command with real VCPU
>> hot(un)plug in the first place. The latter is much-much better, so if
>> it's available, use that exclusively?
> Agreed,
> 
> Maybe we can block the invalid use case on the libvirt side with a clearer
> error message, as libvirt sort of knows that sparse CPUs are supported.

OK -- I see libvir-list is CC'd, so let's hear what they prefer :)

Thanks
Laszlo



Re: [Qemu-devel] [PATCH] qga: ignore non present cpus when handling qmp_guest_get_vcpus()

2018-09-06 Thread Laszlo Ersek
On 09/06/18 11:49, Igor Mammedov wrote:
> On Thu, 30 Aug 2018 17:51:13 +0200
> Laszlo Ersek  wrote:
> 
>> +Drew
>>
>> On 08/30/18 14:08, Igor Mammedov wrote:
>>> If a VM has VCPUs plugged sparsely (for example a VM started with
>>> 3 VCPUs (cpu0, cpu1 and cpu2) and then cpu1 was hot-unplugged so
>>> only cpu0 and cpu2 are present), QGA will raise an error
>>>   error: internal error: unable to execute QEMU agent command 
>>> 'guest-get-vcpus':
>>>   open("/sys/devices/system/cpu/cpu1/"): No such file or directory
>>> when
>>>   virsh vcpucount FOO --guest
>>> is executed.
>>> Fix it by ignoring non present CPUs when fetching CPUs status from sysfs.
>>>
>>> Signed-off-by: Igor Mammedov 
>>> ---
>>>  qga/commands-posix.c | 4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/qga/commands-posix.c b/qga/commands-posix.c
>>> index 37e8a2d..2929872 100644
>>> --- a/qga/commands-posix.c
>>> +++ b/qga/commands-posix.c
>>> @@ -2044,7 +2044,9 @@ static void transfer_vcpu(GuestLogicalProcessor 
>>> *vcpu, bool sys2vcpu,
>>>vcpu->logical_id);
>>>  dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
>>>  if (dirfd == -1) {
>>> -error_setg_errno(errp, errno, "open(\"%s\")", dirpath);
>>> +if (!(sys2vcpu && errno == ENOENT)) {
>>> +error_setg_errno(errp, errno, "open(\"%s\")", dirpath);
>>> +}
>>>  } else {
>>>  static const char fn[] = "online";
>>>  int fd;
>>>   
> [...]
>  
>> I wonder if, instead of this patch, we should rework
>> qmp_guest_get_vcpus(), to silently skip processors for which this
>> dirpath ENOENT condition arises (i.e., return a shorter list of
>> GuestLogicalProcessor objects).
> would something like this on top of this patch do?
> 
> diff --git a/qga/commands-posix.c b/qga/commands-posix.c
> index 2929872..990bb80 100644
> --- a/qga/commands-posix.c
> +++ b/qga/commands-posix.c
> @@ -2114,12 +2114,14 @@ GuestLogicalProcessorList *qmp_guest_get_vcpus(Error 
> **errp)
>  vcpu->logical_id = current++;
>  vcpu->has_can_offline = true; /* lolspeak ftw */
>  transfer_vcpu(vcpu, true, &local_err);
> -
> -entry = g_malloc0(sizeof *entry);
> -entry->value = vcpu;
> -
> -*link = entry;
> -link = &entry->next;
> +if (errno == ENOENT) {
> +g_free(vcpu);
> +} else {
> +entry = g_malloc0(sizeof *entry);
> +entry->value = vcpu;
> +*link = entry;
> +link = &entry->next;
> +}
>  }
>  
>  if (local_err == NULL) {
> 
> [...]
> 

To me that looks like the right approach, but the details should be
polished a bit:

- After we drop the vcpu object, "local_err" is still set, and would
terminate the loop in the next iteration.

- It seems like ENOENT can indeed only come from openat(), in
transfer_vcpu(), however, it would be nice if we could grab the error
code from the error object somehow, and not from the "errno" variable. I
vaguely recall this is what error classes were originally invented for,
but now we just use ERROR_CLASS_GENERIC_ERROR...

How about this: we could add a boolean output param to transfer_vcpu(),
called "fatal". Ignored when the function succeeds. When the function
fails (seen from "local_err"), the loop consults "fatal". If the error
is fatal, we act as before; otherwise, we drop the vcpu object, release
-- and zero out -- "local_err" as well, and continue. I think this is
more generic / safer than trying to infer the failure location from the
outside.
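
A minimal sketch of that "fatal" out-parameter, with hypothetical names and the Error object reduced to an int code so the sketch is self-contained: transfer_vcpu() would clear *fatal only for the skippable ENOENT-on-directory case, and the caller drops the entry and continues.

```c
#include <assert.h>
#include <stdbool.h>

enum { OK = 0, ERR_ABSENT = 1, ERR_IO = 2 };

/* Stand-in for transfer_vcpu(): pretend cpu1 is absent (hot-unplugged)
 * and cpu3 hits an unrelated I/O error. */
static int probe_vcpu(int id, bool *fatal)
{
    if (id == 1) {
        *fatal = false;         /* dirpath ENOENT: non-fatal, skip entry */
        return ERR_ABSENT;
    }
    if (id == 3) {
        *fatal = true;          /* any other failure stays fatal */
        return ERR_IO;
    }
    return OK;
}

/* Returns the number of vCPUs listed, or -1 on a fatal error. */
static int collect_vcpus(int max_cpus)
{
    int listed = 0;

    for (int id = 0; id < max_cpus; id++) {
        bool fatal = true;
        int err = probe_vcpu(id, &fatal);

        if (err != OK) {
            if (fatal) {
                return -1;      /* act as before: propagate the error */
            }
            continue;           /* drop the entry, clear the error, go on */
        }
        listed++;
    }
    return listed;
}
```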

I'm not quite up to date on structured error propagation in QEMU, so
take the above with a grain of salt...

Thanks,
Laszlo



Re: [Qemu-devel] [PATCH] qga: ignore non present cpus when handling qmp_guest_get_vcpus()

2018-09-06 Thread Laszlo Ersek
On 09/06/18 12:50, Igor Mammedov wrote:
> On Thu, 6 Sep 2018 12:26:12 +0200
> Laszlo Ersek  wrote:
> 
>> On 09/06/18 11:49, Igor Mammedov wrote:
>>> On Thu, 30 Aug 2018 17:51:13 +0200
>>> Laszlo Ersek  wrote:
>>>   
>>>> +Drew
>>>>
>>>> On 08/30/18 14:08, Igor Mammedov wrote:  
>>>>> If a VM has VCPUs plugged sparsely (for example a VM started with
>>>>> 3 VCPUs (cpu0, cpu1 and cpu2) and then cpu1 was hot-unplugged so
>>>>> only cpu0 and cpu2 are present), QGA will raise an error
>>>>>   error: internal error: unable to execute QEMU agent command 
>>>>> 'guest-get-vcpus':
>>>>>   open("/sys/devices/system/cpu/cpu1/"): No such file or directory
>>>>> when
>>>>>   virsh vcpucount FOO --guest
>>>>> is executed.
>>>>> Fix it by ignoring non present CPUs when fetching CPUs status from sysfs.
>>>>>
>>>>> Signed-off-by: Igor Mammedov 
>>>>> ---
>>>>>  qga/commands-posix.c | 4 +++-
>>>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/qga/commands-posix.c b/qga/commands-posix.c
>>>>> index 37e8a2d..2929872 100644
>>>>> --- a/qga/commands-posix.c
>>>>> +++ b/qga/commands-posix.c
>>>>> @@ -2044,7 +2044,9 @@ static void transfer_vcpu(GuestLogicalProcessor 
>>>>> *vcpu, bool sys2vcpu,
>>>>>vcpu->logical_id);
>>>>>  dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
>>>>>  if (dirfd == -1) {
>>>>> -error_setg_errno(errp, errno, "open(\"%s\")", dirpath);
>>>>> +if (!(sys2vcpu && errno == ENOENT)) {
>>>>> +error_setg_errno(errp, errno, "open(\"%s\")", dirpath);
>>>>> +}
>>>>>  } else {
>>>>>  static const char fn[] = "online";
>>>>>  int fd;
>>>>> 
>>> [...]
>>>
>>>> I wonder if, instead of this patch, we should rework
>>>> qmp_guest_get_vcpus(), to silently skip processors for which this
>>>> dirpath ENOENT condition arises (i.e., return a shorter list of
>>>> GuestLogicalProcessor objects).  
>>> would something like this on top of this patch do?
>>>
>>> diff --git a/qga/commands-posix.c b/qga/commands-posix.c
>>> index 2929872..990bb80 100644
>>> --- a/qga/commands-posix.c
>>> +++ b/qga/commands-posix.c
>>> @@ -2114,12 +2114,14 @@ GuestLogicalProcessorList 
>>> *qmp_guest_get_vcpus(Error **errp)
>>>  vcpu->logical_id = current++;
>>>  vcpu->has_can_offline = true; /* lolspeak ftw */
>>>  transfer_vcpu(vcpu, true, &local_err);
>>> -
>>> -entry = g_malloc0(sizeof *entry);
>>> -entry->value = vcpu;
>>> -
>>> -*link = entry;
>>> -link = &entry->next;
>>> +if (errno == ENOENT) {
>>> +g_free(vcpu);
>>> +} else {
>>> +entry = g_malloc0(sizeof *entry);
>>> +entry->value = vcpu;
>>> +*link = entry;
>>> +link = &entry->next;
>>> +}
>>>  }
>>>  
>>>  if (local_err == NULL) {
>>>
>>> [...]
>>>   
>>
>> To me that looks like the right approach, but the details should be
>> polished a bit:
>>
>> - After we drop the vcpu object, "local_err" is still set, and would
>> terminate the loop in the next iteration.
> local_err is not set, due to the 'if (!(sys2vcpu && errno == ENOENT))'
> condition in transfer_vcpu().

ah, sorry, you did say this was on top of your original patch, but I had
forgotten the details of that.

> The thing is that in the vcpu2sys direction, ENOENT is a hard error.
> 
>> - It seems like ENOENT can indeed only come from openat(), in
>> transfer_vcpu(), however, it would be nice if we could grab the error
>> code from the error object somehow, and not from the "errno" variable. I
>> vaguely recall this is what error classes were originally invented for,
>> but now we just use ERROR_CLASS_GENERIC_ERROR...
> I've checked it and errno is preserved during error_setg_errno() call but
> not saved in Error, so I've dropped that idea.
>  
>> How about this: we could add a boolean output param to transfer_vcpu(),
>> called "fatal". Ignored when the function succeeds. When the function
>> fails (seen from "local_err"), the loop consults "fatal". If the error
>> is fatal, we act as before; otherwise, we drop the vcpu object, release
>> -- and zero out -- "local_err" as well, and continue. I think this is
>> more generic / safer than trying to infer the failure location from the
>> outside.
> It looked uglier to me, so I turned to the libc style of reporting
> (assuming that g_free() never touches errno).
> 
> But if you prefer using an extra parameter, I'll respin the patch with it.

Michael, what is your preference? I guess I'll be fine both ways.

Thanks,
Laszlo

> 
>> I'm not quite up to date on structured error propagation in QEMU, so
>> take the above with a grain of salt...
>>
>> Thanks,
>> Laszlo
> 




Re: [Qemu-devel] [PATCH v2] qga: ignore non present cpus when handling qmp_guest_get_vcpus()

2018-09-06 Thread Laszlo Ersek
On 09/06/18 14:51, Igor Mammedov wrote:
> If a VM has VCPUs plugged sparsely (for example a VM started with
> 3 VCPUs (cpu0, cpu1 and cpu2) and then cpu1 was hot-unplugged so
> only cpu0 and cpu2 are present), QGA will raise an error
>   error: internal error: unable to execute QEMU agent command 
> 'guest-get-vcpus':
>   open("/sys/devices/system/cpu/cpu1/"): No such file or directory
> when
>   virsh vcpucount FOO --guest
> is executed.
> Fix it by ignoring non present CPUs when fetching CPUs status from sysfs.
>
> Signed-off-by: Igor Mammedov 
> ---
> v2:
>   do not create CPU entry if cpu isn't present
>   (Laszlo Ersek )
> ---
>  qga/commands-posix.c | 115 
> ++-
>  1 file changed, 59 insertions(+), 56 deletions(-)
>
> diff --git a/qga/commands-posix.c b/qga/commands-posix.c
> index 37e8a2d..42d30f0 100644
> --- a/qga/commands-posix.c
> +++ b/qga/commands-posix.c
> @@ -2035,61 +2035,56 @@ static long sysconf_exact(int name, const char 
> *name_str, Error **errp)
>   * Written members remain unmodified on error.
>   */
>  static void transfer_vcpu(GuestLogicalProcessor *vcpu, bool sys2vcpu,
> -  Error **errp)
> +  char *dirpath, Error **errp)
>  {
> -char *dirpath;
> +int fd;
> +int res;
>  int dirfd;
> +static const char fn[] = "online";
>
> -dirpath = g_strdup_printf("/sys/devices/system/cpu/cpu%" PRId64 "/",
> -  vcpu->logical_id);
>  dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
>  if (dirfd == -1) {
>  error_setg_errno(errp, errno, "open(\"%s\")", dirpath);
> -} else {
> -static const char fn[] = "online";
> -int fd;
> -int res;
> -
> -fd = openat(dirfd, fn, sys2vcpu ? O_RDONLY : O_RDWR);
> -if (fd == -1) {
> -if (errno != ENOENT) {
> -error_setg_errno(errp, errno, "open(\"%s/%s\")", dirpath, 
> fn);
> -} else if (sys2vcpu) {
> -vcpu->online = true;
> -vcpu->can_offline = false;
> -} else if (!vcpu->online) {
> -error_setg(errp, "logical processor #%" PRId64 " can't be "
> -   "offlined", vcpu->logical_id);
> -} /* otherwise pretend successful re-onlining */
> -} else {
> -unsigned char status;
> -
> -res = pread(fd, &status, 1, 0);
> -if (res == -1) {
> -error_setg_errno(errp, errno, "pread(\"%s/%s\")", dirpath, 
> fn);
> -} else if (res == 0) {
> -error_setg(errp, "pread(\"%s/%s\"): unexpected EOF", dirpath,
> -   fn);
> -} else if (sys2vcpu) {
> -vcpu->online = (status != '0');
> -vcpu->can_offline = true;
> -} else if (vcpu->online != (status != '0')) {
> -status = '0' + vcpu->online;
> -if (pwrite(fd, &status, 1, 0) == -1) {
> -error_setg_errno(errp, errno, "pwrite(\"%s/%s\")", 
> dirpath,
> - fn);
> -}
> -} /* otherwise pretend successful re-(on|off)-lining */
> +return;
> +}
>
> -res = close(fd);
> -g_assert(res == 0);
> -}
> +fd = openat(dirfd, fn, sys2vcpu ? O_RDONLY : O_RDWR);
> +if (fd == -1) {
> +if (errno != ENOENT) {
> +error_setg_errno(errp, errno, "open(\"%s/%s\")", dirpath, fn);
> +} else if (sys2vcpu) {
> +vcpu->online = true;
> +vcpu->can_offline = false;
> +} else if (!vcpu->online) {
> +error_setg(errp, "logical processor #%" PRId64 " can't be "
> +   "offlined", vcpu->logical_id);
> +} /* otherwise pretend successful re-onlining */
> +} else {
> +unsigned char status;
> +
> +res = pread(fd, &status, 1, 0);
> +if (res == -1) {
> +error_setg_errno(errp, errno, "pread(\"%s/%s\")", dirpath, fn);
> +} else if (res == 0) {
> +error_setg(errp, "pread(\"%s/%s\"): unexpected EOF", dirpath,
> +   fn);
> +} else if (sys2vcpu) {
> +

Re: [Qemu-devel] [PATCH v10 6/6] tpm: add ACPI memory clear interface

2018-09-06 Thread Laszlo Ersek
On 09/06/18 19:23, Dr. David Alan Gilbert wrote:
> * Marc-André Lureau (marcandre.lur...@gmail.com) wrote:
>> Hi
>>
>> On Thu, Sep 6, 2018 at 1:42 PM Dr. David Alan Gilbert
>>  wrote:
>>>
>>> * Marc-André Lureau (marcandre.lur...@gmail.com) wrote:
 Hi

 On Thu, Sep 6, 2018 at 12:59 PM Dr. David Alan Gilbert
  wrote:
>
> * Marc-André Lureau (marcandre.lur...@gmail.com) wrote:
>> Hi
>>
>> On Thu, Sep 6, 2018 at 11:58 AM Igor Mammedov  
>> wrote:
>>>
>>> On Thu, 6 Sep 2018 07:50:09 +0400
>>> Marc-André Lureau  wrote:
>>>
 Hi

 On Tue, Sep 4, 2018 at 10:47 AM Igor Mammedov  
 wrote:
>
> On Fri, 31 Aug 2018 19:24:24 +0200
> Marc-André Lureau  wrote:
>
>> This allows to pass the last failing test from the Windows HLK TPM 
>> 2.0
>> TCG PPI 1.3 tests.
>>
>> The interface is described in the "TCG Platform Reset Attack
>> Mitigation Specification", chapter 6 "ACPI _DSM Function". According
>> to Laszlo, it's not so easy to implement in OVMF, he suggested to do
>> it in qemu instead.
>>
>> Signed-off-by: Marc-André Lureau 
>> ---
>>  hw/tpm/tpm_ppi.h |  2 ++
>>  hw/i386/acpi-build.c | 46 
>> 
>>  hw/tpm/tpm_crb.c |  1 +
>>  hw/tpm/tpm_ppi.c | 23 ++
>>  hw/tpm/tpm_tis.c |  1 +
>>  docs/specs/tpm.txt   |  2 ++
>>  hw/tpm/trace-events  |  3 +++
>>  7 files changed, 78 insertions(+)
>>
>> diff --git a/hw/tpm/tpm_ppi.h b/hw/tpm/tpm_ppi.h
>> index f6458bf87e..3239751e9f 100644
>> --- a/hw/tpm/tpm_ppi.h
>> +++ b/hw/tpm/tpm_ppi.h
>> @@ -23,4 +23,6 @@ typedef struct TPMPPI {
>>  bool tpm_ppi_init(TPMPPI *tpmppi, struct MemoryRegion *m,
>>hwaddr addr, Object *obj, Error **errp);
>>
>> +void tpm_ppi_reset(TPMPPI *tpmppi);
>> +
>>  #endif /* TPM_TPM_PPI_H */
>> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
>> index c5e9a6e11d..2ab3e8fae7 100644
>> --- a/hw/i386/acpi-build.c
>> +++ b/hw/i386/acpi-build.c
>> @@ -1824,6 +1824,13 @@ build_tpm_ppi(TPMIf *tpm, Aml *dev)
>>  pprq = aml_name("PPRQ");
>>  pprm = aml_name("PPRM");
>>
>> +aml_append(dev,
>> +   aml_operation_region("TPP3", AML_SYSTEM_MEMORY,
>> +aml_int(TPM_PPI_ADDR_BASE + 
>> 0x15a),
>> +0x1));
>> +field = aml_field("TPP3", AML_BYTE_ACC, AML_NOLOCK, 
>> AML_PRESERVE);
>> +aml_append(field, aml_named_field("MOVV", 8));
>> +aml_append(dev, field);
>>  /*
>>   * DerefOf in Windows is broken with SYSTEM_MEMORY.  Use a 
>> dynamic
>>   * operation region inside of a method for getting FUNC[op].
>> @@ -2166,7 +2173,46 @@ build_tpm_ppi(TPMIf *tpm, Aml *dev)
>>  aml_append(ifctx, aml_return(aml_buffer(1, zerobyte)));
>>  }
>>  aml_append(method, ifctx);
>> +
>> +ifctx = aml_if(
>> +aml_equal(uuid,
>> +  
>> aml_touuid("376054ED-CC13-4675-901C-4756D7F2D45D")));
>> +{
>> +/* standard DSM query function */
>> +ifctx2 = aml_if(aml_equal(function, zero));
>> +{
>> +uint8_t byte_list[1] = { 0x03 };
>> +aml_append(ifctx2, aml_return(aml_buffer(1, 
>> byte_list)));
>> +}
>> +aml_append(ifctx, ifctx2);
>> +
>> +/*
>> + * TCG Platform Reset Attack Mitigation Specification 
>> 1.0 Ch.6
>> + *
>> + * Arg 2 (Integer): Function Index = 1
>> + * Arg 3 (Package): Arguments = Package: Type: Integer
>> + *  Operation Value of the Request
>> + * Returns: Type: Integer
>> + *  0: Success
>> + *  1: General Failure
>> + */
>> +ifctx2 = aml_if(aml_equal(function, one));
>> +{
>> +aml_append(ifctx2,
>> +   
>> aml_store(aml_derefof(aml_index(arguments, zero)),
>> + op));
>> +{
>> +aml_append(ifctx2, aml_store(op, 
>> aml_name("MOVV")));
>> +
>> +/* 0: s

Re: [Qemu-devel] [PATCH v2] qga: ignore non present cpus when handling qmp_guest_get_vcpus()

2018-09-07 Thread Laszlo Ersek
On 09/07/18 13:30, Igor Mammedov wrote:
> On Thu, 6 Sep 2018 16:13:52 +0200
> Laszlo Ersek  wrote:
> 
>> On 09/06/18 14:51, Igor Mammedov wrote:
>>> If a VM has VCPUs plugged sparsely (for example a VM started with
>>> 3 VCPUs (cpu0, cpu1 and cpu2) and then cpu1 was hot-unplugged so
>>> only cpu0 and cpu2 are present), QGA will raise an error
>>>   error: internal error: unable to execute QEMU agent command 
>>> 'guest-get-vcpus':
>>>   open("/sys/devices/system/cpu/cpu1/"): No such file or directory
>>> when
>>>   virsh vcpucount FOO --guest
>>> is executed.
>>> Fix it by ignoring non present CPUs when fetching CPUs status from sysfs.
>>>
>>> Signed-off-by: Igor Mammedov 
>>> ---
>>> v2:
>>>   do not create CPU entry if cpu isn't present
>>>   (Laszlo Ersek )
>>> ---
>>>  qga/commands-posix.c | 115 
>>> ++-
>>>  1 file changed, 59 insertions(+), 56 deletions(-)
>>>
>>> diff --git a/qga/commands-posix.c b/qga/commands-posix.c
>>> index 37e8a2d..42d30f0 100644
>>> --- a/qga/commands-posix.c
>>> +++ b/qga/commands-posix.c
>>> @@ -2035,61 +2035,56 @@ static long sysconf_exact(int name, const char 
>>> *name_str, Error **errp)
>>>   * Written members remain unmodified on error.
>>>   */
>>>  static void transfer_vcpu(GuestLogicalProcessor *vcpu, bool sys2vcpu,
>>> -  Error **errp)
>>> +  char *dirpath, Error **errp)
>>>  {
>>> -char *dirpath;
>>> +int fd;
>>> +int res;
>>>  int dirfd;
>>> +static const char fn[] = "online";
>>>
>>> -dirpath = g_strdup_printf("/sys/devices/system/cpu/cpu%" PRId64 "/",
>>> -  vcpu->logical_id);
>>>  dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
>>>  if (dirfd == -1) {
>>>  error_setg_errno(errp, errno, "open(\"%s\")", dirpath);
>>> -} else {
>>> -static const char fn[] = "online";
>>> -int fd;
>>> -int res;
>>> -
>>> -fd = openat(dirfd, fn, sys2vcpu ? O_RDONLY : O_RDWR);
>>> -if (fd == -1) {
>>> -if (errno != ENOENT) {
>>> -error_setg_errno(errp, errno, "open(\"%s/%s\")", dirpath, 
>>> fn);
>>> -} else if (sys2vcpu) {
>>> -vcpu->online = true;
>>> -vcpu->can_offline = false;
>>> -} else if (!vcpu->online) {
>>> -error_setg(errp, "logical processor #%" PRId64 " can't be "
>>> -   "offlined", vcpu->logical_id);
>>> -} /* otherwise pretend successful re-onlining */
>>> -} else {
>>> -unsigned char status;
>>> -
>>> -res = pread(fd, &status, 1, 0);
>>> -if (res == -1) {
>>> -error_setg_errno(errp, errno, "pread(\"%s/%s\")", dirpath, 
>>> fn);
>>> -} else if (res == 0) {
>>> -error_setg(errp, "pread(\"%s/%s\"): unexpected EOF", 
>>> dirpath,
>>> -   fn);
>>> -} else if (sys2vcpu) {
>>> -vcpu->online = (status != '0');
>>> -vcpu->can_offline = true;
>>> -} else if (vcpu->online != (status != '0')) {
>>> -status = '0' + vcpu->online;
>>> -if (pwrite(fd, &status, 1, 0) == -1) {
>>> -error_setg_errno(errp, errno, "pwrite(\"%s/%s\")", 
>>> dirpath,
>>> - fn);
>>> -}
>>> -} /* otherwise pretend successful re-(on|off)-lining */
>>> +return;
>>> +}
>>>
>>> -res = close(fd);
>>> -g_assert(res == 0);
>>> -}
>>> +fd = openat(dirfd, fn, sys2vcpu ? O_RDONLY : O_RDWR);
>>> +if (fd == -1) {
>>> +if (errno != ENOENT) {
>>> +error_setg_errno(errp, errno, "open(\"%s/%s\")", dirpath

Re: [Qemu-devel] [PATCH v10 6/6] tpm: add ACPI memory clear interface

2018-09-11 Thread Laszlo Ersek
+Alex, due to mention of 21e00fa55f3fd

On 09/10/18 15:03, Marc-André Lureau wrote:
> Hi
> 
> On Mon, Sep 10, 2018 at 2:44 PM Dr. David Alan Gilbert
>  wrote:
>> (I didn't know about guest_phys_block* and would have probably just used
>> qemu_ram_foreach_block )
>>
> 
> guest_phys_block*() seems to fit, as it lists only the blocks actually
> used, and already skip the device RAM.
> 
> Laszlo, you wrote the functions
> (https://git.qemu.org/?p=qemu.git;a=commit;h=c5d7f60f0614250bd925071e25220ce5958f75d0),
> do you think it's appropriate to list the memory to clear, or we
> should rather use qemu_ram_foreach_block() ?

Originally, I would have said, "use either, doesn't matter". Namely,
when I introduced the guest_phys_block*() functions, the original
purpose was not related to RAM *contents*, but to RAM *addresses*
(GPAs). This is evident if you look at the direct child commit of
c5d7f60f0614, namely 56c4bfb3f07f, which put GuestPhysBlockList to use.
And, for your use case (= wiping RAM), GPAs don't matter, only contents
matter.

However, with the commits I mentioned previously, namely e4dc3f5909ab9
and 21e00fa55f3fd, we now filter out some RAM blocks from the dumping
based on contents / backing as well. I think? So I believe we should
honor that for the wiping too. I guess I'd (vaguely) suggest using
guest_phys_block*().

(And then, as Dave suggests, maybe extend the filter to consider pmem
too, separately.)
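
The block-by-block wiping discussed here can be sketched like this. The block list is simulated with a hypothetical struct; the real code would walk GuestPhysBlockList (or qemu_ram_foreach_block) and the skip flag stands in for whatever filtering (device RAM, pmem) ends up being honored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, self-contained model of wiping guest RAM per block. */
typedef struct Block {
    unsigned char *host_addr;   /* host mapping of the guest RAM block */
    size_t size;
    int skip;                   /* filtered out (e.g. pmem / device RAM) */
} Block;

static void wipe_blocks(Block *blocks, int n)
{
    for (int i = 0; i < n; i++) {
        if (blocks[i].skip) {
            continue;           /* honor dump-time filtering for wiping too */
        }
        memset(blocks[i].host_addr, 0, blocks[i].size);
    }
}
```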

Laszlo



Re: [Qemu-devel] [PATCH] fw_cfg_mem: add read memory region callback

2018-09-12 Thread Laszlo Ersek
On 09/12/18 10:02, Li Qiang wrote:
> Hi,
> 
> Marc-André Lureau  wrote on Wed, Sep 12, 2018 at 3:16 PM:
> 
>> Hi
>>
>> On Wed, Sep 12, 2018 at 9:22 AM Li Qiang  wrote:
>>>
>>> The write/read should be paired, this can avoid the
>>> NULL-deref while the guest reads the fw_cfg port.
>>>
>>> Signed-off-by: Li Qiang 
>>
>> Do you have a reproducer and/or a backtrace?
>> memory_region_dispatch_write() checks if ops->write != NULL.
>>
> 
> As far as I can see, fw_cfg_mem is not used on x86, only on ARM.
> My impression was that this NULL read would be an issue, so I
> omitted the 'read' field in fw_cfg_comb_mem_ops (which is used on x86)
> to emulate the issue.
> 
> When using gdb, I got the following backtrack.
> 
> Thread 5 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffca15b700 (LWP 7637)]
> 0x in ?? ()
> (gdb) bt
> #0  0x in ?? ()
> #1  0x55872492 in memory_region_oldmmio_read_accessor
> (mr=0x567fa870, addr=1, value=0x7fffca158510, size=1, shift=0,
> mask=255, attrs=...) at /home/liqiang02/qemu-devel/qemu/memory.c:409
> #2  0x5586f2dd in access_with_adjusted_size (addr=addr@entry=1,
> value=value@entry=0x7fffca158510, size=size@entry=1,
> access_size_min=, access_size_max=,
> access_fn=0x55872440 ,
> mr=0x567fa870, attrs=...) at
> /home/liqiang02/qemu-devel/qemu/memory.c:593
> #3  0x55873e90 in memory_region_dispatch_read1 (attrs=..., size=1,
> pval=0x7fffca158510, addr=1, mr=0x567fa870) at
> /home/liqiang02/qemu-devel/qemu/memory.c:1404
> #4  memory_region_dispatch_read (mr=mr@entry=0x567fa870, addr=1,
> pval=pval@entry=0x7fffca158510, size=1, attrs=attrs@entry=...) at
> /home/liqiang02/qemu-devel/qemu/memory.c:1423
> #5  0x55821e42 in flatview_read_continue (fv=fv@entry=0x7fffbc03f370,
> addr=addr@entry=1297, attrs=..., buf=, buf@entry=0x77fee000
> "", len=len@entry=1, addr1=, l=,
> mr=0x567fa870)
> at /home/liqiang02/qemu-devel/qemu/exec.c:3293
> #6  0x55822006 in flatview_read (fv=0x7fffbc03f370, addr=1297,
> attrs=..., buf=0x77fee000 "", len=1) at
> /home/liqiang02/qemu-devel/qemu/exec.c:3331
> #7  0x5582211f in address_space_read_full (as=,
> addr=addr@entry=1297, attrs=..., buf=buf@entry=0x77fee000 "",
> len=len@entry=1) at /home/liqiang02/qemu-devel/qemu/exec.c:3344
> #8  0x5582225a in address_space_rw (as=,
> addr=addr@entry=1297, attrs=..., attrs@entry=..., buf=buf@entry=0x77fee000
> "", len=len@entry=1, is_write=is_write@entry=false)
> at /home/liqiang02/qemu-devel/qemu/exec.c:3374
> #9  0x55886239 in kvm_handle_io (count=1, size=1,
> direction=, data=, attrs=..., port=1297) at
> /home/liqiang02/qemu-devel/qemu/accel/kvm/kvm-all.c:1731
> #10 kvm_cpu_exec (cpu=cpu@entry=0x566e9990) at
> /home/liqiang02/qemu-devel/qemu/accel/kvm/kvm-all.c:1971
> #11 0x5585d3de in qemu_kvm_cpu_thread_fn (arg=0x566e9990) at
> /home/liqiang02/qemu-devel/qemu/cpus.c:1257
> #12 0x7fffdbd58494 in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #13 0x7fffdba9aacf in clone () from /lib/x86_64-linux-gnu/libc.so.6
> 
> So I sent out this patch.
> 
> But when I don't use gdb, there is no segmentation fault.
> 
> So this may not be an issue.
> 
> If anyone has an ARM environment, they can try this; the PoC is easy: just
> read the 0x510 port as a word.

I don't understand your description of how you managed to trigger this
SIGSEGV.

FWIW, looking at the codebase, there's a good number of static
MemoryRegionOps structures for which the "read_with_attrs" and "read"
members are default-initialized to NULL. It seems unlikely they are all
wrong.

- exec.c:notdirty_mem_ops, readonly_mem_ops
- hw/misc/debugexit.c:   debug_exit_ops
- hw/misc/hyperv_testdev.c:  synic_test_sint_ops
- hw/misc/pc-testdev.c:  test_irq_ops, test_flush_ops
- hw/pci-host/designware.c:  designware_pci_host_msi_ops
- hw/rdma/vmw/pvrdma_main.c: uar_ops
- hw/sparc64/sun4u.c:power_mem_ops

Laszlo



Re: [Qemu-devel] [PATCH] memory region: check the old.mmio.read status

2018-09-12 Thread Laszlo Ersek
On 09/12/18 14:54, Peter Maydell wrote:
> On 12 September 2018 at 13:32, Li Qiang  wrote:
>> To avoid NULL-deref for the devices without read callbacks
>>
>> Signed-off-by: Li Qiang 
>> ---
>>  memory.c | 4 
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/memory.c b/memory.c
>> index 9b73892768..48d025426b 100644
>> --- a/memory.c
>> +++ b/memory.c
>> @@ -406,6 +406,10 @@ static MemTxResult 
>> memory_region_oldmmio_read_accessor(MemoryRegion *mr,
>>  {
>>  uint64_t tmp;
>>
>> +if (!mr->ops->old_mmio.read[ctz32(size)]) {
>> +return MEMTX_DECODE_ERROR;
>> +}
>> +
>>  tmp = mr->ops->old_mmio.read[ctz32(size)](mr->opaque, addr);
>>  if (mr->subpage) {
>>  trace_memory_region_subpage_read(get_cpu_index(), mr, addr, tmp, 
>> size);
>> --
>> 2.11.0
>>
> 
> There's patches on-list which drop the old_mmio field from the MemoryRegion
> struct entirely, so I think this patch as it stands is obsolete.
> 
> Currently our semantics are "you must provide both read and write, even
> if one of them just always returns 0 / does nothing / returns an error".

That's new to me. Has this always been the case? There are several
static MemoryRegionOps structures that don't conform. (See the end of my
other email:
<http://mid.mail-archive.com/84da6f02-1f60-4bc7-92da-6a7f74deded3@redhat.com>.)
Beyond the one that Li Qiang reported directly ("fw_cfg_ctl_mem_read").

Are all of those ops guest-triggerable QEMU crashers?

> We could probably reasonably assert this at the point when the
> MemoryRegionOps is registered.

Apparently, we should have...

Thanks,
Laszlo



Re: [Qemu-devel] [PATCH] memory region: check the old.mmio.read status

2018-09-12 Thread Laszlo Ersek
On 09/12/18 16:28, Li Qiang wrote:
> Peter Maydell  wrote on Wed, Sep 12, 2018 at 8:55 PM:
> 
>> On 12 September 2018 at 13:32, Li Qiang  wrote:
>>> To avoid NULL-deref for the devices without read callbacks
>>>
>>> Signed-off-by: Li Qiang 
>>> ---
>>>  memory.c | 4 
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/memory.c b/memory.c
>>> index 9b73892768..48d025426b 100644
>>> --- a/memory.c
>>> +++ b/memory.c
>>> @@ -406,6 +406,10 @@ static MemTxResult
>> memory_region_oldmmio_read_accessor(MemoryRegion *mr,
>>>  {
>>>  uint64_t tmp;
>>>
>>> +if (!mr->ops->old_mmio.read[ctz32(size)]) {
>>> +return MEMTX_DECODE_ERROR;
>>> +}
>>> +
>>>  tmp = mr->ops->old_mmio.read[ctz32(size)](mr->opaque, addr);
>>>  if (mr->subpage) {
>>>  trace_memory_region_subpage_read(get_cpu_index(), mr, addr,
>> tmp, size);
>>> --
>>> 2.11.0
>>>
>>
>> There's patches on-list which drop the old_mmio field from the MemoryRegion
>> struct entirely, so I think this patch as it stands is obsolete.
>>
>> Currently our semantics are "you must provide both read and write, even
>> if one of them just always returns 0 / does nothing / returns an error".
>> We could probably reasonably assert this at the point when the
>> MemoryRegionOps is registered.
>>
> 
> This patch was sent as a result of this thread:
> -->https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg01332.html
> 
> So I think I should send a patch set to add all the missing read
> functions, as Laszlo Ersek pointed out in the above thread discussion,
> right?

Can we introduce a central utility function at least (for a no-op read
returning 0), and initialize the read callbacks in question with the
address of that function?
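
Such a utility could look like the sketch below. The names are hypothetical, not QEMU's actual API: a shared no-op read callback returning 0, plus a dispatcher guard that falls back to it instead of calling through a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read-callback type, mirroring the shape of a
 * MemoryRegionOps read handler. */
typedef uint64_t (*ReadFn)(void *opaque, uint64_t addr, unsigned size);

/* Central default: reads succeed and yield zero. */
static uint64_t noop_read(void *opaque, uint64_t addr, unsigned size)
{
    (void)opaque; (void)addr; (void)size;
    return 0;
}

/* Dispatcher guard: devices that registered no read callback get
 * noop_read instead of a NULL-deref. */
static uint64_t dispatch_read(ReadFn fn, void *opaque, uint64_t addr,
                              unsigned size)
{
    return (fn ? fn : noop_read)(opaque, addr, size);
}
```

Alternatively, the registration path could assert that both callbacks are non-NULL, as Peter suggests, and the ops listed earlier would be initialized with noop_read explicitly.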

Thanks,
Laszlo



Re: [Qemu-devel] [PATCH v10 6/6] tpm: add ACPI memory clear interface

2018-09-20 Thread Laszlo Ersek
On 09/20/18 15:20, Igor Mammedov wrote:
> On Thu, 20 Sep 2018 10:20:49 +0100
> "Dr. David Alan Gilbert"  wrote:
> 
>> * Eduardo Habkost (ehabk...@redhat.com) wrote:
>>> On Wed, Sep 19, 2018 at 08:15:25PM +0100, Dr. David Alan Gilbert wrote:  
>>>> * Igor Mammedov (imamm...@redhat.com) wrote:  
>>>>> On Wed, 19 Sep 2018 13:14:05 +0100
>>>>> "Dr. David Alan Gilbert"  wrote:
>>>>>   
>>>>>> * Igor Mammedov (imamm...@redhat.com) wrote:  
>>>>>>> On Wed, 19 Sep 2018 11:58:22 +0100
>>>>>>> "Dr. David Alan Gilbert"  wrote:
>>>>>>> 
>>>>>>>> * Marc-André Lureau (marcandre.lur...@gmail.com) wrote:
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> On Tue, Sep 18, 2018 at 7:49 PM Dr. David Alan Gilbert
>>>>>>>>>  wrote:  
>>>>>>>>>>
>>>>>>>>>> * Marc-André Lureau (marcandre.lur...@gmail.com) wrote:  
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 11, 2018 at 6:19 PM Laszlo Ersek  
>>>>>>>>>>> wrote:  
>>>>>>>>>>>>
>>>>>>>>>>>> +Alex, due to mention of 21e00fa55f3fd
>>>>>>>>>>>>
>>>>>>>>>>>> On 09/10/18 15:03, Marc-André Lureau wrote:  
>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 10, 2018 at 2:44 PM Dr. David Alan Gilbert
>>>>>>>>>>>>>  wrote:  
>>>>>>>>>>>>>> (I didn't know about guest_phys_block* and would have probably 
>>>>>>>>>>>>>> just used
>>>>>>>>>>>>>> qemu_ram_foreach_block )
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>
>>>>>>>>>>>>> guest_phys_block*() seems to fit, as it lists only the blocks 
>>>>>>>>>>>>> actually
>>>>>>>>>>>>> used, and already skip the device RAM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Laszlo, you wrote the functions
>>>>>>>>>>>>> (https://git.qemu.org/?p=qemu.git;a=commit;h=c5d7f60f0614250bd925071e25220ce5958f75d0),
>>>>>>>>>>>>> do you think it's appropriate to list the memory to clear, or we
>>>>>>>>>>>>> should rather use qemu_ram_foreach_block() ?  
>>>>>>>>>>>>
>>>>>>>>>>>> Originally, I would have said, "use either, doesn't matter". 
>>>>>>>>>>>> Namely,
>>>>>>>>>>>> when I introduced the guest_phys_block*() functions, the original
>>>>>>>>>>>> purpose was not related to RAM *contents*, but to RAM *addresses*
>>>>>>>>>>>> (GPAs). This is evident if you look at the direct child commit of
>>>>>>>>>>>> c5d7f60f0614, namely 56c4bfb3f07f, which put GuestPhysBlockList to 
>>>>>>>>>>>> use.
>>>>>>>>>>>> And, for your use case (= wiping RAM), GPAs don't matter, only 
>>>>>>>>>>>> contents
>>>>>>>>>>>> matter.
>>>>>>>>>>>>
>>>>>>>>>>>> However, with the commits I mentioned previously, namely 
>>>>>>>>>>>> e4dc3f5909ab9
>>>>>>>>>>>> and 21e00fa55f3fd, we now filter out some RAM blocks from the 
>>>>>>>>>>>> dumping
>>>>>>>>>>>> based on contents / backing as well. I think? So I believe we 
>>>>>>>>>>>> should
>>>>>>>>>>>> honor that for the wiping too. I guess I'd (vaguely) suggest using
>>>>>>>>>>>> guest_phys_block*().
>>>>>>>>>>>>
>>>>>>>>>>>> (And the

[Qemu-devel] 64-bit MMIO aperture expansion

2018-09-20 Thread Laszlo Ersek
Hi Marcel,

this email should actually be an RFC patch. But RFC patches tend to turn
into real PATCHes (if the submitter is lucky, that is), and I can't
really promise sending multiple versions of a PATCH at this time. So
please consider this a "maybe bug report".

In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI
hole", 2017-11-16) we added logic so that QEMU expands the 64-bit PCI
hole (for hotplug purposes), if (a) the firmware doesn't "configure" one
(via programming individual BARs with 64-bit addresses), or (b) the
firmware's programming results in an aperture smaller than we'd like
(32GB on Q35).

We made sure that the aperture required by the firmware's programming
would never be shrunk or otherwise truncated by QEMU, so that's fine.
However, the expansion doesn't work as "widely" in all cases as it
should.

Consider the following three functions, at current master (= commit
19b599f7664b):

[hw/i386/pc.c]

> /*
>  * The 64bit pci hole starts after "above 4G RAM" and
>  * potentially the space reserved for memory hotplug.
>  */
> uint64_t pc_pci_hole64_start(void)
> {
> PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
> PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> MachineState *ms = MACHINE(pcms);
> uint64_t hole64_start = 0;
>
> if (pcmc->has_reserved_memory && ms->device_memory->base) {
> hole64_start = ms->device_memory->base;
> if (!pcmc->broken_reserved_end) {
> hole64_start += memory_region_size(&ms->device_memory->mr);
> }
> } else {
> hole64_start = 0x100000000ULL + pcms->above_4g_mem_size;
> }
>
> return ROUND_UP(hole64_start, 1 * GiB);
> }

[hw/pci-host/q35.c]

> /*
>  * The 64bit PCI hole start is set by the Guest firmware
>  * as the address of the first 64bit PCI MEM resource.
>  * If no PCI device has resources on the 64bit area,
>  * the 64bit PCI hole will start after "over 4G RAM" and the
>  * reserved space for memory hotplug if any.
>  */
> static void q35_host_get_pci_hole64_start(Object *obj, Visitor *v,
>   const char *name, void *opaque,
>   Error **errp)
> {
> PCIHostState *h = PCI_HOST_BRIDGE(obj);
> Q35PCIHost *s = Q35_HOST_DEVICE(obj);
> Range w64;
> uint64_t value;
>
> pci_bus_get_w64_range(h->bus, &w64);
> value = range_is_empty(&w64) ? 0 : range_lob(&w64);
> if (!value && s->pci_hole64_fix) {
> value = pc_pci_hole64_start();
> }
> visit_type_uint64(v, name, &value, errp);
> }
>
> /*
>  * The 64bit PCI hole end is set by the Guest firmware
>  * as the address of the last 64bit PCI MEM resource.
>  * Then it is expanded to the PCI_HOST_PROP_PCI_HOLE64_SIZE
>  * that can be configured by the user.
>  */
> static void q35_host_get_pci_hole64_end(Object *obj, Visitor *v,
> const char *name, void *opaque,
> Error **errp)
> {
> PCIHostState *h = PCI_HOST_BRIDGE(obj);
> Q35PCIHost *s = Q35_HOST_DEVICE(obj);
> uint64_t hole64_start = pc_pci_hole64_start();
> Range w64;
> uint64_t value, hole64_end;
>
> pci_bus_get_w64_range(h->bus, &w64);
> value = range_is_empty(&w64) ? 0 : range_upb(&w64) + 1;
> hole64_end = ROUND_UP(hole64_start + s->mch.pci_hole64_size, 1ULL << 30);
> if (s->pci_hole64_fix && value < hole64_end) {
> value = hole64_end;
> }
> visit_type_uint64(v, name, &value, errp);
> }
>

Now consider the following scenario:

- the firmware programs some BARs with 64-bit addresses such that the
  aperture that we deduce starts at 32GB,

- the guest has 4GB of RAM, and no DIMM hotplug range.

Consequences:

- Because the "32-bit RAM split" for Q35 is at 2GB, the
  pc_pci_hole64_start() function will return 6GB.

- The q35_host_get_pci_hole64_start() function will return 32GB. (It
  will not fall back to pc_pci_hole64_start() -- correctly -- because
  the firmware has programmed some BARs with 64-bit addresses.)

- The q35_host_get_pci_hole64_end() function *intends* to return 64GB,
  because -- let's say -- the guest assigned BARs covering the
  32GB..34GB range, which is 2GB in size, and we *intend* to round that
  size up to 32GB, so that 30GB would be left for hotplug purposes. (This is
  the original intent of commit 9fa99d2519cb.)

- However, because we initialize "hole64_start" from
  pc_pci_hole64_start(), and not from q35_host_get_pci_hole64_start(),
  we add "mch.pci_hole64_size" (32GB by default) to 6GB (the end of
  RAM), and not to 32GB (the aperture base deduced from the firmware's
  programming). As a result, we'll extend the aperture end address only
  to 38GB, and not to 64GB.

My suggestion is simply to initialize "hole64_start" from
q35_host_get_pci_hole64_start(), in the q35_host_get_pci_hole64_end()
function. If the firmware doesn't program 64-bit addresses, then this
change is a no-op -- q35_host_get_

Re: [Qemu-devel] 64-bit MMIO aperture expansion

2018-09-21 Thread Laszlo Ersek
On 09/21/18 17:01, Marcel Apfelbaum wrote:
> On 09/20/2018 05:49 PM, Laszlo Ersek wrote:

> I had to read this mail a few times...

Sorry :)


>> Now consider the following scenario:
>>
>> - the firmware programs some BARs with 64-bit addresses such that the
>>aperture that we deduce starts at 32GB,
>>
>> - the guest has 4GB of RAM, and no DIMM hotplug range.
>>
>> Consequences:
>>
>> - Because the "32-bit RAM split" for Q35 is at 2GB, the
>>pc_pci_hole64_start() function will return 6GB.
>>
>> - The q35_host_get_pci_hole64_start() function will return 32GB. (It
>>will not fall back to pc_pci_hole64_start() -- correctly --
>>because the firmware has programmed some BARs with 64-bit
>>addresses.)
>>
>> - The q35_host_get_pci_hole64_end() function *intends* to return
>>64GB, because -- let's say -- the guest assigned BARs covering the
>>32GB..34GB range, which is 2GB in size, and we *intend* to round
>>that size up to 32GB, so that 30GB be left for hotplug purposes.
>>(This is the original intent of commit 9fa99d2519cb.)
>> - However, because we initialize "hole64_start" from
>>pc_pci_hole64_start(), and not from
>>q35_host_get_pci_hole64_start(), we add "mch.pci_hole64_size"
>>(32GB by default) to 6GB (the end of RAM), and not to 32GB (the
>>aperture base deduced from the firmware's programming). As a
>>result, we'll extend the aperture end address only to 38GB, and
>>not to 64GB.
>
> Right, there is no reason to use pc_pci_hole64_start, it looks
> like a plain bug. We diverged from pc and the fact that
> q35_host_get_pci_hole64_start uses it is only an implementation
> detail.

Small correction (not affecting the main point):

If you compare q35_host_get_pci_hole64_end() and
i440fx_pcihost_get_pci_hole64_end(), you see that they do the exact same
thing, and pc_pci_hole64_start() is a common helper function that they
both call.

This is also matched by the files that define these functions:
- i440fx_pcihost_get_pci_hole64_end(): hw/pci-host/piix.c
- q35_host_get_pci_hole64_end():   hw/pci-host/q35.c
- pc_pci_hole64_start():   hw/i386/pc.c

In that sense, we didn't "diverge" from PC. Because, both i440fx and q35
received the exact same logic in 9fa99d2519cb. They both call the common
pc_pci_hole64_start() helper function as an internal / implementation
detail. (And both of them should be fixed.)

Current function call chains:

  i440fx_pcihost_get_pci_hole64_start()
    pc_pci_hole64_start()                  [good call]
  i440fx_pcihost_get_pci_hole64_end()
    pc_pci_hole64_start()                  [bug]

  q35_host_get_pci_hole64_start()
    pc_pci_hole64_start()                  [good call]
  q35_host_get_pci_hole64_end()
    pc_pci_hole64_start()                  [bug]

Proposed call chains:

  i440fx_pcihost_get_pci_hole64_start()
    pc_pci_hole64_start()                  [unchanged call]
  i440fx_pcihost_get_pci_hole64_end()
    i440fx_pcihost_get_pci_hole64_start()  [corrected call]
      pc_pci_hole64_start()                [unchanged call]

  q35_host_get_pci_hole64_start()
    pc_pci_hole64_start()                  [unchanged call]
  q35_host_get_pci_hole64_end()
    q35_host_get_pci_hole64_start()        [corrected call]
      pc_pci_hole64_start()                [unchanged call]


>> The same would apply to i440fx too.
>
> I am lost here. The q35 PCI 64bit hole computation issue starts from
> the misuse of the PC counterpart functions.

I disagree. If q35 called an i440fx-specific function, that would indeed
be mis-use. However, in this case, the callee -- that is,
pc_pci_hole64_start() -- is not an i440fx-specific function. It is a
common helper for both boards. Both boards can rightfully call it.

Instead, the issue here is that *both*
i440fx_pcihost_get_pci_hole64_end() and q35_host_get_pci_hole64_end()
need a "hole64_start" value that accounts for the BAR addresses that
were programmed by the firmware. pc_pci_hole64_start() simply doesn't do
that.

> What is the problem with the PC?

i440fx_pcihost_get_pci_hole64_end() currently calls
pc_pci_hole64_start() directly, so the "hole64_start" value does not
honor the BAR addresses set by the firmware.

Thanks,
Laszlo



Re: [Qemu-devel] 64-bit MMIO aperture expansion

2018-09-21 Thread Laszlo Ersek
On 09/21/18 19:20, Michael S. Tsirkin wrote:
> On Fri, Sep 21, 2018 at 06:01:30PM +0300, Marcel Apfelbaum wrote:
>> On 09/20/2018 05:49 PM, Laszlo Ersek wrote:

>>> Now, there's another complication, obviously -- machine type compat. In
>>> commit 9fa99d2519cb, we added the "pci_hole64_fix" compat property. I
>>> assume the additional fix I'm proposing requires another compat
>>> property?
>>
>> We have to, it is a guest-visible change. I really don't like these compat
>> properties, but I don't see a way around it.
> 
> Well does it only affect ACPI? Or other stuff? ACPI changes
> are mostly safe without need for compat things.

My understanding is that it affects ACPI only:

(1) q35_host_get_pci_hole64_end() is only referenced in the code when it
is set as a getter for the PCI_HOST_PROP_PCI_HOLE64_END property, in
q35_host_initfn() [hw/pci-host/q35.c].

(2) i440fx_pcihost_get_pci_hole64_end() is only referenced in the code
when it is set as a getter for the same PCI_HOST_PROP_PCI_HOLE64_END
property, in i440fx_pcihost_initfn() [hw/pci-host/piix.c].

(3) The PCI_HOST_PROP_PCI_HOLE64_END property is only fetched in
acpi_get_pci_holes() [hw/i386/acpi-build.c].

(4) acpi_get_pci_holes() is only called in acpi_build()
[hw/i386/acpi-build.c].

(5) The resultant "pci_hole64" structure is passed to build_dsdt() only.

Thanks
Laszlo



Re: [Qemu-devel] [RFC] Virtio RNG: Consider changing the default entropy source to /dev/urandom?

2018-09-21 Thread Laszlo Ersek
On 09/21/18 17:43, Kashyap Chamarthy wrote:
> Hi folks,
> 
> As Markus pointed out in this 'qemu-devel' thread[1],
> backends/rng-random.c uses '/dev/random' in TYPE_RNG_RANDOM's
> instance_init() method:
> 
> [...]
> static void rng_random_init(Object *obj)
> {
> RngRandom *s = RNG_RANDOM(obj);
> 
> object_property_add_str(obj, "filename",
> rng_random_get_filename,
> rng_random_set_filename,
> NULL);
> 
> s->filename = g_strdup("/dev/random");
> s->fd = -1;
> }
> [...]
> 
> And I've looked at hw/virtio/virtio-rng.c:
> 
> [...]
> static void virtio_rng_device_realize(DeviceState *dev, Error **errp)
> {
> [...]
> 
> if (vrng->conf.rng == NULL) {
> vrng->conf.default_backend = 
> RNG_RANDOM(object_new(TYPE_RNG_RANDOM));
> [...]
> 
> From the above, I'm assuming QEMU uses `/dev/random` as the _default_
> entropy source for a 'virtio-rng-pci' device.  If my assumption is
> correct, any reason why not to change the default entropy source for
> 'virtio-rng-pci' devices to `/dev/urandom` (which is the preferred[2]
> source of entropy)?
> 
> And I understand (thanks: Eric Blake for correcting my confusion) that
> there are two cases to distinguish:
> 
> (a) When QEMU needs a random number, the entropy source it chooses.
> IIUC, the answer is: QEMU defers to GnuTLS by default, which uses
> getrandom(2), which in turn uses '/dev/urandom' as its entropy
> source; if getrandom(2) isn't available, GnuTLS uses `/dev/urandom`
> anyway.  (Thanks: Nikos for clarifying this.)
> 
> If QEMU is built with GnuTLS _disabled_, which I'm not sure if any
> Linux distribution does, then it uses libgcrypt, which in turn uses
> the undesired and legacy `/dev/random` as the default entropy
> source.
> 
> (b) When QEMU exposes a Virtio RNG device to the guest, that device
> needs a source of entropy, and IIUC, that source needs to be
> "non-blocking" (i.e. `/dev/urandom`).  However, currently QEMU
> defaults to the problematic `/dev/random`.
> 
> I'd like to get some more clarity on case (b).  
> 
> 
> [1] https://lists.nongnu.org/archive/html/qemu-devel/2018-06/msg08335.html
> -- RNG: Any reason QEMU doesn't default to `/dev/urandom`
> 
> [2] http://man7.org/linux/man-pages/man4/urandom.4.html
> 
> 

The libvirt domain documentation also says,

"When no file name is specified, the hypervisor default is used. For
QEMU, the default is /dev/random. However, the recommended source of
entropy is /dev/urandom (as it doesn't have the limitations of
/dev/random)."
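
For reference, the entropy source can already be selected explicitly on
the QEMU command line, rather than relying on the default. A sketch of
such an invocation (the id "rng0" is arbitrary, and the remaining guest
options are elided):

```shell
# Explicitly back the virtio-rng device with /dev/urandom instead of
# the default /dev/random:
qemu-system-x86_64 \
  -object rng-random,id=rng0,filename=/dev/urandom \
  -device virtio-rng-pci,rng=rng0 \
  ...
```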

Thanks
Laszlo



[Qemu-devel] [PATCH 1/2] hw/pci-host/x86: extract get_pci_hole64_start_value() helpers

2018-09-24 Thread Laszlo Ersek
Expose the calculated "hole64 start" GPAs as plain uint64_t values,
extracting the internals of the current property getters.

This patch doesn't change behavior.

Cc: "Michael S. Tsirkin" 
Cc: Alex Williamson 
Cc: Marcel Apfelbaum 
Signed-off-by: Laszlo Ersek 
---
 hw/pci-host/piix.c | 15 +++
 hw/pci-host/q35.c  | 15 +++
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/hw/pci-host/piix.c b/hw/pci-host/piix.c
index 0e608347c1f0..0df91e002076 100644
--- a/hw/pci-host/piix.c
+++ b/hw/pci-host/piix.c
@@ -249,9 +249,7 @@ static void i440fx_pcihost_get_pci_hole_end(Object *obj, Visitor *v,
  * the 64bit PCI hole will start after "over 4G RAM" and the
  * reserved space for memory hotplug if any.
  */
-static void i440fx_pcihost_get_pci_hole64_start(Object *obj, Visitor *v,
-const char *name,
-void *opaque, Error **errp)
+static uint64_t i440fx_pcihost_get_pci_hole64_start_value(Object *obj)
 {
 PCIHostState *h = PCI_HOST_BRIDGE(obj);
 I440FXState *s = I440FX_PCI_HOST_BRIDGE(obj);
@@ -263,7 +261,16 @@ static void i440fx_pcihost_get_pci_hole64_start(Object *obj, Visitor *v,
 if (!value && s->pci_hole64_fix) {
 value = pc_pci_hole64_start();
 }
-visit_type_uint64(v, name, &value, errp);
+return value;
+}
+
+static void i440fx_pcihost_get_pci_hole64_start(Object *obj, Visitor *v,
+const char *name,
+void *opaque, Error **errp)
+{
+uint64_t hole64_start = i440fx_pcihost_get_pci_hole64_start_value(obj);
+
+visit_type_uint64(v, name, &hole64_start, errp);
 }
 
 /*
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 02f95765880a..8acf942b5e65 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -109,9 +109,7 @@ static void q35_host_get_pci_hole_end(Object *obj, Visitor *v,
  * the 64bit PCI hole will start after "over 4G RAM" and the
  * reserved space for memory hotplug if any.
  */
-static void q35_host_get_pci_hole64_start(Object *obj, Visitor *v,
-  const char *name, void *opaque,
-  Error **errp)
+static uint64_t q35_host_get_pci_hole64_start_value(Object *obj)
 {
 PCIHostState *h = PCI_HOST_BRIDGE(obj);
 Q35PCIHost *s = Q35_HOST_DEVICE(obj);
@@ -123,7 +121,16 @@ static void q35_host_get_pci_hole64_start(Object *obj, Visitor *v,
 if (!value && s->pci_hole64_fix) {
 value = pc_pci_hole64_start();
 }
-visit_type_uint64(v, name, &value, errp);
+return value;
+}
+
+static void q35_host_get_pci_hole64_start(Object *obj, Visitor *v,
+  const char *name, void *opaque,
+  Error **errp)
+{
+uint64_t hole64_start = q35_host_get_pci_hole64_start_value(obj);
+
+visit_type_uint64(v, name, &hole64_start, errp);
 }
 
 /*
-- 
2.14.1.3.gb7cf6e02401b





[Qemu-devel] [PATCH 2/2] hw/pci-host/x86: extend the 64-bit PCI hole relative to the fw-assigned base

2018-09-24 Thread Laszlo Ersek
In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI
hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in
the ACPI DSDT that would be at least as large as the new "pci-hole64-size"
property (2GB on i440fx, 32GB on q35). The goal was to offer "enough"
64-bit MMIO aperture to the guest OS for hotplug purposes.

In that commit, we added or modified five functions:

- pc_pci_hole64_start(): shared between i440fx and q35. Provides a default
  64-bit base, which starts beyond the cold-plugged 64-bit RAM, and skips
  the DIMM hotplug area too (if any).

- i440fx_pcihost_get_pci_hole64_start(), q35_host_get_pci_hole64_start():
  board-specific 64-bit base property getters called abstractly by the
  ACPI generator. Both of these fall back to pc_pci_hole64_start() if the
  firmware didn't program any 64-bit hole (i.e. if the firmware didn't
  assign a 64-bit GPA to any MMIO BAR on any device). Otherwise, they
  honor the firmware's BAR assignments (i.e., they treat the lowest 64-bit
  GPA programmed by the firmware as the base address for the aperture).

- i440fx_pcihost_get_pci_hole64_end(), q35_host_get_pci_hole64_end():
  these intended to extend the aperture to our size recommendation,
  calculated relative to the base of the aperture.

Despite the original intent, i440fx_pcihost_get_pci_hole64_end() and
q35_host_get_pci_hole64_end() currently only extend the aperture relative
to the default base (pc_pci_hole64_start()), ignoring any programming done
by the firmware. This means that our size recommendation may not be met.
Fix it by honoring the firmware's address assignments.

The strange extension sizes were spotted by Alex, in the log of a guest
kernel running on top of OVMF (which does assign 64-bit GPAs to BARs).

This change only affects DSDT generation, therefore no new compat property
is being introduced. Also, because SeaBIOS never assigns 64-bit GPAs to
64-bit BARs, the patch makes no difference to SeaBIOS guests. (Which is in
turn why ACPI test data for the "bios-tables-test" need not be refreshed.)

Using an i440fx OVMF guest with 5GB RAM, an example _CRS change is:

> @@ -881,9 +881,9 @@
>  QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
> Cacheable, ReadWrite,
>  0x, // Granularity
>  0x0008, // Range Minimum
> -0x00080001C0FF, // Range Maximum
> +0x00087FFF, // Range Maximum
>  0x, // Translation Offset
> -0x0001C100, // Length
> +0x8000, // Length
>  ,, , AddressRangeMemory, TypeStatic)
>  })
>  Device (GPE0)

(On i440fx, the low RAM split is at 3GB, in this case. Therefore, with 5GB
guest RAM and no DIMM hotplug range, pc_pci_hole64_start() returns 4 +
(5-3) = 6 GB. Adding the 2GB extension to that yields 8GB, which is below
the firmware-programmed base of 32GB, before the patch. Therefore, before
the patch, the extension is ineffective. After the patch, we add the 2GB
extension to the firmware-programmed base, namely 32GB.)

Using a q35 OVMF guest with 5GB RAM, an example _CRS change is:

> @@ -3162,9 +3162,9 @@
>  QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
> Cacheable, ReadWrite,
>  0x, // Granularity
>  0x0008, // Range Minimum
> -0x0009BFFF, // Range Maximum
> +0x000F, // Range Maximum
>  0x, // Translation Offset
> -0x0001C000, // Length
> +0x0008, // Length
>  ,, , AddressRangeMemory, TypeStatic)
>  })
>  Device (GPE0)

(On Q35, the low RAM split is at 2GB. Therefore, with 5GB guest RAM and no
DIMM hotplug range, pc_pci_hole64_start() returns 4 + (5-2) = 7 GB. Adding
the 32GB extension to that yields 39GB (0x_0009_BFFF_ + 1), before
the patch. After the patch, we add the 32GB extension to the
firmware-programmed base, namely 32GB.)

Cc: "Michael S. Tsirkin" 
Cc: Alex Williamson 
Cc: Marcel Apfelbaum 
Link: http://mid.mail-archive.com/a56b3710-9c2d-9ad0-5590-efe30b6d7bd9@redhat.com
Fixes: 9fa99d2519cbf71f871e46871df12cb446dc1c3e
Signed-off-by: Laszlo Ersek 
---
 hw/pci-host/piix.c | 2 +-
 hw/pci-host/q35.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/pci-host/piix.c b/hw/pci-host/piix.c
index 0df91e002076..fb4b0669ac9f 100644
--- a/hw/pci-host/piix.c
+++ b/hw/pci-host/piix.c
@@ -285,7 +285,7 @@ static void i440fx_pcihost_get_pci_hole64_end(Object *obj, Visitor *v,
 {
 PCIHostState 

[Qemu-devel] [PATCH 0/2] hw/pci-host/x86: extend the 64-bit PCI hole relative to the fw-assigned base

2018-09-24 Thread Laszlo Ersek
This is based on the discussion in the "[Qemu-devel] 64-bit MMIO
aperture expansion" thread, which starts at
<http://mid.mail-archive.com/a56b3710-9c2d-9ad0-5590-efe30b6d7bd9@redhat.com>.

Cc: "Michael S. Tsirkin" 
Cc: Alex Williamson 
Cc: Marcel Apfelbaum 

Thanks
Laszlo

Laszlo Ersek (2):
  hw/pci-host/x86: extract get_pci_hole64_start_value() helpers
  hw/pci-host/x86: extend the 64-bit PCI hole relative to the
fw-assigned base

 hw/pci-host/piix.c | 17 -
 hw/pci-host/q35.c  | 17 -
 2 files changed, 24 insertions(+), 10 deletions(-)

-- 
2.14.1.3.gb7cf6e02401b




Re: [Qemu-devel] [PATCH 0/2] hw/pci-host/x86: extend the 64-bit PCI hole relative to the fw-assigned base

2018-09-25 Thread Laszlo Ersek
On 09/25/18 17:04, Michael S. Tsirkin wrote:
> On Tue, Sep 25, 2018 at 12:13:44AM +0200, Laszlo Ersek wrote:
>> This is based on the discussion in the "[Qemu-devel] 64-bit MMIO
>> aperture expansion" thread, which starts at
>> <http://mid.mail-archive.com/a56b3710-9c2d-9ad0-5590-efe30b6d7bd9@redhat.com>.
>>
>> Cc: "Michael S. Tsirkin" 
>> Cc: Alex Williamson 
>> Cc: Marcel Apfelbaum 
> 
> Mentioning
> https://bugs.launchpad.net/qemu/+bug/1778350
> 
> here - do any of these patches help?

Thanks for the reference.

I'm going to add an RFT (request for testing) to that LP soon. However,
I find your remark
<https://bugs.launchpad.net/qemu/+bug/1778350/comments/6> instructive:
"I looked at it and while I might be wrong, I suspect it's a bug in ACPI
parser in that version of Linux."

In the ACPI builder, we create the qword memory descriptor for the _CRS
only if the 64-bit hole is not empty.

(The exact expression for gating the descriptor's generation has gone
through a number of iterations, but AFAICS, the condition was first
added in commit 60efd4297d44, "pc: acpi-build: create PCI0._CRS
dynamically", 2015-03-01.)

Therefore, if an ACPI parser chokes on a qword memory descriptor in a
_CRS in general, then 9fa99d2519cb would trigger that issue. And,
setting "x-pci-hole64-fix=off" would mask it again.

This series does not change *when* the memory descriptor is generated;
it only changes *how* (with what contents) it is generated, when it is
generated. So I don't expect it to make a difference for LP#1778350.

But, I'll ask the reporter to apply this and test it.

Thanks!
Laszlo



Re: [Qemu-devel] [PATCH 2/2] hw/pci-host/x86: extend the 64-bit PCI hole relative to the fw-assigned base

2018-09-25 Thread Laszlo Ersek
On 09/25/18 22:36, Alex Williamson wrote:
> On Tue, 25 Sep 2018 00:13:46 +0200
> Laszlo Ersek  wrote:
> 
>> In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI
>> hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in
>> the ACPI DSDT that would be at least as large as the new "pci-hole64-size"
>> property (2GB on i440fx, 32GB on q35). The goal was to offer "enough"
>> 64-bit MMIO aperture to the guest OS for hotplug purposes.
>>
>> In that commit, we added or modified five functions:
>>
>> - pc_pci_hole64_start(): shared between i440fx and q35. Provides a default
>>   64-bit base, which starts beyond the cold-plugged 64-bit RAM, and skips
>>   the DIMM hotplug area too (if any).
>>
>> - i440fx_pcihost_get_pci_hole64_start(), q35_host_get_pci_hole64_start():
>>   board-specific 64-bit base property getters called abstractly by the
>>   ACPI generator. Both of these fall back to pc_pci_hole64_start() if the
>>   firmware didn't program any 64-bit hole (i.e. if the firmware didn't
>>   assign a 64-bit GPA to any MMIO BAR on any device). Otherwise, they
>>   honor the firmware's BAR assignments (i.e., they treat the lowest 64-bit
>>   GPA programmed by the firmware as the base address for the aperture).
>>
>> - i440fx_pcihost_get_pci_hole64_end(), q35_host_get_pci_hole64_end():
>>   these intended to extend the aperture to our size recommendation,
>>   calculated relative to the base of the aperture.
>>
>> Despite the original intent, i440fx_pcihost_get_pci_hole64_end() and
>> q35_host_get_pci_hole64_end() currently only extend the aperture relative
>> to the default base (pc_pci_hole64_start()), ignoring any programming done
>> by the firmware. This means that our size recommendation may not be met.
>> Fix it by honoring the firmware's address assignments.
>>
>> The strange extension sizes were spotted by Alex, in the log of a guest
>> kernel running on top of OVMF (which does assign 64-bit GPAs to BARs).
>>
>> This change only affects DSDT generation, therefore no new compat property
>> is being introduced. Also, because SeaBIOS never assigns 64-bit GPAs to
>> 64-bit BARs, the patch makes no difference to SeaBIOS guests.
> 
> This is not exactly true, SeaBIOS will make use of 64-bit MMIO, but
> only if it cannot satisfy all the BARs from 32-bit MMMIO, see
> src/fw/pciinit.c:pci_bios_map_devices.

Indeed, this mistake in the commit message briefly crossed my mind, some
time after posting the series. Then I promptly forgot about it again. :/

(Because, if SeaBIOS really never picked 64-bit addresses, then commit
9fa99d2519cb would have been mostly untestable, in the first place.)

Thank you for pointing this out!

> Create a VM with several
> assigned GPUs and you'll eventually cross that threshold and all 64-bit
> BARs will be moved above 4G.  I'm sure a few sufficiently sized ivshmem
> devices could do the same.  Thanks,

OK, I think I'll have to do those ivshmem tests, and rework the commit
message a bit.

Thanks!
Laszlo

> 
> Alex
> 
>> (Which is in
>> turn why ACPI test data for the "bios-tables-test" need not be refreshed.)
>>
>> Using an i440fx OVMF guest with 5GB RAM, an example _CRS change is:
>>
>>> @@ -881,9 +881,9 @@
>>>  QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, 
>>> Cacheable, ReadWrite,
>>>  0x, // Granularity
>>>  0x0008, // Range Minimum
>>> -0x00080001C0FF, // Range Maximum
>>> +0x00087FFF, // Range Maximum
>>>  0x, // Translation Offset
>>> -0x0001C100, // Length
>>> +0x8000, // Length
>>>  ,, , AddressRangeMemory, TypeStatic)
>>>  })
>>>  Device (GPE0)  
>>
>> (On i440fx, the low RAM split is at 3GB, in this case. Therefore, with 5GB
>> guest RAM and no DIMM hotplug range, pc_pci_hole64_start() returns 4 +
>> (5-3) = 6 GB. Adding the 2GB extension to that yields 8GB, which is below
>> the firmware-programmed base of 32GB, before the patch. Therefore, before
>> the patch, the extension is ineffective. After the patch, we add the 2GB
>> extension to the firmware-programmed base, namely 32GB.)
>>
>> Using a q35 OVMF guest with 5GB RAM, an example _CRS change is:
>>
>>> @@ -3162,9 +3162,9 @@
>>>  QWordMemory (ResourceProducer, PosDecode, MinFixed, Max

Re: [Qemu-devel] [PATCH 2/2] hw/pci-host/x86: extend the 64-bit PCI hole relative to the fw-assigned base

2018-09-26 Thread Laszlo Ersek
On 09/25/18 22:36, Alex Williamson wrote:
> On Tue, 25 Sep 2018 00:13:46 +0200
> Laszlo Ersek  wrote:
> 
>> In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI
>> hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in
>> the ACPI DSDT that would be at least as large as the new "pci-hole64-size"
>> property (2GB on i440fx, 32GB on q35). The goal was to offer "enough"
>> 64-bit MMIO aperture to the guest OS for hotplug purposes.
>>
>> In that commit, we added or modified five functions:
>>
>> - pc_pci_hole64_start(): shared between i440fx and q35. Provides a default
>>   64-bit base, which starts beyond the cold-plugged 64-bit RAM, and skips
>>   the DIMM hotplug area too (if any).
>>
>> - i440fx_pcihost_get_pci_hole64_start(), q35_host_get_pci_hole64_start():
>>   board-specific 64-bit base property getters called abstractly by the
>>   ACPI generator. Both of these fall back to pc_pci_hole64_start() if the
>>   firmware didn't program any 64-bit hole (i.e. if the firmware didn't
>>   assign a 64-bit GPA to any MMIO BAR on any device). Otherwise, they
>>   honor the firmware's BAR assignments (i.e., they treat the lowest 64-bit
>>   GPA programmed by the firmware as the base address for the aperture).
>>
>> - i440fx_pcihost_get_pci_hole64_end(), q35_host_get_pci_hole64_end():
>>   these intended to extend the aperture to our size recommendation,
>>   calculated relative to the base of the aperture.
>>
>> Despite the original intent, i440fx_pcihost_get_pci_hole64_end() and
>> q35_host_get_pci_hole64_end() currently only extend the aperture relative
>> to the default base (pc_pci_hole64_start()), ignoring any programming done
>> by the firmware. This means that our size recommendation may not be met.
>> Fix it by honoring the firmware's address assignments.
>>
>> The strange extension sizes were spotted by Alex, in the log of a guest
>> kernel running on top of OVMF (which does assign 64-bit GPAs to BARs).
>>
>> This change only affects DSDT generation, therefore no new compat property
>> is being introduced. Also, because SeaBIOS never assigns 64-bit GPAs to
>> 64-bit BARs, the patch makes no difference to SeaBIOS guests.
> 
> This is not exactly true, SeaBIOS will make use of 64-bit MMIO, but
> only if it cannot satisfy all the BARs from 32-bit MMIO, see
> src/fw/pciinit.c:pci_bios_map_devices.  Create a VM with several
> assigned GPUs and you'll eventually cross that threshold and all 64-bit
> BARs will be moved above 4G.  I'm sure a few sufficiently sized ivshmem
> devices could do the same.  Thanks,

The effect of this patch is not hard to demonstrate with SeaBIOS+Q35,
when using e.g. 5GB of guest RAM and a 4GB ivshmem-plain device.

However, using SeaBIOS+i440fx, I can't show the difference. I've been
experimenting with various ivshmem devices (even multiple at the same
time, with different sizes). The "all or nothing" nature of SeaBIOS's
high allocation of the 64-bit BARs, combined with hugepage alignment
inside SeaBIOS, combined with the small (2GB) rounding size used in QEMU
for i440fx, seems to make it surprisingly difficult to trigger the issue.

I figure I should:

(1) remove the sentence "the patch makes no difference to SeaBIOS
guests" from the commit message,

(2) include the DSDT diff on SeaBIOS/q35 in the commit message,

(3) remain silent on SeaBIOS/i440fx, in the commit message,

(4) append a new patch, for "bios-tables-test", so that the ACPI gen
change is validated as part of the test suite, on SeaBIOS/q35.

Regarding (4):

- is it OK if I add the test only for Q35?

- what guest RAM size am I allowed to use in the test suite? In my own
SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's
acceptable for the test suite.

Thanks!
Laszlo



Re: [Qemu-devel] [PATCH 2/2] hw/pci-host/x86: extend the 64-bit PCI hole relative to the fw-assigned base

2018-09-26 Thread Laszlo Ersek
On 09/26/18 13:12, Laszlo Ersek wrote:

> (4) append a new patch, for "bios-tables-test", so that the ACPI gen
> change is validated as part of the test suite, on SeaBIOS/q35.
> 
> Regarding (4):
> 
> - is it OK if I add the test only for Q35?
> 
> - what guest RAM size am I allowed to use in the test suite? In my own
> SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's
> acceptable for the test suite.

And, even if the patch's effect can be shown with little guest DRAM, the
test case still requires a multi-gig ivshmem-plain device. In
"tests/ivshmem-test.c", I see how it is set up -- the backend is set up
with shm_open(). The file created under /dev/shm (on Linux) might
require host RAM just the same as normal guest DRAM (especially with
memory overcommit disabled on the host), correct?

Thanks
Laszlo



Re: [Qemu-devel] Converting PCIDevice to VirtIODevice

2018-09-26 Thread Laszlo Ersek
On 09/26/18 17:33, Sameeh Jubran wrote:
> Hi All,
> 
> I have used the function "pci_qdev_find_device" to find a device using
> it's id. This is a virtio device and I'm trying to convert it to
> VirtIODevice.
> 
> What's the best way to do this? Simply converting it to DeviceState
> doesn't work and I think I should access the underlying virtio pci bus
> and through it access the virtio-device, but couldn't find any elegant
> way of doing so.

pci_qdev_find_device() produces a pointer-to-PCIDevice.

In virtio_pci_realize() [hw/virtio/virtio-pci.c] we see that the
following works:

VirtIOPCIProxy *proxy = VIRTIO_PCI(pci_dev);

And in virtio_pci_pre_plugged() [hw/virtio/virtio-pci.c], we find:

VirtIODevice *vdev = virtio_bus_get_device(&proxy->bus);

Hope this helps (in fact, I hope it *works*! :) )
Laszlo



Re: [Qemu-devel] [PATCH 2/2] hw/pci-host/x86: extend the 64-bit PCI hole relative to the fw-assigned base

2018-09-26 Thread Laszlo Ersek
(+Eric)

On 09/26/18 14:10, Igor Mammedov wrote:
> On Wed, 26 Sep 2018 13:35:14 +0200
> Laszlo Ersek  wrote:
> 
>> On 09/26/18 13:12, Laszlo Ersek wrote:
>>
>>> (4) append a new patch, for "bios-tables-test", so that the ACPI gen
>>> change is validated as part of the test suite, on SeaBIOS/q35.
>>>
>>> Regarding (4):
>>>
>>> - is it OK if I add the test only for Q35?
>>>
>>> - what guest RAM size am I allowed to use in the test suite? In my own
>>> SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's
>>> acceptable for the test suite.  
>>
>> And, even if the patch's effect can be shown with little guest DRAM, the
>> test case still requires a multi-gig ivshmem-plain device. In
>> "tests/ivshmem-test.c", I see how it is set up -- the backend is set up
>> with shm_open(). The file created under /dev/shm (on Linux) might
>> require host RAM just the same as normal guest DRAM (especially with
>> memory overcommit disabled on the host), correct?
> with over commit disable or cgroups limits enforced (I'd expect that
> in automated testing env i.e. travis or something else)
> allocating such amount of RAM probably would fail like crazy.
> 
> Maybe using memdev file backend with manually created sparse file
> might actually work (with preallocate disabled)

Thanks, this sounds like a good idea.

I see shm_open() is used heavily in ivshmem-related tests. I haven't
looked much at shm_open() before. (I've always known it existed in
POSIX, but I've never cared.)

So now I first checked what shm_open() would give me over a regular
temporary file created with open(); after all, the file descriptor
returned by either would have to be mmap()'d. From the rationale in POSIX:

<http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xsh_chap02.html#tag_22_02_08_14>,

it seems like the {shm_open(), mmap()} combo has two significant
guarantees over {open(), mmap()}:

- the namespace may be distinct (there need not be a writeable
  filesystem at all),

- the shared object will *always* be locked in RAM ("Shared memory is
  not just simply providing common access to data, it is providing the
  fastest possible communication between the processes").

The rationale seems to permit, on purpose, an shm_open() implementation
that is actually based on open(), using a special file system -- and
AIUI, /dev/shm is just that, on Linux.

Eric, does the above sound more or less correct?

If it is correct, then I think shm_open() is exactly what I *don't* want
for this use case. Because, while I do need a pathname for an
mmap()-able object (regular file, or otherwise), just so I can do:

  -object memory-backend-file,id=mem-obj,...,mem-path=... \
  -device ivshmem-plain,memdev=mem-obj,...

, I want the underlying object to put as little pressure on the system
that runs the test suite as possible.

This means I should specifically ask for a regular file, to be mmap()'d
(with MAP_SHARED). Then the kernel knows in advance that it can always
page out the dirty stuff, and the mapping shouldn't clash with cgroups,
or disabled memory overcommit.

Now, in order to make that actually safe, I should in theory ask for
preallocation on the filesystem (otherwise, if the filesystem runs out
of space, while the kernel is allocating fs extents in order to flush
the dirty pages to them, the process gets a SIGBUS, IIRC). However,
because I know that nothing will be in fact dirtied, I can minimize the
footprint on the filesystem as well, and forego preallocation too.

This suggests that, in my test case,
- I call g_file_open_tmp() for creating the temporary file,
- pass the returned fd to ftruncate() for resizing the temporary file,
- pass the returned pathname to the "memory-backend-file" objects, in
  the "mem-path" property,
- set "share=on",
- set "prealloc=off",
- "discard-data" is irrelevant (there won't be any dirty pages).

Thanks
Laszlo



Re: [Qemu-devel] [PATCH 2/2] hw/pci-host/x86: extend the 64-bit PCI hole relative to the fw-assigned base

2018-09-26 Thread Laszlo Ersek
(+Eric)

On 09/26/18 18:26, Alex Williamson wrote:
> On Wed, 26 Sep 2018 13:12:47 +0200
> Laszlo Ersek  wrote:
> 
>> On 09/25/18 22:36, Alex Williamson wrote:
>>> On Tue, 25 Sep 2018 00:13:46 +0200
>>> Laszlo Ersek  wrote:
>>>   
>>>> In commit 9fa99d2519cb ("hw/pci-host: Fix x86 Host Bridges 64bit PCI
>>>> hole", 2017-11-16), we meant to expose such a 64-bit PCI MMIO aperture in
>>>> the ACPI DSDT that would be at least as large as the new "pci-hole64-size"
>>>> property (2GB on i440fx, 32GB on q35). The goal was to offer "enough"
>>>> 64-bit MMIO aperture to the guest OS for hotplug purposes.
>>>>
>>>> In that commit, we added or modified five functions:
>>>>
>>>> - pc_pci_hole64_start(): shared between i440fx and q35. Provides a default
>>>>   64-bit base, which starts beyond the cold-plugged 64-bit RAM, and skips
>>>>   the DIMM hotplug area too (if any).
>>>>
>>>> - i440fx_pcihost_get_pci_hole64_start(), q35_host_get_pci_hole64_start():
>>>>   board-specific 64-bit base property getters called abstractly by the
>>>>   ACPI generator. Both of these fall back to pc_pci_hole64_start() if the
>>>>   firmware didn't program any 64-bit hole (i.e. if the firmware didn't
>>>>   assign a 64-bit GPA to any MMIO BAR on any device). Otherwise, they
>>>>   honor the firmware's BAR assignments (i.e., they treat the lowest 64-bit
>>>>   GPA programmed by the firmware as the base address for the aperture).
>>>>
>>>> - i440fx_pcihost_get_pci_hole64_end(), q35_host_get_pci_hole64_end():
>>>>   these intended to extend the aperture to our size recommendation,
>>>>   calculated relative to the base of the aperture.
>>>>
>>>> Despite the original intent, i440fx_pcihost_get_pci_hole64_end() and
>>>> q35_host_get_pci_hole64_end() currently only extend the aperture relative
>>>> to the default base (pc_pci_hole64_start()), ignoring any programming done
>>>> by the firmware. This means that our size recommendation may not be met.
>>>> Fix it by honoring the firmware's address assignments.
>>>>
>>>> The strange extension sizes were spotted by Alex, in the log of a guest
>>>> kernel running on top of OVMF (which does assign 64-bit GPAs to BARs).
>>>>
>>>> This change only affects DSDT generation, therefore no new compat property
>>>> is being introduced. Also, because SeaBIOS never assigns 64-bit GPAs to
>>>> 64-bit BARs, the patch makes no difference to SeaBIOS guests.  
>>>
>>> This is not exactly true, SeaBIOS will make use of 64-bit MMIO, but
>>> only if it cannot satisfy all the BARs from 32-bit MMMIO, see
>>> src/fw/pciinit.c:pci_bios_map_devices.  Create a VM with several
>>> assigned GPUs and you'll eventually cross that threshold and all 64-bit
>>> BARs will be moved above 4G.  I'm sure a few sufficiently sized ivshmem
>>> devices could do the same.  Thanks,  
>>
>> The effect of this patch is not hard to demonstrate with SeaBIOS+Q35,
>> when using e.g. 5GB of guest RAM and a 4GB ivshmem-plain device.
>>
>> However, using SeaBIOS+i440fx, I can't show the difference. I've been
>> experimenting with various ivshmem devices (even multiple at the same
>> time, with different sizes). The "all or nothing" nature of SeaBIOS's
>> high allocation of the 64-bit BARs, combined with hugepage alignment
>> inside SeaBIOS, combined with the small (2GB) rounding size used in QEMU
>> for i440fx, seem to make it surprisingly difficult to trigger the issue.
>>
>> I figure I should:
>>
>> (1) remove the sentence "the patch makes no difference to SeaBIOS
>> guests" from the commit message,
>>
>> (2) include the DSDT diff on SeaBIOS/q35 in the commit message,
>>
>> (3) remain silent on SeaBIOS/i440fx, in the commit message,
>>
>> (4) append a new patch, for "bios-tables-test", so that the ACPI gen
>> change is validated as part of the test suite, on SeaBIOS/q35.
>>
>> Regarding (4):
>>
>> - is it OK if I add the test only for Q35?
>>
>> - what guest RAM size am I allowed to use in the test suite? In my own
>> SeaBIOS/Q35 reproducer I currently use 5GB, but I'm not sure if it's
>> acceptable for the test suite.
> 
> Seems like you've done due diligence, the plan looks ok to me.
> Regarding the test memory allocation, is it possible and reasonable to
> perhaps create a 256MB shared memory area and re-use it for multiple
> ivshmem devices?  ie. rather than 1, 4GB ivshmem device, use 16, 256MB
> devices, all with the same backing.  Thanks,

This too sounds useful. AIUI, ftruncate() is neither forbidden, nor
required, to allocate filesystem extents when increasing the size of a
file. Using one smaller regular temporary file as the common foundation
for multiple "memory-backend-file" objects will save space on the fs if
ftruncate() happens to allocate extents.

(I've also thought of passing the same "memory-backend-file" object to
multiple ivshmem-plain devices, but ivshmem_plain_realize()
[hw/misc/ivshmem.c] checks whether the HostMemoryBackend is already mapped.)

Thanks!
Laszlo



Re: [Qemu-devel] TPM status

2017-07-01 Thread Laszlo Ersek
On 06/29/17 21:31, Stefan Berger wrote:
> On 06/27/2017 12:32 PM, Laszlo Ersek wrote:
>>
>> Looks great to me, thank you!
>>
>> Two requests in addition to the above remarks:
>> - can you provide command line options / examples wherever appropriate?
> 
> I didn't add it because we describe that on this page here:
> 
> http://download.qemu.org/qemu-doc.html
> 
> 
> "To create a passthrough TPM use the following two options:
> 
> -tpmdev passthrough,id=tpm0 -device tpm-tis,tpmdev=tpm0"

Yes, I saw that in the manual. The manual is huge, and personally I'd
prefer either an embedded example or a more targeted reference.

At least in "docs/pcie.txt", Marcel added a whole bunch of command line
snippets, and it is *very* useful (to me anyway).
"docs/specs/fw_cfg.txt" also talks about the command line under
"Externally Provided Items".

Thanks for considering it,
Laszlo



Re: [Qemu-devel] [PATCH 1/7] vmgenid: replace x-write-pointer-available hack

2017-07-03 Thread Laszlo Ersek
On 06/29/17 15:23, Marc-André Lureau wrote:
> This compat property sole function is to prevent the device from being
> instantiated. Instead of requiring an extra compat property, check if
> fw_cfg has DMA enabled.
> 
> This has the additional benefit of handling other cases properly, like:
> 
>   $ qemu-system-x86_64 -device vmgenid -machine none
>   qemu-system-x86_64: -device vmgenid: vmgenid requires DMA write support in 
> fw_cfg, which this machine type does not provide
>   $ qemu-system-x86_64 -device vmgenid -machine pc-i440fx-2.9 -global 
> fw_cfg.dma_enabled=off
>   qemu-system-x86_64: -device vmgenid: vmgenid requires DMA write support in 
> fw_cfg, which this machine type does not provide
>   $ qemu-system-x86_64 -device vmgenid -machine pc-i440fx-2.6 -global 
> fw_cfg.dma_enabled=on
>   [boots normally]
> 
> Suggested-by: Eduardo Habkost 
> Signed-off-by: Marc-André Lureau 
> ---
>  include/hw/acpi/bios-linker-loader.h | 2 ++
>  include/hw/compat.h  | 4 
>  hw/acpi/bios-linker-loader.c | 6 ++
>  hw/acpi/vmgenid.c| 9 +
>  4 files changed, 9 insertions(+), 12 deletions(-)
> 
> diff --git a/include/hw/acpi/bios-linker-loader.h 
> b/include/hw/acpi/bios-linker-loader.h
> index efe17b0b9c..a711dbced8 100644
> --- a/include/hw/acpi/bios-linker-loader.h
> +++ b/include/hw/acpi/bios-linker-loader.h
> @@ -7,6 +7,8 @@ typedef struct BIOSLinker {
>  GArray *file_list;
>  } BIOSLinker;
>  
> +bool bios_linker_loader_can_write_pointer(void);
> +
>  BIOSLinker *bios_linker_loader_init(void);
>  
>  void bios_linker_loader_alloc(BIOSLinker *linker,
> diff --git a/include/hw/compat.h b/include/hw/compat.h
> index 26cd5851a5..36f02179ac 100644
> --- a/include/hw/compat.h
> +++ b/include/hw/compat.h
> @@ -150,10 +150,6 @@
>  .driver   = "fw_cfg_io",\
>  .property = "dma_enabled",\
>  .value= "off",\
> -},{\
> -.driver   = "vmgenid",\
> -.property = "x-write-pointer-available",\
> -.value= "off",\
>  },
>  
>  #define HW_COMPAT_2_3 \
> diff --git a/hw/acpi/bios-linker-loader.c b/hw/acpi/bios-linker-loader.c
> index 046183a0f1..587d62cb93 100644
> --- a/hw/acpi/bios-linker-loader.c
> +++ b/hw/acpi/bios-linker-loader.c
> @@ -168,6 +168,12 @@ bios_linker_find_file(const BIOSLinker *linker, const 
> char *name)
>  return NULL;
>  }
>  
> +bool bios_linker_loader_can_write_pointer(void)
> +{
> +FWCfgState *fw_cfg = fw_cfg_find();
> +return fw_cfg && fw_cfg_dma_enabled(fw_cfg);
> +}
> +
>  /*
>   * bios_linker_loader_alloc: ask guest to load file into guest memory.
>   *
> diff --git a/hw/acpi/vmgenid.c b/hw/acpi/vmgenid.c
> index a32b847fe0..ab5da293fd 100644
> --- a/hw/acpi/vmgenid.c
> +++ b/hw/acpi/vmgenid.c
> @@ -205,17 +205,11 @@ static void vmgenid_handle_reset(void *opaque)
>  memset(vms->vmgenid_addr_le, 0, ARRAY_SIZE(vms->vmgenid_addr_le));
>  }
>  
> -static Property vmgenid_properties[] = {
> -DEFINE_PROP_BOOL("x-write-pointer-available", VmGenIdState,
> - write_pointer_available, true),
> -DEFINE_PROP_END_OF_LIST(),
> -};
> -
>  static void vmgenid_realize(DeviceState *dev, Error **errp)
>  {
>  VmGenIdState *vms = VMGENID(dev);
>  
> -if (!vms->write_pointer_available) {
> +if (!bios_linker_loader_can_write_pointer()) {
>  error_setg(errp, "%s requires DMA write support in fw_cfg, "
> "which this machine type does not provide", 
> VMGENID_DEVICE);
>  return;
> @@ -239,7 +233,6 @@ static void vmgenid_device_class_init(ObjectClass *klass, 
> void *data)
>  dc->vmsd = &vmstate_vmgenid;
>  dc->realize = vmgenid_realize;
>  dc->hotpluggable = false;
> -dc->props = vmgenid_properties;
>  
>  object_class_property_add_str(klass, VMGENID_GUID, NULL,
>vmgenid_set_guid, NULL);
> 

I believe we discussed this approach back then (but I can't find the
relevant messages, of course).

What guarantees that, by the time you call fw_cfg_find() from
vmgenid_realize() -- that is, from the realize function of an
independent device --, the fw_cfg device will have been realized (with
its properties having taken their final values)? I don't see how the
ordering is guaranteed here; please explain (preferably in the commit
message).

Thanks,
Laszlo



Re: [Qemu-devel] [PATCH 1/7] vmgenid: replace x-write-pointer-available hack

2017-07-03 Thread Laszlo Ersek
On 07/03/17 20:27, Eduardo Habkost wrote:
> On Mon, Jul 03, 2017 at 08:06:33PM +0200, Laszlo Ersek wrote:
>> On 06/29/17 15:23, Marc-André Lureau wrote:
>>> This compat property sole function is to prevent the device from being
>>> instantiated. Instead of requiring an extra compat property, check if
>>> fw_cfg has DMA enabled.
>>>
>>> This has the additional benefit of handling other cases properly, like:
>>>
>>>   $ qemu-system-x86_64 -device vmgenid -machine none
>>>   qemu-system-x86_64: -device vmgenid: vmgenid requires DMA write support 
>>> in fw_cfg, which this machine type does not provide
>>>   $ qemu-system-x86_64 -device vmgenid -machine pc-i440fx-2.9 -global 
>>> fw_cfg.dma_enabled=off
>>>   qemu-system-x86_64: -device vmgenid: vmgenid requires DMA write support 
>>> in fw_cfg, which this machine type does not provide
>>>   $ qemu-system-x86_64 -device vmgenid -machine pc-i440fx-2.6 -global 
>>> fw_cfg.dma_enabled=on
>>>   [boots normally]
>>>
>>> Suggested-by: Eduardo Habkost 
>>> Signed-off-by: Marc-André Lureau 
>>> ---
>>>  include/hw/acpi/bios-linker-loader.h | 2 ++
>>>  include/hw/compat.h  | 4 
>>>  hw/acpi/bios-linker-loader.c | 6 ++
>>>  hw/acpi/vmgenid.c| 9 +
>>>  4 files changed, 9 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/include/hw/acpi/bios-linker-loader.h 
>>> b/include/hw/acpi/bios-linker-loader.h
>>> index efe17b0b9c..a711dbced8 100644
>>> --- a/include/hw/acpi/bios-linker-loader.h
>>> +++ b/include/hw/acpi/bios-linker-loader.h
>>> @@ -7,6 +7,8 @@ typedef struct BIOSLinker {
>>>  GArray *file_list;
>>>  } BIOSLinker;
>>>  
>>> +bool bios_linker_loader_can_write_pointer(void);
>>> +
>>>  BIOSLinker *bios_linker_loader_init(void);
>>>  
>>>  void bios_linker_loader_alloc(BIOSLinker *linker,
>>> diff --git a/include/hw/compat.h b/include/hw/compat.h
>>> index 26cd5851a5..36f02179ac 100644
>>> --- a/include/hw/compat.h
>>> +++ b/include/hw/compat.h
>>> @@ -150,10 +150,6 @@
>>>  .driver   = "fw_cfg_io",\
>>>  .property = "dma_enabled",\
>>>  .value= "off",\
>>> -},{\
>>> -.driver   = "vmgenid",\
>>> -.property = "x-write-pointer-available",\
>>> -.value= "off",\
>>>  },
>>>  
>>>  #define HW_COMPAT_2_3 \
>>> diff --git a/hw/acpi/bios-linker-loader.c b/hw/acpi/bios-linker-loader.c
>>> index 046183a0f1..587d62cb93 100644
>>> --- a/hw/acpi/bios-linker-loader.c
>>> +++ b/hw/acpi/bios-linker-loader.c
>>> @@ -168,6 +168,12 @@ bios_linker_find_file(const BIOSLinker *linker, const 
>>> char *name)
>>>  return NULL;
>>>  }
>>>  
>>> +bool bios_linker_loader_can_write_pointer(void)
>>> +{
>>> +FWCfgState *fw_cfg = fw_cfg_find();
>>> +return fw_cfg && fw_cfg_dma_enabled(fw_cfg);
>>> +}
>>> +
>>>  /*
>>>   * bios_linker_loader_alloc: ask guest to load file into guest memory.
>>>   *
>>> diff --git a/hw/acpi/vmgenid.c b/hw/acpi/vmgenid.c
>>> index a32b847fe0..ab5da293fd 100644
>>> --- a/hw/acpi/vmgenid.c
>>> +++ b/hw/acpi/vmgenid.c
>>> @@ -205,17 +205,11 @@ static void vmgenid_handle_reset(void *opaque)
>>>  memset(vms->vmgenid_addr_le, 0, ARRAY_SIZE(vms->vmgenid_addr_le));
>>>  }
>>>  
>>> -static Property vmgenid_properties[] = {
>>> -DEFINE_PROP_BOOL("x-write-pointer-available", VmGenIdState,
>>> - write_pointer_available, true),
>>> -DEFINE_PROP_END_OF_LIST(),
>>> -};
>>> -
>>>  static void vmgenid_realize(DeviceState *dev, Error **errp)
>>>  {
>>>  VmGenIdState *vms = VMGENID(dev);
>>>  
>>> -if (!vms->write_pointer_available) {
>>> +if (!bios_linker_loader_can_write_pointer()) {
>>>  error_setg(errp, "%s requires DMA write support in fw_cfg, "
>>> "which this machine type does not provide", 
>>> VMGENID_DEVICE);
>>>  return;
>>> @@ -239,7 +233,6 @@ static void vmgenid_device_class_init(ObjectClass 
>>> *klass, void *data)
>>>  dc->vmsd = &vmstate_vmge

Re: [Qemu-devel] [PATCH] specs: Describe the TPM support in QEMU

2017-07-04 Thread Laszlo Ersek
On 06/29/17 20:00, Stefan Berger wrote:
> This patch adds a description of the current TPM support in QEMU
> to the specs.
> 
> Several public specs are referenced via their landing page on the
> trustedcomputinggroup.org website.
> 
> Signed-off-by: Stefan Berger 
> ---
>  docs/specs/tpm.txt | 98 
> ++
>  1 file changed, 98 insertions(+)
>  create mode 100644 docs/specs/tpm.txt
> 
> diff --git a/docs/specs/tpm.txt b/docs/specs/tpm.txt
> new file mode 100644
> index 000..6472989
> --- /dev/null
> +++ b/docs/specs/tpm.txt
> @@ -0,0 +1,98 @@
> +
> +QEMU TPM Device
> +===
> +
> += Guest-side Hardware Interface =
> +
> +The QEMU TPM emulation implements a TPM TIS hardware interface following
> +the Trusted Computing Group's specification "TCG PC Client Specific TPM
> +Interface Specification (TIS)", Specifcation Version 1.3, 21 March 2013.
> +This specification, or a later version of it, can be accessed from the
> +following URL:
> +
> +https://trustedcomputinggroup.org/pc-client-work-group-pc-client-specific-tpm-interface-specification-tis/
> +
> +The TIS interface makes a memory mapped IO region in the area 0xfed40000 -
> +0xfed44fff available to the guest operating system.
> +
> +
> +QEMU files related to TPM TIS interfaceL

... OK, Eric mentioned the typo already

> + - hw/tpm/tpm_tis.c
> + - hw/tpm/tpm_tis.h
> +
> +
> += ACPI Interface =
> +
> +The TPM device is defined with ACPI ID "PNP0C31". QEMU builds a SSDT and 
> passes
> +it into the guest through the fw_cfg device. The device description contains
> +the base address of the TIS interface 0xfed40000 and the size of the MMIO 
> area
> +(0x5000). In case a TPM2 is used by QEMU, a TPM2 ACPI table is also provided.
> +The device is described to be used in polling mode rather than interrupt mode
> +primarily because no unused IRQ could be found.
> +
> +To support measurement logs to be written by the firmware, e.g. SeaBIOS, a 
> TCPA
> +table is implemented. This table provides a 64kb buffer where the firmware 
> can
> +write its log into. For TPM 2 only a more recent version of the TPM2 table
> +provides support for measurements logs and a TCPA table does not need to be
> +created.
> +
> +The TCPA and TPM2 ACPI tables follow the Trusted Computing Group 
> specification
> +"TCG ACPI Specification" Family "1.2" and "2.0", Level 00 Revision 00.37. 
> This
> +specification, or a later version of it, can be accessed from the following
> +URL:
> +
> +https://trustedcomputinggroup.org/tcg-acpi-specification/
> +
> +
> +QEMU files related to TPM ACPI tables:
> + - hw/i386/acpi-build.c
> + - include/hw/acpi/tpm.h
> +
> +
> += TPM backend devices =
> +
> +The TPM implementation is split into two parts, frontend and backend. 

git-am complains that this line adds a whitespace error (trailing space,
namely).

> +The frontend part is the hardware interface, such as the TPM TIS interface
> +described earlier, and the other part is the TPM backend interface. The
> +backend interfaces implement the interaction with a TPM device,
> +which may be a physical or an emulated device. The split between the front-
> +and backend devices allows a frontend to be connected with any available
> +backend. This enables the TIS interface to be used with the passthrough
> +backend or the swtpm backend.

In the previous (RFC-like) version, you wrote "(future) swtpm" -- I
think that would be more precise. When QEMU gets swtpm support, we can
update this language (because we should update the filename list(s) as
well). What do you think?

> +
> +
> +QEMU file related to TPM backends:

s/file/files/

> + - backends/tpm.c
> + - include/sysemu/tpm_backend.h
> + - include/sysemu/tpm_backend_int.h

Is this generic backend code? Because for the passthrough backend, you
provide a separate file list below (which is great, but then we should
qualify *this* list as "generic" or some such).

> +
> +
> +== The QEMU TPM passthrough device ==
> +
> +In case QEMU is run on Linux as the host operating system it is possible to
> +make the hardware TPM device available to a single QEMU guest. In this case 
> the
> +user must make sure that no other program is using the device, e.g., 
> /dev/tpm0,
> +before trying to start QEMU with it.
> +
> +The passthrough driver uses the host's TPM device for sending TPM commands
> +and receiving responses from. Besides that it accesses the TPM device's sysfs
> +entry for support of command cancellation. Since none of the state of a 
> hardware
> +TPM can be migrated between hosts, virtual machine migration is disabled when
> +the TPM passthrough driver is used.
> +
> +Since the host's TPM device will already be initialize by the host's 
> firmware,

s/initialize/initialized/

... ah, also pointed out by Eric.

> +certain commands, e.g. TPM_Startup(), sent by the virtual firmware for device
> +initialization, will fail. In this case the firmware should simply not use 
> the
> +TPM.
> +
> +Sharing the d

Re: [Qemu-devel] [PATCH 2/7] acpi: add vmcoreinfo device

2017-07-04 Thread Laszlo Ersek
comments below

On 06/29/17 15:23, Marc-André Lureau wrote:
> The VM coreinfo (vmcoreinfo) device is an emulated device which
> exposes a 4k memory range to the guest to store various informations
> useful to debug the guest OS. (it is greatly inspired by the VMGENID
> device implementation)
> 
> This is an early-boot alternative to the qemu-ga VMDUMP_INFO event
> proposed in "[PATCH 00/21] WIP: dump: add kaslr support".
> 
> A proof-of-concept kernel module:
> https://github.com/elmarco/vmgenid-test/blob/master/qemuvmci-test.c
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  include/hw/acpi/aml-build.h|   1 +
>  include/hw/acpi/vmcoreinfo.h   |  36 +++
>  hw/acpi/aml-build.c|   2 +
>  hw/acpi/vmcoreinfo.c   | 198 
> +
>  hw/i386/acpi-build.c   |  14 +++
>  default-configs/arm-softmmu.mak|   1 +
>  default-configs/i386-softmmu.mak   |   1 +
>  default-configs/x86_64-softmmu.mak |   1 +
>  docs/specs/vmcoreinfo.txt  | 138 ++
>  hw/acpi/Makefile.objs  |   1 +
>  10 files changed, 393 insertions(+)
>  create mode 100644 include/hw/acpi/vmcoreinfo.h
>  create mode 100644 hw/acpi/vmcoreinfo.c
>  create mode 100644 docs/specs/vmcoreinfo.txt
>
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 88d0738d76..cf781bcd34 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -211,6 +211,7 @@ struct AcpiBuildTables {
>  GArray *rsdp;
>  GArray *tcpalog;
>  GArray *vmgenid;
> +GArray *vmcoreinfo;
>  BIOSLinker *linker;
>  } AcpiBuildTables;
>  
> diff --git a/include/hw/acpi/vmcoreinfo.h b/include/hw/acpi/vmcoreinfo.h
> new file mode 100644
> index 00..40fe99c3ed
> --- /dev/null
> +++ b/include/hw/acpi/vmcoreinfo.h
> @@ -0,0 +1,36 @@
> +#ifndef ACPI_VMCOREINFO_H
> +#define ACPI_VMCOREINFO_H
> +
> +#include "hw/acpi/bios-linker-loader.h"
> +#include "hw/qdev.h"
> +
> +#define VMCOREINFO_DEVICE   "vmcoreinfo"
> +#define VMCOREINFO_FW_CFG_FILE  "etc/vmcoreinfo"
> +#define VMCOREINFO_ADDR_FW_CFG_FILE "etc/vmcoreinfo-addr"
> +
> +#define VMCOREINFO_FW_CFG_SIZE  4096 /* Occupy a page of memory */
> +#define VMCOREINFO_OFFSET   40   /* allow space for
> +  * OVMF SDT Header Probe Supressor
> +  */
> +
> +#define VMCOREINFO(obj) OBJECT_CHECK(VmcoreinfoState, (obj), VMCOREINFO_DEVICE)
> +
> +typedef struct VmcoreinfoState {

I think this should be spelled with a bit more camel-casing, like
VMCoreInfoState or some such.

> +DeviceClass parent_obj;
> +uint8_t vmcoreinfo_addr_le[8];   /* Address of memory region */
> +bool write_pointer_available;
> +} VmcoreinfoState;
> +
> +/* returns NULL unless there is exactly one device */
> +static inline Object *find_vmcoreinfo_dev(void)
> +{
> +return object_resolve_path_type("", VMCOREINFO_DEVICE, NULL);
> +}
> +
> +void vmcoreinfo_build_acpi(VmcoreinfoState *vis, GArray *table_data,
> +   GArray *vmci, BIOSLinker *linker);
> +void vmcoreinfo_add_fw_cfg(VmcoreinfoState *vis, FWCfgState *s, GArray *vmci);
> +bool vmcoreinfo_get(VmcoreinfoState *vis, uint64_t *paddr, uint64_t *size,
> +Error **errp);
> +
> +#endif
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 36a6cc450e..47043ade4a 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1561,6 +1561,7 @@ void acpi_build_tables_init(AcpiBuildTables *tables)
>  tables->table_data = g_array_new(false, true /* clear */, 1);
>  tables->tcpalog = g_array_new(false, true /* clear */, 1);
>  tables->vmgenid = g_array_new(false, true /* clear */, 1);
> +tables->vmcoreinfo = g_array_new(false, true /* clear */, 1);
>  tables->linker = bios_linker_loader_init();
>  }
>  
> @@ -1571,6 +1572,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, 
> bool mfre)
>  g_array_free(tables->table_data, true);
>  g_array_free(tables->tcpalog, mfre);
>  g_array_free(tables->vmgenid, mfre);
> +g_array_free(tables->vmcoreinfo, mfre);
>  }
>  
>  /* Build rsdt table */
> diff --git a/hw/acpi/vmcoreinfo.c b/hw/acpi/vmcoreinfo.c
> new file mode 100644
> index 00..216e0bb83a
> --- /dev/null
> +++ b/hw/acpi/vmcoreinfo.c
> @@ -0,0 +1,198 @@
> +/*
> + *  Virtual Machine coreinfo device
> + *  (based on Virtual Machine Generation ID Device)
> + *
> + *  Copyright (C) 2017 Red Hat, Inc.
> + *  Copyright (C) 2017 Skyport Systems.
> + *
> + *  Authors: Marc-André Lureau 
> + *   Ben Warren 
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +#include "qemu/osdep.h"
> +#include "hw/acpi/acpi.h"
> +#include "hw/acpi/aml-build.h"
> +#include "hw/acpi/vmcoreinfo.h"
> +#include "hw/nvram/fw_cfg.h"
> +

Re: [Qemu-devel] [PATCH 4/7] dump: add vmcoreinfo ELF note

2017-07-04 Thread Laszlo Ersek
On 06/29/17 15:23, Marc-André Lureau wrote:
> Read the vmcoreinfo ELF PT_NOTE from guest memory when vmcoreinfo
> device provides the location, and write it as an ELF note in the dump.
> 
> There are now 2 possible sources of phys_base information.
> 
> (1) arch guessed value from arch_dump_info_get()

The function is called cpu_get_dump_info().

> (2) vmcoreinfo ELF note NUMBER(phys_base)= field
> 
> NUMBER(phys_base) in vmcoreinfo has only been recently introduced
> in Linux 4.10 (401721ecd1dc "kexec: export the value of phys_base
> instead of symbol address").
> 
> Since (2) has better chances to be accurate, the guessed value is
> replaced by the value from the vmcoreinfo ELF note.
> 
> The phys_base value is stored in the same dump field locations as
> before, and may duplicate the information available in the vmcoreinfo
> ELF PT_NOTE. Crash tools should be prepared to handle this case.
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  include/sysemu/dump.h |   2 +
>  dump.c| 117 ++
>  2 files changed, 119 insertions(+)
> 
> diff --git a/include/sysemu/dump.h b/include/sysemu/dump.h
> index 2672a15f8b..111a7dcaa4 100644
> --- a/include/sysemu/dump.h
> +++ b/include/sysemu/dump.h
> @@ -192,6 +192,8 @@ typedef struct DumpState {
>* this could be used to calculate
>* how much work we have
>* finished. */
> +uint8_t *vmcoreinfo; /* ELF note content */
> +size_t vmcoreinfo_size;
>  } DumpState;
>  
>  uint16_t cpu_to_dump16(DumpState *s, uint16_t val);
> diff --git a/dump.c b/dump.c
> index d9090a24cc..8fda5cc1ed 100644
> --- a/dump.c
> +++ b/dump.c
> @@ -26,6 +26,8 @@
>  #include "qapi/qmp/qerror.h"
>  #include "qmp-commands.h"
>  #include "qapi-event.h"
> +#include "qemu/error-report.h"
> +#include "hw/acpi/vmcoreinfo.h"
>  
>  #include 
>  #ifdef CONFIG_LZO
> @@ -38,6 +40,11 @@
>  #define ELF_MACHINE_UNAME "Unknown"
>  #endif
>  
> +#define ELF_NOTE_SIZE(hdr_size, name_size, desc_size)   \
> +((DIV_ROUND_UP((hdr_size), 4)   \
> +  + DIV_ROUND_UP((name_size), 4)\
> +  + DIV_ROUND_UP((desc_size), 4)) * 4)
> +

This looks really useful to me, but (I think?) we generally leave the
operator hanging at the end of the line:

#define ELF_NOTE_SIZE(hdr_size, name_size, desc_size) \
((DIV_ROUND_UP((hdr_size), 4) +   \
  DIV_ROUND_UP((name_size), 4) +  \
  DIV_ROUND_UP((desc_size), 4)) * 4)
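To see the arithmetic in action, here is the macro in the suggested layout as a standalone C fragment (DIV_ROUND_UP is spelled out, since qemu/osdep.h is not available outside the tree; the example sizes below are illustrative, not taken from the patch). Every part of a SysV ELF note — header, name, descriptor — is padded up to a 4-byte boundary:

```c
#include <assert.h>

/* Same definition as QEMU's DIV_ROUND_UP from qemu/osdep.h. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Total on-disk size of an ELF note: the header, the name and the
 * descriptor are each rounded up to a multiple of 4 bytes. */
#define ELF_NOTE_SIZE(hdr_size, name_size, desc_size) \
    ((DIV_ROUND_UP((hdr_size), 4) +   \
      DIV_ROUND_UP((name_size), 4) +  \
      DIV_ROUND_UP((desc_size), 4)) * 4)
```

For a 12-byte Elf64_Nhdr, the 11-byte name "VMCOREINFO\0" and a 100-byte descriptor this gives (3 + 3 + 25) * 4 = 124 bytes.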

>  uint16_t cpu_to_dump16(DumpState *s, uint16_t val)
>  {
>  if (s->dump_info.d_endian == ELFDATA2LSB) {
> @@ -76,6 +83,8 @@ static int dump_cleanup(DumpState *s)
>  guest_phys_blocks_free(&s->guest_phys_blocks);
>  memory_mapping_list_free(&s->list);
>  close(s->fd);
> +g_free(s->vmcoreinfo);
> +s->vmcoreinfo = NULL;
>  if (s->resume) {
>  if (s->detached) {
>  qemu_mutex_lock_iothread();
> @@ -235,6 +244,19 @@ static inline int cpu_index(CPUState *cpu)
>  return cpu->cpu_index + 1;
>  }
>  
> +static void write_vmcoreinfo_note(WriteCoreDumpFunction f, DumpState *s,
> +  Error **errp)
> +{
> +int ret;
> +
> +if (s->vmcoreinfo) {
> +ret = f(s->vmcoreinfo, s->vmcoreinfo_size, s);
> +if (ret < 0) {
> +error_setg(errp, "dump: failed to write vmcoreinfo");
> +}
> +}
> +}
> +
>  static void write_elf64_notes(WriteCoreDumpFunction f, DumpState *s,
>Error **errp)
>  {
> @@ -258,6 +280,8 @@ static void write_elf64_notes(WriteCoreDumpFunction f, 
> DumpState *s,
>  return;
>  }
>  }
> +
> +write_vmcoreinfo_note(f, s, errp);
>  }
>  
>  static void write_elf32_note(DumpState *s, Error **errp)
> @@ -303,6 +327,8 @@ static void write_elf32_notes(WriteCoreDumpFunction f, 
> DumpState *s,
>  return;
>  }
>  }
> +
> +write_vmcoreinfo_note(f, s, errp);
>  }
>  

Wait, I'm confused again. You explained why it was OK to hook this logic
into the kdump handling too, but I don't think I understand your
explanation, so let me repeat my confusion below :)

In the ELF case, this code works fine, I think. As long as the guest
provided us with a well-formed note, a well-formed note will be appended
to the ELF dump.

But, this code is also invoked in the kdump case, and I don't understand
why that's a good thing. If I understand the next patch correctly, the
kdump format already provides crash with a (trimmed) copy of the guest
kernel's vmcoreinfo note. So in the kdump case, why do we have to create
yet another copy of the guest kernel's vmcoreinfo note?

Thus, my confusion persists, and I can only think (again) that
write_vmcoreinfo_note() should be called from dump_begin() only (at the
end). (And the s->note_size adjustment should take that into account.)

Re: [Qemu-devel] [PATCH 5/7] kdump: add vmcoreinfo ELF note

2017-07-04 Thread Laszlo Ersek
On 06/29/17 15:23, Marc-André Lureau wrote:
> kdump header provides offset and size of the vmcoreinfo ELF note,
> append it if available.
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  dump.c | 48 
>  1 file changed, 44 insertions(+), 4 deletions(-)
> 
> diff --git a/dump.c b/dump.c
> index 8fda5cc1ed..b78bc1fda7 100644
> --- a/dump.c
> +++ b/dump.c
> @@ -788,8 +788,9 @@ static void create_header32(DumpState *s, Error **errp)
>  uint32_t sub_hdr_size;
>  uint32_t bitmap_blocks;
>  uint32_t status = 0;
> -uint64_t offset_note;
> +uint64_t offset_note, offset_vmcoreinfo, size_vmcoreinfo = 0;
>  Error *local_err = NULL;
> +uint8_t *vmcoreinfo = NULL;
>  
>  /* write common header, the version of kdump-compressed format is 6th */
>  size = sizeof(DiskDumpHeader32);
> @@ -838,7 +839,18 @@ static void create_header32(DumpState *s, Error **errp)
>  kh->phys_base = cpu_to_dump32(s, s->dump_info.phys_base);
>  kh->dump_level = cpu_to_dump32(s, DUMP_LEVEL);
>  
> -offset_note = DISKDUMP_HEADER_BLOCKS * block_size + size;
> +offset_vmcoreinfo = DISKDUMP_HEADER_BLOCKS * block_size + size;
> +if (s->vmcoreinfo) {
> +uint64_t hsize, name_size;
> +
> +get_note_sizes(s, s->vmcoreinfo, &hsize, &name_size, &size_vmcoreinfo);

Should we round up "size_vmcoreinfo" as well? (Without the rounding,
offset_note might become unaligned, plus I simply don't know what
alignment is expected from kh->size_vmcoreinfo.)

> +vmcoreinfo =
> +s->vmcoreinfo + (DIV_ROUND_UP(hsize, 4) + DIV_ROUND_UP(name_size, 4)) * 4;
> +kh->offset_vmcoreinfo = cpu_to_dump64(s, offset_vmcoreinfo);
> +kh->size_vmcoreinfo = cpu_to_dump32(s, size_vmcoreinfo);
> +}
> +
> +offset_note = offset_vmcoreinfo + size_vmcoreinfo;
>  kh->offset_note = cpu_to_dump64(s, offset_note);
>  kh->note_size = cpu_to_dump32(s, s->note_size);
>  
> @@ -848,6 +860,14 @@ static void create_header32(DumpState *s, Error **errp)
>  goto out;
>  }
>  
> +if (vmcoreinfo) {
> +if (write_buffer(s->fd, offset_vmcoreinfo, vmcoreinfo,
> + size_vmcoreinfo) < 0) {
> +error_setg(errp, "dump: failed to vmcoreinfo");

The verb "write" is missing from the message.

Same comments for create_header64() below.

Thanks
Laszlo

> +goto out;
> +}
> +}
> +
>  /* write note */
>  s->note_buf = g_malloc0(s->note_size);
>  s->note_buf_offset = 0;
> @@ -888,8 +908,9 @@ static void create_header64(DumpState *s, Error **errp)
>  uint32_t sub_hdr_size;
>  uint32_t bitmap_blocks;
>  uint32_t status = 0;
> -uint64_t offset_note;
> +uint64_t offset_note, offset_vmcoreinfo, size_vmcoreinfo = 0;
>  Error *local_err = NULL;
> +uint8_t *vmcoreinfo = NULL;
>  
>  /* write common header, the version of kdump-compressed format is 6th */
>  size = sizeof(DiskDumpHeader64);
> @@ -938,7 +959,18 @@ static void create_header64(DumpState *s, Error **errp)
>  kh->phys_base = cpu_to_dump64(s, s->dump_info.phys_base);
>  kh->dump_level = cpu_to_dump32(s, DUMP_LEVEL);
>  
> -offset_note = DISKDUMP_HEADER_BLOCKS * block_size + size;
> +offset_vmcoreinfo = DISKDUMP_HEADER_BLOCKS * block_size + size;
> +if (s->vmcoreinfo) {
> +uint64_t hsize, name_size;
> +
> +get_note_sizes(s, s->vmcoreinfo, &hsize, &name_size, &size_vmcoreinfo);
> +vmcoreinfo =
> +s->vmcoreinfo + (DIV_ROUND_UP(hsize, 4) + DIV_ROUND_UP(name_size, 4)) * 4;
> +kh->offset_vmcoreinfo = cpu_to_dump64(s, offset_vmcoreinfo);
> +kh->size_vmcoreinfo = cpu_to_dump64(s, size_vmcoreinfo);
> +}
> +
> +offset_note = offset_vmcoreinfo + size_vmcoreinfo;
>  kh->offset_note = cpu_to_dump64(s, offset_note);
>  kh->note_size = cpu_to_dump64(s, s->note_size);
>  
> @@ -948,6 +980,14 @@ static void create_header64(DumpState *s, Error **errp)
>  goto out;
>  }
>  
> +if (vmcoreinfo) {
> +if (write_buffer(s->fd, offset_vmcoreinfo, vmcoreinfo,
> + size_vmcoreinfo) < 0) {
> +error_setg(errp, "dump: failed to vmcoreinfo");
> +goto out;
> +}
> +}
> +
>  /* write note */
>  s->note_buf = g_malloc0(s->note_size);
>  s->note_buf_offset = 0;
> 




Re: [Qemu-devel] [PATCH 6/7] scripts/dump-guest-memory.py: add vmcoreinfo

2017-07-04 Thread Laszlo Ersek
On 06/29/17 15:23, Marc-André Lureau wrote:
> Add vmcoreinfo ELF note if vmcoreinfo device is ready.
> 
> To help the python script, add a little global vmcoreinfo_gdb
> structure, that is populated with vmcoreinfo_gdb_update().
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  scripts/dump-guest-memory.py | 32 
>  include/hw/acpi/vmcoreinfo.h |  1 +
>  hw/acpi/vmcoreinfo.c | 16 
>  3 files changed, 49 insertions(+)
> 
> diff --git a/scripts/dump-guest-memory.py b/scripts/dump-guest-memory.py
> index f7c6635f15..16c3d7cb10 100644
> --- a/scripts/dump-guest-memory.py
> +++ b/scripts/dump-guest-memory.py
> @@ -120,6 +120,20 @@ class ELF(object):
>  self.segments[0].p_filesz += ctypes.sizeof(note)
>  self.segments[0].p_memsz += ctypes.sizeof(note)
>  
> +
> +def add_vmcoreinfo_note(self, vmcoreinfo):
> +"""Adds a vmcoreinfo note to the ELF dump."""
> +chead = type(get_arch_note(self.endianness, 0, 0))
> +header = chead.from_buffer_copy(vmcoreinfo[0:ctypes.sizeof(chead)])
> +note = get_arch_note(self.endianness,
> + header.n_namesz - 1, header.n_descsz)
> +ctypes.memmove(ctypes.pointer(note), vmcoreinfo, ctypes.sizeof(note))
> +header_size = ctypes.sizeof(note) - header.n_descsz
> +
> +self.notes.append(note)
> +self.segments[0].p_filesz += ctypes.sizeof(note)
> +self.segments[0].p_memsz += ctypes.sizeof(note)
> +
>  def add_segment(self, p_type, p_paddr, p_size):
>  """Adds a segment to the elf."""
>  
> @@ -505,6 +519,23 @@ shape and this command should mostly work."""
>  cur += chunk_size
>  left -= chunk_size
>  
> +def add_vmcoreinfo(self):
> +qemu_core = gdb.inferiors()[0]
> +
> +gdb.execute("call vmcoreinfo_gdb_update()")

I think it's a bad idea to call a function from a process that's just
crashed.

If this feature is so important, maybe we can simply set a global
pointer variable at the end of vmcoreinfo_realize(); something like:

static void vmcoreinfo_realize(DeviceState *dev, Error **errp)
{
static VmcoreinfoState * volatile vmcoreinfo_gdb_helper;
[...]
vmcoreinfo_gdb_helper = VMCOREINFO(dev);
}

- vmcoreinfo_gdb_helper has function scope, so no other code can abuse
  it
- it has static storage duration so gdb can access it at any time
- the pointer (not the pointed-to object) is qualified volatile, so gcc
  cannot optimize out the pointer assignment (which it might be tempted
  to do otherwise, due to the pointer never being read within QEMU)

Then you can use "vmcoreinfo_gdb_helper->vmcoreinfo_addr_le" to
implement all the logic in "dump-guest-memory.py".
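A minimal, self-contained sketch of that suggestion (the DevState type and function names below are hypothetical stand-ins; in QEMU the assignment would sit at the end of vmcoreinfo_realize() with the real VmcoreinfoState):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMU's VmcoreinfoState. */
typedef struct DevState {
    uint8_t vmcoreinfo_addr_le[8];
} DevState;

/* Sketch of the realize-time hook.  The helper pointer:
 *  - has function scope, so no other code can abuse it;
 *  - has static storage duration, so gdb can read it from a core
 *    dump at any time (e.g. "print dev_realize::gdb_helper");
 *  - is itself volatile-qualified, so the compiler may not discard
 *    the store even though the program never reads the pointer. */
static DevState * volatile *dev_realize(DevState *dev)
{
    static DevState * volatile gdb_helper;

    gdb_helper = dev;
    return &gdb_helper;   /* returned here only so the sketch is testable */
}
```

With this in place, the python script can follow gdb_helper->vmcoreinfo_addr_le without ever executing code inside the crashed process.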

Just my two cents, of course.

Thanks
Laszlo

> +avail = gdb.parse_and_eval("vmcoreinfo_gdb.available")
> +if not avail:
> +return;
> +
> +addr = gdb.parse_and_eval("vmcoreinfo_gdb.paddr")
> +size = gdb.parse_and_eval("vmcoreinfo_gdb.size")
> +for block in self.guest_phys_blocks:
> +if block["target_start"] <= addr < block["target_end"]:
> +haddr = block["host_addr"] + (addr - block["target_start"])
> +vmcoreinfo = qemu_core.read_memory(haddr, size)
> +self.elf.add_vmcoreinfo_note(vmcoreinfo.tobytes())
> +return
> +
>  def invoke(self, args, from_tty):
>  """Handles command invocation from gdb."""
>  
> @@ -518,6 +549,7 @@ shape and this command should mostly work."""
>  
>  self.elf = ELF(argv[1])
>  self.guest_phys_blocks = get_guest_phys_blocks()
> +self.add_vmcoreinfo()
>  
>  with open(argv[0], "wb") as vmcore:
>  self.dump_init(vmcore)
> diff --git a/include/hw/acpi/vmcoreinfo.h b/include/hw/acpi/vmcoreinfo.h
> index 40fe99c3ed..4efa678237 100644
> --- a/include/hw/acpi/vmcoreinfo.h
> +++ b/include/hw/acpi/vmcoreinfo.h
> @@ -32,5 +32,6 @@ void vmcoreinfo_build_acpi(VmcoreinfoState *vis, GArray *table_data,
>  void vmcoreinfo_add_fw_cfg(VmcoreinfoState *vis, FWCfgState *s, GArray *vmci);
>  bool vmcoreinfo_get(VmcoreinfoState *vis, uint64_t *paddr, uint64_t *size,
>  Error **errp);
> +void vmcoreinfo_gdb_update(void);
>  
>  #endif
> diff --git a/hw/acpi/vmcoreinfo.c b/hw/acpi/vmcoreinfo.c
> index 216e0bb83a..75e3330813 100644
> --- a/hw/acpi/vmcoreinfo.c
> +++ b/hw/acpi/vmcoreinfo.c
> @@ -145,6 +145,22 @@ bool vmcoreinfo_get(VmcoreinfoState *vis,
>  return true;
>  }
>  
> +struct vmcoreinfo_gdb {
> +bool available;
> +uint64_t paddr;
> +uint64_t size;
> +} vmcoreinfo_gdb;
> +
> +void vmcoreinfo_gdb_update(void)
> +{
> +Object *vmci = find_vmcoreinfo_dev();
> +
> +vmcoreinfo_gdb.available = vmci ?
> +vmcoreinfo_get(VMCOREINFO(vmci),
> +   &vmcoreinfo_gdb.paddr, &vmcoreinfo_gdb.size, NULL)
> +: false;
> +}
> +
>  static const V

Re: [Qemu-devel] [PATCH 7/7] MAINTAINERS: add Dump maintainers

2017-07-04 Thread Laszlo Ersek
On 06/29/17 15:23, Marc-André Lureau wrote:
> Proposing myself, since I have some familiarity with the code now.
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  MAINTAINERS | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 839f7ca063..45a0eb4cb0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1272,6 +1272,13 @@ S: Maintained
>  F: device_tree.c
>  F: include/sysemu/device_tree.h
>  
> +Dump
> +S: Supported
> +M: Marc-André Lureau 
> +F: dump.c
> +F: include/sysemu/dump.h
> +F: include/sysemu/dump-arch.h
> +
>  Error reporting
>  M: Markus Armbruster 
>  S: Supported
> 

That's very kind of you, thanks!

Do you have suggestions for the following files?

scripts/dump-guest-memory.py
stubs/dump.c
target/arm/arch_dump.c
target/i386/arch_dump.c
target/ppc/arch_dump.c
target/s390x/arch_dump.c

(If not, I'm OK with that too.)

Thanks!
Laszlo



Re: [Qemu-devel] change x86 default machine type to Q35?

2017-07-05 Thread Laszlo Ersek
On 07/05/17 08:57, Chao Peng wrote:
> Hi,
>  
> Q35 has been in QEMU for quite a while. Compared to the current default
> i440FX, Q35 is probably not that mature and not widely used, however in
> some case, Q35 has advantages, for example, in supporting new features.
> For instance, we have some features require PCI-e support which is only
> available on Q35 and some others need it for EFI support.

To be a bit more precise: OVMF's default upstream build works fine with
i440fx. But, if you build OVMF with "-D SMM_REQUIRE" -- which is
required for making "-D SECURE_BOOT_ENABLE" actually secure --, then you
do need q35 (the firmware binary won't even boot on i440fx past a
certain point), because only q35 provides SMM emulation.

Thanks
Laszlo

> It is of
> course not necessary to change it as the default but if more and more
> features have dependencies on Q35 because of requiring much more modern
> features then I think it may be worth to do so. In such case we can have
> more people to use it and find problems we may know or not know. There
> are certainly some drawbacks:
> -Compatibility: current code or script may need adjustment
> -Quality: we may suffer more bugs on Q35
>  
> Any thoughts?
>  
> Chao
> 




Re: [Qemu-devel] [PATCH 6/7] scripts/dump-guest-memory.py: add vmcoreinfo

2017-07-05 Thread Laszlo Ersek
On 07/05/17 11:58, Marc-André Lureau wrote:
> Hi
> 
> On Wed, Jul 5, 2017 at 2:22 AM, Laszlo Ersek  wrote:
>> On 06/29/17 15:23, Marc-André Lureau wrote:
>>> Add vmcoreinfo ELF note if vmcoreinfo device is ready.
>>>
>>> To help the python script, add a little global vmcoreinfo_gdb
>>> structure, that is populated with vmcoreinfo_gdb_update().
>>>
>>> Signed-off-by: Marc-André Lureau 
>>> ---
>>>  scripts/dump-guest-memory.py | 32 
>>>  include/hw/acpi/vmcoreinfo.h |  1 +
>>>  hw/acpi/vmcoreinfo.c | 16 
>>>  3 files changed, 49 insertions(+)
>>>
>>> diff --git a/scripts/dump-guest-memory.py b/scripts/dump-guest-memory.py
>>> index f7c6635f15..16c3d7cb10 100644
>>> --- a/scripts/dump-guest-memory.py
>>> +++ b/scripts/dump-guest-memory.py
>>> @@ -120,6 +120,20 @@ class ELF(object):
>>>  self.segments[0].p_filesz += ctypes.sizeof(note)
>>>  self.segments[0].p_memsz += ctypes.sizeof(note)
>>>
>>> +
>>> +def add_vmcoreinfo_note(self, vmcoreinfo):
>>> +"""Adds a vmcoreinfo note to the ELF dump."""
>>> +chead = type(get_arch_note(self.endianness, 0, 0))
>>> +header = chead.from_buffer_copy(vmcoreinfo[0:ctypes.sizeof(chead)])
>>> +note = get_arch_note(self.endianness,
>>> + header.n_namesz - 1, header.n_descsz)
>>> +ctypes.memmove(ctypes.pointer(note), vmcoreinfo, ctypes.sizeof(note))
>>> +header_size = ctypes.sizeof(note) - header.n_descsz
>>> +
>>> +self.notes.append(note)
>>> +self.segments[0].p_filesz += ctypes.sizeof(note)
>>> +self.segments[0].p_memsz += ctypes.sizeof(note)
>>> +
>>>  def add_segment(self, p_type, p_paddr, p_size):
>>>  """Adds a segment to the elf."""
>>>
>>> @@ -505,6 +519,23 @@ shape and this command should mostly work."""
>>>  cur += chunk_size
>>>  left -= chunk_size
>>>
>>> +def add_vmcoreinfo(self):
>>> +qemu_core = gdb.inferiors()[0]
>>> +
>>> +gdb.execute("call vmcoreinfo_gdb_update()")
>>
>> I think it's a bad idea to call a function from a process that's just
>> crashed.
> 
> Yeah, if qemu crashed you can't use that script. But we are talking
> about dump of guest kernel, so qemu didn't crash :)

I think we have a misunderstanding here. Extracting the guest kernel
core from the core dump of a crashed QEMU process is the *only* purpose
of the GDB extension implemented in "dump-guest-memory.py".

In other words, if you are loading "dump-guest-memory.py" into gdb, then
QEMU crashed *by definition*. Because otherwise you'd dump the guest
kernel core using the live monitor commands (HMP or QMP).

So, "dump-guest-memory.py" is really a last resort utility, for the case
when the guest kernel does something "interesting" that crashes QEMU
with hopefully localized damage, and most of the data structures
hopefully remain usable. It is not guaranteed at all that
"dump-guest-memory.py" will produce anything useful, dependent on how
corrupt the QEMU process memory is at the time of the SIGSEGV or SIGABRT
(or another fatal signal).

Please see the message on original QEMU commit 3e16d14fd93c
("Python-lang gdb script to extract x86_64 guest vmcore from qemu
coredump", 2013-12-17).

See also the original RFE -- I apologize to non-Red Hatters, the RHBZ is
private because it was filed by a customer --:

https://bugzilla.redhat.com/show_bug.cgi?id=826266

In my opinion, poking at possibly corrupt data structures with the
python script is OK, while executing code directly from the crashed
image is too much. But, again, that's just my opinion.

> 
>>
>> If this feature is so important, maybe we can simply set a global
>> pointer variable at the end of vmcoreinfo_realize(); something like:
>>
>> static void vmcoreinfo_realize(DeviceState *dev, Error **errp)
>> {
>> static VmcoreinfoState * volatile vmcoreinfo_gdb_helper;
>> [...]
>> vmcoreinfo_gdb_helper = VMCOREINFO(dev);
>> }
>>
>> - vmcoreinfo_gdb_helper has function scope, so no other code can abuse
>>   it
>> - it has static storage duration so gdb can access it at any time
>> - the pointer (not the pointed-to object) is qualified volatile, so gcc
>>   cannot optimize out the pointer assignment (which it might be tempted
>>   to do otherwise, due to the pointer never being read within QEMU)
>>
>> Then you can use "vmcoreinfo_gdb_helper->vmcoreinfo_addr_le" to
>> implement all the logic in "dump-guest-memory.py".
> 
> If necessary, I can try that.

Well, I can't claim it is "objectively necessary"; it's just that this
method would satisfy my preferences above (i.e., poking at process data
from python: OK; running code from the process: not OK).

Thanks,
Laszlo



Re: [Qemu-devel] The maximum limit of virtual network device

2017-07-06 Thread Laszlo Ersek
Hi Jiaxin,

it's nice to see a question from you on qemu-devel! :)

On 07/06/17 08:20, Wu, Jiaxin wrote:
> Hello experts,
>
> We know QEMU has the capability to create the multiple network devices
> in one QEMU guest with the -device syntax. But I met the below failure
> when I'm trying to create more than 30 virtual devices with the each
> TAP backend:
>
> qemu-system-x86_64: -device e1000: PCI: no slot/function available for
> e1000, all in use.
>
> The corresponding QEMU command shows as following:
>
> sudo qemu-system-x86_64 \
>   -pflash OVMF.fd \
>   -global e1000.romfile="" \
>   -netdev tap,id=hostnet0,ifname=tap0,script=no,downscript=no \
>   -device e1000,netdev=hostnet0 \
>   -netdev tap,id=hostnet1,ifname=tap1,script=no,downscript=no \
>   -device e1000,netdev=hostnet1 \
>   -netdev tap,id=hostnet2,ifname=tap2,script=no,downscript=no \
>   -device e1000,netdev=hostnet2 \
>   -netdev tap,id=hostnet3,ifname=tap3,script=no,downscript=no \
>   -device e1000,netdev=hostnet3 \
>   -netdev tap,id=hostnet4,ifname=tap4,script=no,downscript=no \
>   -device e1000,netdev=hostnet4 \
>   -netdev tap,id=hostnet5,ifname=tap5,script=no,downscript=no \
>   -device e1000,netdev=hostnet5 \
>   -netdev tap,id=hostnet6,ifname=tap6,script=no,downscript=no \
>   -device e1000,netdev=hostnet6 \
>   -netdev tap,id=hostnet7,ifname=tap7,script=no,downscript=no \
>   -device e1000,netdev=hostnet7 \
>   -netdev tap,id=hostnet8,ifname=tap8,script=no,downscript=no \
>   -device e1000,netdev=hostnet8 \
>   -netdev tap,id=hostnet9,ifname=tap9,script=no,downscript=no \
>   -device e1000,netdev=hostnet9 \
>   -netdev tap,id=hostnet10,ifname=tap10,script=no,downscript=no \
>   -device e1000,netdev=hostnet10 \
>   -netdev tap,id=hostnet11,ifname=tap11,script=no,downscript=no \
>   -device e1000,netdev=hostnet11 \
>   -netdev tap,id=hostnet12,ifname=tap12,script=no,downscript=no \
>   -device e1000,netdev=hostnet12 \
>   -netdev tap,id=hostnet13,ifname=tap13,script=no,downscript=no \
>   -device e1000,netdev=hostnet13 \
>   -netdev tap,id=hostnet14,ifname=tap14,script=no,downscript=no \
>   -device e1000,netdev=hostnet14 \
>   -netdev tap,id=hostnet15,ifname=tap15,script=no,downscript=no \
>   -device e1000,netdev=hostnet15 \
>   -netdev tap,id=hostnet16,ifname=tap16,script=no,downscript=no \
>   -device e1000,netdev=hostnet16 \
>   -netdev tap,id=hostnet17,ifname=tap17,script=no,downscript=no \
>   -device e1000,netdev=hostnet17 \
>   -netdev tap,id=hostnet18,ifname=tap18,script=no,downscript=no \
>   -device e1000,netdev=hostnet18 \
>   -netdev tap,id=hostnet19,ifname=tap19,script=no,downscript=no \
>   -device e1000,netdev=hostnet19 \
>   -netdev tap,id=hostnet20,ifname=tap20,script=no,downscript=no \
>   -device e1000,netdev=hostnet20 \
>   -netdev tap,id=hostnet21,ifname=tap21,script=no,downscript=no \
>   -device e1000,netdev=hostnet21 \
>   -netdev tap,id=hostnet22,ifname=tap22,script=no,downscript=no \
>   -device e1000,netdev=hostnet22 \
>   -netdev tap,id=hostnet23,ifname=tap23,script=no,downscript=no \
>   -device e1000,netdev=hostnet23 \
>   -netdev tap,id=hostnet24,ifname=tap24,script=no,downscript=no \
>   -device e1000,netdev=hostnet24 \
>   -netdev tap,id=hostnet25,ifname=tap25,script=no,downscript=no \
>   -device e1000,netdev=hostnet25 \
>   -netdev tap,id=hostnet26,ifname=tap26,script=no,downscript=no \
>   -device e1000,netdev=hostnet26 \
>   -netdev tap,id=hostnet27,ifname=tap27,script=no,downscript=no \
>   -device e1000,netdev=hostnet27 \
>   -netdev tap,id=hostnet28,ifname=tap28,script=no,downscript=no \
>   -device e1000,netdev=hostnet28 \
>   -netdev tap,id=hostnet29,ifname=tap29,script=no,downscript=no \
>   -device e1000,netdev=hostnet29
>
> From above,  the max limit of virtual network device in one guest is
> about 29? If not, how can I avoid such failure? My use case is to
> create more than 150 network devices in one guest. Please provide your
> comments on this.

You are seeing the above symptom because the above command line
instructs QEMU to do the following:
- use the i440fx machine type,
- use a single PCI bus (= the main root bridge),
- add the e1000 cards to separate slots (always using function 0) on
  that bus.

Accordingly, there are three things you can do to remedy this:

- Use the Q35 machine type and work with a PCI Express hierarchy rather
  than a PCI hierarchy. I'm mentioning this only for completeness,
  because it won't directly help your use case. But, I certainly want to
  highlight "docs/pcie.txt". Please read it sometime; it has nice
  examples and makes good points.

- Use multiple PCI bridges to attach the devices. For this, several ways
  are possible:

  - use multiple root buses, with the pxb or pxb-pcie devices (see
"docs/pci_expander_bridge.txt" and "docs/pcie.txt")

  - use multiple normal PCI bridges

  - use multiple PCI Express root ports or downstream ports (but for
this, you'll likely have to use the PCI Express variant of the
e1000, namely e1000e)

- I

Re: [Qemu-devel] The maximum limit of virtual network device

2017-07-06 Thread Laszlo Ersek
On 07/06/17 11:24, Marcel Apfelbaum wrote:
> On 06/07/2017 11:31, Laszlo Ersek wrote:

>> Now, I would normally recommend sticking with i440fx for simplicity.
>> However, each PCI bridge requires 4KB of IO space (meaning (1 + 5) * 4KB
>> = 24KB),  and OVMF on the i440fx does not support that much (only
>> 0x4000). So, I'll recommend Q35 for IO space purposes; OVMF on Q35
>> provides 0xA000 (40KB).
> 
> So if we use OVMF, going for Q35 gives us actually more IO space, nice!
> However recommending Q35 for IO space seems odd :)

OVMF used to have only 0x4000 bytes for PCI IO aperture, since the
beginning. In <https://bugzilla.redhat.com/show_bug.cgi?id=1333238> I
investigated how much I could grow the aperture. On Q35, it was possible
to grow it to 0xA000 bytes (but even then you have to disable vmport,
which sort of sits in the middle otherwise). On i440fx, the IO ports in
use by platform devices were so badly distributed that moving beyond
0x4000 was not possible. See in particular:

https://bugzilla.redhat.com/show_bug.cgi?id=1333238#c16
https://bugzilla.redhat.com/show_bug.cgi?id=1333238#c19
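The aperture arithmetic in the two paragraphs above can be checked in isolation; the constants below simply restate the figures from this thread (4 KB of IO space per bridge, one DMI-PCI bridge plus five PCI bridges, and the two OVMF apertures):

```c
#include <assert.h>

/* IO-space figures quoted in the discussion; nothing here is read
 * from QEMU or OVMF itself. */
enum {
    PCI_BRIDGE_IO  = 4 * 1024,   /* 4 KB IO window per PCI bridge    */
    BRIDGES_NEEDED = 1 + 5,      /* dmi-pci bridge + 5 PCI bridges   */
    OVMF_I440FX_IO = 0x4000,     /* 16 KB PCI IO aperture on i440fx  */
    OVMF_Q35_IO    = 0xA000,     /* 40 KB PCI IO aperture on Q35     */
};
```

The 24 KB the bridges need does not fit in the 16 KB i440fx aperture but fits comfortably in the 40 KB Q35 one — hence the (admittedly odd-sounding) recommendation of Q35 for IO space.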

>> Therefore I guess the simplest example I can give now is:
>> - use Q35 (for a larger IO space),
>> - plug a DMI-PCI bridge into the root bridge,
>> - plug 5 PCI bridges into the DMI-PCI bridge,
>> - plug 31 NICs per PCI bridge, each NIC into a separate slot.
>>
> 
> The setup looks OK to me (assuming OVMF is needed, otherwise
> PC + pci-bridges will result in more devices),

OVMF is quite needed; Jiaxin is one of the edk2 networking maintainers,
and I think he's using QEMU and OVMF as a testbed for otherwise
physically-oriented UEFI development.

> I do have a little concern.
> We want to deprecate the dmi-pci bridge since it does not support
> hot-plug (for itself or devices behind it).
> Alexandr (CCed) is a GSOC student working on a generic
> pcie-pci bridge that can (eventually) be hot-plugged
> into a PCIe Root Port and keeps the machine cleaner.

Nice!

Please include an update to "docs/pcie.txt" in the scope :)

Thanks!
Laszlo



Re: [Qemu-devel] [PATCH 7/7] MAINTAINERS: add Dump maintainers

2017-07-06 Thread Laszlo Ersek
On 07/06/17 11:54, Marc-André Lureau wrote:
> Hi
> 
> - Original Message -
>> On 06/29/17 15:23, Marc-André Lureau wrote:
>>> Proposing myself, since I have some familiarity with the code now.
>>>
>>> Signed-off-by: Marc-André Lureau 
>>> ---
>>>  MAINTAINERS | 7 +++
>>>  1 file changed, 7 insertions(+)
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 839f7ca063..45a0eb4cb0 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -1272,6 +1272,13 @@ S: Maintained
>>>  F: device_tree.c
>>>  F: include/sysemu/device_tree.h
>>>  
>>> +Dump
>>> +S: Supported
>>> +M: Marc-André Lureau 
>>> +F: dump.c
>>> +F: include/sysemu/dump.h
>>> +F: include/sysemu/dump-arch.h
>>> +
>>>  Error reporting
>>>  M: Markus Armbruster 
>>>  S: Supported
>>>
>>
>> That's very kind of you, thanks!
>>
>> Do you have suggestions for the following files?
>>
>> scripts/dump-guest-memory.py
> 
> This one is yours, no? :) But I am ok to "support" it, meaning I'll take time 
> to review the patches, and eventually make the pull-requests.

It used to be "mine" until (a) it got rewritten in object-oriented
Python, and (b) it received multi-arch support ;)

>  
>> stubs/dump.c
> 
> I can add that one, although it is also maintained by Paolo
> 
>> target/arm/arch_dump.c
>> target/i386/arch_dump.c
>> target/ppc/arch_dump.c
>> target/s390x/arch_dump.c
> 
> I'd rather have those maintained by the respective arch maintainers, as they 
> are. But I imagine it would make sense to also cover them with the rest of 
> dump.

Yeah there's an overlap here -- I'm not suggesting that you take on
everything, just curious what you think. I'm fine with this patch as it is.

Thanks
Laszlo



Re: [Qemu-devel] [PATCH 4/7] dump: add vmcoreinfo ELF note

2017-07-06 Thread Laszlo Ersek
On 07/05/17 23:52, Marc-André Lureau wrote:
> Hi
> 
> On Wed, Jul 5, 2017 at 1:48 AM, Laszlo Ersek  wrote:
>> On 06/29/17 15:23, Marc-André Lureau wrote:

>>> @@ -258,6 +280,8 @@ static void write_elf64_notes(WriteCoreDumpFunction f, 
>>> DumpState *s,
>>>  return;
>>>  }
>>>  }
>>> +
>>> +write_vmcoreinfo_note(f, s, errp);
>>>  }
>>>
>>>  static void write_elf32_note(DumpState *s, Error **errp)
>>> @@ -303,6 +327,8 @@ static void write_elf32_notes(WriteCoreDumpFunction f, 
>>> DumpState *s,
>>>  return;
>>>  }
>>>  }
>>> +
>>> +write_vmcoreinfo_note(f, s, errp);
>>>  }
>>>
>>
>> Wait, I'm confused again. You explained why it was OK to hook this logic
>> into the kdump handling too, but I don't think I understand your
>> explanation, so let me repeat my confusion below :)
>>
>> In the ELF case, this code works fine, I think. As long as the guest
>> provided us with a well-formed note, a well-formed note will be appended
>> to the ELF dump.
>>
>> But, this code is also invoked in the kdump case, and I don't understand
>> why that's a good thing. If I understand the next patch correctly, the
>> kdump format already provides crash with a (trimmed) copy of the guest
>> kernel's vmcoreinfo note. So in the kdump case, why do we have to create
>> yet another copy of the guest kernel's vmcoreinfo note?
>>
>> Thus, my confusion persists, and I can only think (again) that
>> write_vmcoreinfo_note() should be called from dump_begin() only (at the
>> end). (And the s->note_size adjustment should take that into account.)
>>
>> Alternatively, we should keep this logic, and drop patch #5.
> 
> You are right, sorry for my misunderstanding, although crash seems
> fine as long as sizes/offsets are ok.
> 
> So, instead of duplicating the info, let's keep the complete ELF note
> (with the header) in the dump, and simply adjust kdump header to point
> to the adjusted offset/size. This has the advantage of not making the
> code unnecessarily more complicated wrt s->note_size handling etc, and
> is consistent with the rest of the elf notes.

I guess that's OK, although I don't really see why even that is
necessary. It should be possible to find the vmcoreinfo note in the
kdump output along with the rest of the ELF notes, even without pointing
some kdump header fields at that specific note.
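To make the point concrete: locating a named note by walking a notes region only needs the standard note framing, no extra header pointers. A minimal sketch (little-endian Elf64 assumed; this is illustrative code, not QEMU or crash source):

```python
import struct

def iter_elf64_notes(buf):
    # Minimal little-endian Elf64 note walker: each note is a 12-byte
    # header (namesz, descsz, type as uint32), followed by the name and
    # the desc, each padded to a 4-byte boundary.
    off = 0
    while off + 12 <= len(buf):
        namesz, descsz, n_type = struct.unpack_from("<III", buf, off)
        off += 12
        name = bytes(buf[off:off + namesz]).rstrip(b"\0")
        off += (namesz + 3) & ~3
        desc = bytes(buf[off:off + descsz])
        off += (descsz + 3) & ~3
        yield name, n_type, desc

def make_note(name, n_type, desc):
    # Build a well-formed note; the name gets a trailing NUL, and both
    # variable parts are padded to 4 bytes.
    name_z = name + b"\0"
    pad = lambda b: b + b"\0" * (-len(b) % 4)
    return struct.pack("<III", len(name_z), len(desc), n_type) \
        + pad(name_z) + pad(desc)

# A tiny notes region: a CPU note followed by a VMCOREINFO note.
notes = make_note(b"CORE", 1, b"\0" * 8) \
      + make_note(b"VMCOREINFO", 0, b"NUMBER(phys_base)=0\n")
found = [desc for name, n_type, desc in iter_elf64_notes(notes)
         if name == b"VMCOREINFO"]
```

A consumer that scans like this finds the vmcoreinfo data purely from the note name, which is the point being argued above.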

Snipping the rest because I'm going to see updates on those things in v2
anyway (which you've just posted).

Thanks
Laszlo



Re: [Qemu-devel] [PATCH v2 1/7] vmgenid: replace x-write-pointer-available hack

2017-07-06 Thread Laszlo Ersek
On 07/06/17 12:16, Marc-André Lureau wrote:
> This compat property's sole function is to prevent the device from being
> instantiated. Instead of requiring an extra compat property, check if
> fw_cfg has DMA enabled.
> 
> fw_cfg is a built-in device that is initialized very early by the
> machine init code.  We have at least one other device that also
> assumes fw_cfg_find() can be safely used on realize: pvpanic.
> 
> This has the additional benefit of handling other cases properly, like:
> 
>   $ qemu-system-x86_64 -device vmgenid -machine none
>   qemu-system-x86_64: -device vmgenid: vmgenid requires DMA write support in 
> fw_cfg, which this machine type does not provide
>   $ qemu-system-x86_64 -device vmgenid -machine pc-i440fx-2.9 -global 
> fw_cfg.dma_enabled=off
>   qemu-system-x86_64: -device vmgenid: vmgenid requires DMA write support in 
> fw_cfg, which this machine type does not provide
>   $ qemu-system-x86_64 -device vmgenid -machine pc-i440fx-2.6 -global 
> fw_cfg.dma_enabled=on
>   [boots normally]
> 
> Suggested-by: Eduardo Habkost 
> Signed-off-by: Marc-André Lureau 
> Reviewed-by: Michael S. Tsirkin 
> Reviewed-by: Eduardo Habkost 
> Reviewed-by: Ben Warren 
> ---
>  include/hw/acpi/bios-linker-loader.h |  2 ++
>  include/hw/compat.h  |  4 
>  hw/acpi/bios-linker-loader.c | 10 ++
>  hw/acpi/vmgenid.c|  9 +
>  4 files changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/include/hw/acpi/bios-linker-loader.h 
> b/include/hw/acpi/bios-linker-loader.h
> index efe17b0b9c..a711dbced8 100644
> --- a/include/hw/acpi/bios-linker-loader.h
> +++ b/include/hw/acpi/bios-linker-loader.h
> @@ -7,6 +7,8 @@ typedef struct BIOSLinker {
>  GArray *file_list;
>  } BIOSLinker;
>  
> +bool bios_linker_loader_can_write_pointer(void);
> +
>  BIOSLinker *bios_linker_loader_init(void);
>  
>  void bios_linker_loader_alloc(BIOSLinker *linker,
> diff --git a/include/hw/compat.h b/include/hw/compat.h
> index 08f36004da..f414786604 100644
> --- a/include/hw/compat.h
> +++ b/include/hw/compat.h
> @@ -150,10 +150,6 @@
>  .driver   = "fw_cfg_io",\
>  .property = "dma_enabled",\
>  .value= "off",\
> -},{\
> -.driver   = "vmgenid",\
> -.property = "x-write-pointer-available",\
> -.value= "off",\
>  },
>  
>  #define HW_COMPAT_2_3 \
> diff --git a/hw/acpi/bios-linker-loader.c b/hw/acpi/bios-linker-loader.c
> index 046183a0f1..388d932538 100644
> --- a/hw/acpi/bios-linker-loader.c
> +++ b/hw/acpi/bios-linker-loader.c
> @@ -168,6 +168,16 @@ bios_linker_find_file(const BIOSLinker *linker, const 
> char *name)
>  return NULL;
>  }
>  
> +/*
> + * board code must realize fw_cfg first, as a fixed device, before
> + * another device realize function call 
> bios_linker_loader_can_write_pointer()
> +*/

The closing "*/" is not correctly indented.

With that fixed,

Reviewed-by: Laszlo Ersek 

Thanks
Laszlo

> +bool bios_linker_loader_can_write_pointer(void)
> +{
> +FWCfgState *fw_cfg = fw_cfg_find();
> +return fw_cfg && fw_cfg_dma_enabled(fw_cfg);
> +}
> +
>  /*
>   * bios_linker_loader_alloc: ask guest to load file into guest memory.
>   *
> diff --git a/hw/acpi/vmgenid.c b/hw/acpi/vmgenid.c
> index a32b847fe0..ab5da293fd 100644
> --- a/hw/acpi/vmgenid.c
> +++ b/hw/acpi/vmgenid.c
> @@ -205,17 +205,11 @@ static void vmgenid_handle_reset(void *opaque)
>  memset(vms->vmgenid_addr_le, 0, ARRAY_SIZE(vms->vmgenid_addr_le));
>  }
>  
> -static Property vmgenid_properties[] = {
> -DEFINE_PROP_BOOL("x-write-pointer-available", VmGenIdState,
> - write_pointer_available, true),
> -DEFINE_PROP_END_OF_LIST(),
> -};
> -
>  static void vmgenid_realize(DeviceState *dev, Error **errp)
>  {
>  VmGenIdState *vms = VMGENID(dev);
>  
> -if (!vms->write_pointer_available) {
> +if (!bios_linker_loader_can_write_pointer()) {
>  error_setg(errp, "%s requires DMA write support in fw_cfg, "
> "which this machine type does not provide", 
> VMGENID_DEVICE);
>  return;
> @@ -239,7 +233,6 @@ static void vmgenid_device_class_init(ObjectClass *klass, 
> void *data)
>  dc->vmsd = &vmstate_vmgenid;
>  dc->realize = vmgenid_realize;
>  dc->hotpluggable = false;
> -dc->props = vmgenid_properties;
>  
>  object_class_property_add_str(klass, VMGENID_GUID, NULL,
>vmgenid_set_guid, NULL);
> 




Re: [Qemu-devel] [PATCH v2 2/7] acpi: add vmcoreinfo device

2017-07-06 Thread Laszlo Ersek
On 07/06/17 12:16, Marc-André Lureau wrote:
> The VM coreinfo (vmcoreinfo) device is an emulated device which
> exposes a 4k memory range to the guest to store various information
> useful for debugging the guest OS. (It is greatly inspired by the
> VMGENID device implementation.)
> 
> This is an early-boot alternative to the qemu-ga VMDUMP_INFO event
> proposed in "[PATCH 00/21] WIP: dump: add kaslr support".
> 
> A proof-of-concept kernel module:
> https://github.com/elmarco/vmgenid-test/blob/master/qemuvmci-test.c
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  include/hw/acpi/aml-build.h|   1 +
>  include/hw/acpi/vmcoreinfo.h   |  36 +++
>  hw/acpi/aml-build.c|   2 +
>  hw/acpi/vmcoreinfo.c   | 208 
> +
>  hw/i386/acpi-build.c   |  14 +++
>  default-configs/arm-softmmu.mak|   1 +
>  default-configs/i386-softmmu.mak   |   1 +
>  default-configs/x86_64-softmmu.mak |   1 +
>  docs/specs/vmcoreinfo.txt  | 138 
>  hw/acpi/Makefile.objs  |   1 +
>  10 files changed, 403 insertions(+)
>  create mode 100644 include/hw/acpi/vmcoreinfo.h
>  create mode 100644 hw/acpi/vmcoreinfo.c
>  create mode 100644 docs/specs/vmcoreinfo.txt

Reviewed-by: Laszlo Ersek 




Re: [Qemu-devel] [PATCH v2 3/7] tests: add simple vmcoreinfo test

2017-07-06 Thread Laszlo Ersek
I didn't look at this patch in the previous version, hoping that someone
else would. Apparently that hasn't happened, so I'll comment, purely
based on a comparison with the vmgenid test:

On 07/06/17 12:16, Marc-André Lureau wrote:
> This test is based on the vmgenid test from Ben Warren
> . It simply checks that the vmcoreinfo ACPI device
> is present and that the associated memory region can be read.
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  tests/vmcoreinfo-test.c | 130 
> 
>  tests/Makefile.include  |   2 +
>  2 files changed, 132 insertions(+)
>  create mode 100644 tests/vmcoreinfo-test.c
> 
> diff --git a/tests/vmcoreinfo-test.c b/tests/vmcoreinfo-test.c
> new file mode 100644
> index 00..c8b073366e
> --- /dev/null
> +++ b/tests/vmcoreinfo-test.c
> @@ -0,0 +1,130 @@
> +/*
> + * QTest testcase for VM coreinfo device
> + *
> + * Copyright (c) 2017 Red Hat, Inc.
> + * Copyright (c) 2017 Skyport Systems
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include "qemu/osdep.h"
> +#include "qemu/bitmap.h"
> +#include "qemu/uuid.h"
> +#include "hw/acpi/acpi-defs.h"
> +#include "acpi-utils.h"
> +#include "libqtest.h"
> +
> +#define VMCOREINFO_OFFSET 40 /* allow space for
> +  * OVMF SDT Header Probe Suppressor
> +  */
> +#define RSDP_ADDR_INVALID 0x10 /* RSDP must be below this address */
> +#define RSDP_SLEEP_US 10   /* Sleep for 100ms between tries */
> +#define RSDP_TRIES_MAX100  /* Max total time is 10 seconds */
> +
> +typedef struct {
> +AcpiTableHeader header;
> +gchar name_op;
> +gchar vcia[4];
> +gchar val_op;
> +uint32_t vcia_val;
> +} QEMU_PACKED VgidTable;

This should be called something resembling "vmcoreinfo table".

> +
> +static uint32_t acpi_find_vcia(void)
> +{
> +uint32_t off;
> +AcpiRsdpDescriptor rsdp_table;
> +uint32_t rsdt;
> +AcpiRsdtDescriptorRev1 rsdt_table;
> +int tables_nr;
> +uint32_t *tables;
> +AcpiTableHeader ssdt_table;
> +VgidTable vgid_table;

ditto.

With these fixed:

Reviewed-by: Laszlo Ersek 

Thanks
Laszlo

> +int i;
> +
> +/* Tables may take a short time to be set up by the guest */
> +for (i = 0; i < RSDP_TRIES_MAX; i++) {
> +off = acpi_find_rsdp_address();
> +if (off < RSDP_ADDR_INVALID) {
> +break;
> +}
> +g_usleep(RSDP_SLEEP_US);
> +}
> +g_assert_cmphex(off, <, RSDP_ADDR_INVALID);
> +
> +acpi_parse_rsdp_table(off, &rsdp_table);
> +
> +rsdt = rsdp_table.rsdt_physical_address;
> +/* read the header */
> +ACPI_READ_TABLE_HEADER(&rsdt_table, rsdt);
> +ACPI_ASSERT_CMP(rsdt_table.signature, "RSDT");
> +
> +/* compute the table entries in rsdt */
> +tables_nr = (rsdt_table.length - sizeof(AcpiRsdtDescriptorRev1)) /
> +sizeof(uint32_t);
> +g_assert_cmpint(tables_nr, >, 0);
> +
> +/* get the addresses of the tables pointed by rsdt */
> +tables = g_new0(uint32_t, tables_nr);
> +ACPI_READ_ARRAY_PTR(tables, tables_nr, rsdt);
> +
> +for (i = 0; i < tables_nr; i++) {
> +ACPI_READ_TABLE_HEADER(&ssdt_table, tables[i]);
> +if (!strncmp((char *)ssdt_table.oem_table_id, "VMCOREIN", 8)) {
> +/* the first entry in the table should be VCIA
> + * That's all we need
> + */
> +ACPI_READ_FIELD(vgid_table.name_op, tables[i]);
> +g_assert(vgid_table.name_op == 0x08);  /* name */
> +ACPI_READ_ARRAY(vgid_table.vcia, tables[i]);
> +g_assert(memcmp(vgid_table.vcia, "VCIA", 4) == 0);
> +ACPI_READ_FIELD(vgid_table.val_op, tables[i]);
> +g_assert(vgid_table.val_op == 0x0C);  /* dword */
> +ACPI_READ_FIELD(vgid_table.vcia_val, tables[i]);
> +/* The GUID is written at a fixed offset into the fw_cfg file
> + * in order to implement the "OVMF SDT Header probe suppressor"
> + * see docs/specs/vmgenid.txt for more details
> + */
> +g_free(tables);
> +return vgid_table.vcia_val;
> +}
> +}
> +g_free(tables);
> +return 0;
> +}
> +
> +static void vmcoreinfo_test(void)
> +{
> +gchar *cmd;
> +uint32_t vmci_addr;

Re: [Qemu-devel] [PATCH v2 4/7] dump: add vmcoreinfo ELF note

2017-07-06 Thread Laszlo Ersek
On 07/06/17 12:16, Marc-André Lureau wrote:
> Read the vmcoreinfo ELF PT_NOTE from guest memory when vmcoreinfo
> device provides the location, and write it as an ELF note in the dump.
> 
> There are now 2 possible sources of phys_base information.
> 
> (1) arch guessed value from cpu_dump_info_get()

I recommend using the clipboard; the function is still called
cpu_get_dump_info() :)

> (2) vmcoreinfo ELF note NUMBER(phys_base)= field
> 
> NUMBER(phys_base) in vmcoreinfo has only been recently introduced
> in Linux 4.10 (401721ecd1dc "kexec: export the value of phys_base
> instead of symbol address").
> 
> Since (2) has better chances to be accurate, the guessed value is
> replaced by the value from the vmcoreinfo ELF note.
> 
> The phys_base value is stored in the same dump field locations as
> before, and may duplicate the information available in the vmcoreinfo
> ELF PT_NOTE. Crash tools should be prepared to handle this case.
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  include/sysemu/dump.h |   2 +
>  dump.c| 125 
> ++
>  2 files changed, 127 insertions(+)
> 
> diff --git a/include/sysemu/dump.h b/include/sysemu/dump.h
> index 2672a15f8b..111a7dcaa4 100644
> --- a/include/sysemu/dump.h
> +++ b/include/sysemu/dump.h
> @@ -192,6 +192,8 @@ typedef struct DumpState {
>* this could be used to calculate
>* how much work we have
>* finished. */
> +uint8_t *vmcoreinfo; /* ELF note content */
> +size_t vmcoreinfo_size;
>  } DumpState;
>  
>  uint16_t cpu_to_dump16(DumpState *s, uint16_t val);
> diff --git a/dump.c b/dump.c
> index d9090a24cc..f699198204 100644
> --- a/dump.c
> +++ b/dump.c
> @@ -26,6 +26,8 @@
>  #include "qapi/qmp/qerror.h"
>  #include "qmp-commands.h"
>  #include "qapi-event.h"
> +#include "qemu/error-report.h"
> +#include "hw/acpi/vmcoreinfo.h"
>  
>  #include 
>  #ifdef CONFIG_LZO
> @@ -38,6 +40,11 @@
>  #define ELF_MACHINE_UNAME "Unknown"
>  #endif
>  
> +#define ELF_NOTE_SIZE(hdr_size, name_size, desc_size)   \
> +((DIV_ROUND_UP((hdr_size), 4) + \
> +  DIV_ROUND_UP((name_size), 4) +\
> +  DIV_ROUND_UP((desc_size), 4)) * 4)
> +
>  uint16_t cpu_to_dump16(DumpState *s, uint16_t val)
>  {
>  if (s->dump_info.d_endian == ELFDATA2LSB) {
> @@ -76,6 +83,8 @@ static int dump_cleanup(DumpState *s)
>  guest_phys_blocks_free(&s->guest_phys_blocks);
>  memory_mapping_list_free(&s->list);
>  close(s->fd);
> +g_free(s->vmcoreinfo);
> +s->vmcoreinfo = NULL;
>  if (s->resume) {
>  if (s->detached) {
>  qemu_mutex_lock_iothread();
> @@ -235,6 +244,19 @@ static inline int cpu_index(CPUState *cpu)
>  return cpu->cpu_index + 1;
>  }
>  
> +static void write_vmcoreinfo_note(WriteCoreDumpFunction f, DumpState *s,
> +  Error **errp)
> +{
> +int ret;
> +
> +if (s->vmcoreinfo) {
> +ret = f(s->vmcoreinfo, s->vmcoreinfo_size, s);
> +if (ret < 0) {
> +error_setg(errp, "dump: failed to write vmcoreinfo");
> +}
> +}
> +}
> +
>  static void write_elf64_notes(WriteCoreDumpFunction f, DumpState *s,
>Error **errp)
>  {
> @@ -258,6 +280,8 @@ static void write_elf64_notes(WriteCoreDumpFunction f, 
> DumpState *s,
>  return;
>  }
>  }
> +
> +write_vmcoreinfo_note(f, s, errp);
>  }
>  
>  static void write_elf32_note(DumpState *s, Error **errp)
> @@ -303,6 +327,8 @@ static void write_elf32_notes(WriteCoreDumpFunction f, 
> DumpState *s,
>  return;
>  }
>  }
> +
> +write_vmcoreinfo_note(f, s, errp);
>  }
>  
>  static void write_elf_section(DumpState *s, int type, Error **errp)
> @@ -714,6 +740,44 @@ static int buf_write_note(const void *buf, size_t size, 
> void *opaque)
>  return 0;
>  }
>  
> +/*
> + * This function retrieves various sizes from an elf header.
> + *
> + * @note has to be a valid ELF note. The return sizes are unmodified
> + * (not padded or rounded up to be multiple of 4).
> + */
> +static void get_note_sizes(DumpState *s, const void *note,
> +   uint64_t *note_head_size,
> +   uint64_t *name_size,
> +   uint64_t *desc_size)
> +{
> +uint64_t note_head_sz;
> +uint64_t name_sz;
> +uint64_t desc_sz;
> +
> +if (s->dump_info.d_class == ELFCLASS64) {
> +const Elf64_Nhdr *hdr = note;
> +note_head_sz = sizeof(Elf64_Nhdr);
> +name_sz = tswap64(hdr->n_namesz);
> +desc_sz = tswap64(hdr->n_descsz);
> +} else {
> +const Elf32_Nhdr *hdr = note;
> +note_head_sz = sizeof(Elf32_Nhdr);
> +name_sz = tswap32(hdr->n_namesz);
> +desc_sz = tswap32(hdr->n_descsz);
> +}
> +
> +if (note_head_size) {
> +

Re: [Qemu-devel] [PATCH v2 7/7] MAINTAINERS: add Dump maintainers

2017-07-06 Thread Laszlo Ersek
On 07/06/17 12:16, Marc-André Lureau wrote:
> Proposing myself, since I have some familiarity with the code now.
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  MAINTAINERS | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 839f7ca063..ba17ce5b85 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1272,6 +1272,15 @@ S: Maintained
>  F: device_tree.c
>  F: include/sysemu/device_tree.h
>  
> +Dump
> +S: Supported
> +M: Marc-André Lureau 
> +F: dump.c
> +F: stubs/dump.c
> +F: include/sysemu/dump.h
> +F: include/sysemu/dump-arch.h
> +F: scripts/dump-guest-memory.py
> +
>  Error reporting
>  M: Markus Armbruster 
>  S: Supported
> 

Acked-by: Laszlo Ersek 



Re: [Qemu-devel] [PATCH v2 5/7] kdump: add vmcoreinfo ELF note

2017-07-06 Thread Laszlo Ersek
On 07/06/17 12:16, Marc-André Lureau wrote:
> kdump header provides offset and size of the vmcoreinfo ELF note,
> append it if available.
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  dump.c | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/dump.c b/dump.c
> index f699198204..dd416ad271 100644
> --- a/dump.c
> +++ b/dump.c
> @@ -839,6 +839,16 @@ static void create_header32(DumpState *s, Error **errp)
>  kh->dump_level = cpu_to_dump32(s, DUMP_LEVEL);
>  
>  offset_note = DISKDUMP_HEADER_BLOCKS * block_size + size;
> +if (s->vmcoreinfo) {
> +uint64_t hsize, name_size, size_vmcoreinfo_desc, offset_vmcoreinfo;
> +
> +get_note_sizes(s, s->vmcoreinfo, &hsize, &name_size, 
> &size_vmcoreinfo_desc);
> +offset_vmcoreinfo = offset_note + s->note_size - s->vmcoreinfo_size +
> +(DIV_ROUND_UP(hsize, 4) + DIV_ROUND_UP(name_size, 4)) * 4;
> +kh->offset_vmcoreinfo = cpu_to_dump64(s, offset_vmcoreinfo);
> +kh->size_vmcoreinfo = cpu_to_dump32(s, size_vmcoreinfo_desc);
> +}
> +
>  kh->offset_note = cpu_to_dump64(s, offset_note);
>  kh->note_size = cpu_to_dump32(s, s->note_size);
>  
> @@ -939,6 +949,16 @@ static void create_header64(DumpState *s, Error **errp)
>  kh->dump_level = cpu_to_dump32(s, DUMP_LEVEL);
>  
>  offset_note = DISKDUMP_HEADER_BLOCKS * block_size + size;
> +if (s->vmcoreinfo) {
> +uint64_t hsize, name_size, size_vmcoreinfo_desc, offset_vmcoreinfo;
> +
> +get_note_sizes(s, s->vmcoreinfo, &hsize, &name_size, 
> &size_vmcoreinfo_desc);
> +offset_vmcoreinfo = offset_note + s->note_size - s->vmcoreinfo_size +
> +(DIV_ROUND_UP(hsize, 4) + DIV_ROUND_UP(name_size, 4)) * 4;
> +kh->offset_vmcoreinfo = cpu_to_dump64(s, offset_vmcoreinfo);
> +kh->size_vmcoreinfo = cpu_to_dump64(s, size_vmcoreinfo_desc);
> +}
> +
>  kh->offset_note = cpu_to_dump64(s, offset_note);
>  kh->note_size = cpu_to_dump64(s, s->note_size);
>  
> 

I continue to think that this patch is unnecessary, but if you insist,
it does look OK to me.

Reviewed-by: Laszlo Ersek 
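The offset arithmetic in the quoted hunks can be illustrated in isolation. A sketch with made-up numbers (not the actual QEMU code; `div_round_up` stands in for QEMU's DIV_ROUND_UP macro):

```python
def div_round_up(n, d):
    return (n + d - 1) // d

def vmcoreinfo_desc_offset(offset_note, note_size, vmcoreinfo_size,
                           hdr_size, name_size):
    # The vmcoreinfo note is written last into the notes region, so it
    # starts at offset_note + note_size - vmcoreinfo_size; its desc data
    # then follows the 4-byte-padded note header and name.
    note_start = offset_note + note_size - vmcoreinfo_size
    return note_start + (div_round_up(hdr_size, 4)
                         + div_round_up(name_size, 4)) * 4

# Made-up example values: Elf64_Nhdr is 12 bytes, "VMCOREINFO\0" is 11.
off = vmcoreinfo_desc_offset(offset_note=4096, note_size=512,
                             vmcoreinfo_size=120, hdr_size=12, name_size=11)
```

This is exactly what kh->offset_vmcoreinfo ends up pointing at: the desc payload inside the last note, skipping the padded header and name.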



Re: [Qemu-devel] [PATCH v2 6/7] scripts/dump-guest-memory.py: add vmcoreinfo

2017-07-06 Thread Laszlo Ersek
On 07/06/17 12:16, Marc-André Lureau wrote:
> Add vmcoreinfo ELF note if vmcoreinfo device is ready.
> 
> To help the python script, add a little global vmcoreinfo_gdb
> structure, that is populated with vmcoreinfo_gdb_update().
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  scripts/dump-guest-memory.py | 40 
>  hw/acpi/vmcoreinfo.c |  3 +++
>  2 files changed, 43 insertions(+)
> 
> diff --git a/scripts/dump-guest-memory.py b/scripts/dump-guest-memory.py
> index f7c6635f15..2dd2ed6983 100644
> --- a/scripts/dump-guest-memory.py
> +++ b/scripts/dump-guest-memory.py
> @@ -14,6 +14,7 @@ the COPYING file in the top-level directory.
>  """
>  
>  import ctypes
> +import struct
>  
>  UINTPTR_T = gdb.lookup_type("uintptr_t")
>  
> @@ -120,6 +121,20 @@ class ELF(object):
>  self.segments[0].p_filesz += ctypes.sizeof(note)
>  self.segments[0].p_memsz += ctypes.sizeof(note)
>  
> +
> +def add_vmcoreinfo_note(self, vmcoreinfo):
> +"""Adds a vmcoreinfo note to the ELF dump."""
> +chead = type(get_arch_note(self.endianness, 0, 0))
> +header = chead.from_buffer_copy(vmcoreinfo[0:ctypes.sizeof(chead)])

Maybe it's obvious to others, but I would have been helped a lot if a
comment had explained that you are creating a fake note (with 0 desc
size and 0 name size) to figure out the size of the note header. And
then you copy that many bytes out of the vmcoreinfo ELF note.

> +note = get_arch_note(self.endianness,
> + header.n_namesz - 1, header.n_descsz)

Why the -1?

... I think I'm giving up here for this method. My python is weak and I
can't follow this too well. Please add some comments.
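For what it's worth, the zero-sized fake-note trick can be spelled out in plain ctypes. The helper below is a hypothetical stand-in for the script's get_arch_note() on little-endian x86-64 (the real helper may differ, e.g. in how it counts the name's trailing NUL, which is presumably where the -1 comes from):

```python
import ctypes
import struct

def make_note_type(namesz, descsz):
    # Hypothetical stand-in for get_arch_note(): a note struct sized for
    # the given name/desc lengths, each variable part padded to 4 bytes.
    class Note(ctypes.LittleEndianStructure):
        _fields_ = [("n_namesz", ctypes.c_uint32),
                    ("n_descsz", ctypes.c_uint32),
                    ("n_type", ctypes.c_uint32),
                    ("n_name", ctypes.c_char * ((namesz + 3) & ~3)),
                    ("n_desc", ctypes.c_ubyte * ((descsz + 3) & ~3))]
    return Note

# A raw "VMCOREINFO" note as it might sit in guest memory:
# namesz=11 ("VMCOREINFO\0"), descsz=8, type=0, name padded to 12.
raw = struct.pack("<III", 11, 8, 0) + b"VMCOREINFO\0\0" + b"\0" * 8

# 1. Parse only the fixed 12-byte header via a zero-sized fake note...
Header = make_note_type(0, 0)
hdr = Header.from_buffer_copy(raw[:ctypes.sizeof(Header)])

# 2. ...then build the real note type from the header's sizes and copy.
Note = make_note_type(hdr.n_namesz, hdr.n_descsz)
note = Note.from_buffer_copy(raw[:ctypes.sizeof(Note)])
```

Comments along these lines in the script itself would address the readability concern raised above.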

More comments below:

> +ctypes.memmove(ctypes.pointer(note), vmcoreinfo, ctypes.sizeof(note))
> +header_size = ctypes.sizeof(note) - header.n_descsz
> +
> +self.notes.append(note)
> +self.segments[0].p_filesz += ctypes.sizeof(note)
> +self.segments[0].p_memsz += ctypes.sizeof(note)
> +
>  def add_segment(self, p_type, p_paddr, p_size):
>  """Adds a segment to the elf."""
>  
> @@ -505,6 +520,30 @@ shape and this command should mostly work."""
>  cur += chunk_size
>  left -= chunk_size
>  
> +def phys_memory_read(self, addr, size):
> +qemu_core = gdb.inferiors()[0]
> +for block in self.guest_phys_blocks:
> +if block["target_start"] <= addr < block["target_end"]:

Although I don't expect a single read to straddle phys-blocks, I would
prefer if you checked (addr + size) -- and not just addr -- against
block["target_end"].

> +haddr = block["host_addr"] + (addr - block["target_start"])
> +return qemu_core.read_memory(haddr, size)
> +
> +def add_vmcoreinfo(self):
> +if not gdb.parse_and_eval("vmcoreinfo_gdb_helper"):
> +return
> +
> +addr = gdb.parse_and_eval("vmcoreinfo_gdb_helper.vmcoreinfo_addr_le")
> +addr = bytes([addr[i] for i in range(4)])
> +addr = struct.unpack("<I", addr)[0]
> +
> +mem = self.phys_memory_read(addr, 16)
> +(version, addr, size) = struct.unpack("<IQI", mem)
> +if version != 0:
> +return
> +
> +vmcoreinfo = self.phys_memory_read(addr, size)
> +if vmcoreinfo:
> +self.elf.add_vmcoreinfo_note(vmcoreinfo.tobytes())
> +
>  def invoke(self, args, from_tty):
>  """Handles command invocation from gdb."""
>  
> @@ -518,6 +557,7 @@ shape and this command should mostly work."""
>  
>  self.elf = ELF(argv[1])
>  self.guest_phys_blocks = get_guest_phys_blocks()
> +self.add_vmcoreinfo()
>  
>  with open(argv[0], "wb") as vmcore:
>  self.dump_init(vmcore)
> diff --git a/hw/acpi/vmcoreinfo.c b/hw/acpi/vmcoreinfo.c
> index 0ea41de8d9..b6bcb47506 100644
> --- a/hw/acpi/vmcoreinfo.c
> +++ b/hw/acpi/vmcoreinfo.c
> @@ -163,6 +163,8 @@ static void vmcoreinfo_handle_reset(void *opaque)
>  memset(vis->vmcoreinfo_addr_le, 0, ARRAY_SIZE(vis->vmcoreinfo_addr_le));
>  }
>  
> +static VMCoreInfoState *vmcoreinfo_gdb_helper;
> +
>  static void vmcoreinfo_realize(DeviceState *dev, Error **errp)
>  {
>  if (!bios_linker_loader_can_write_pointer()) {
> @@ -181,6 +183,7 @@ static void vmcoreinfo_realize(DeviceState *dev, Error 
> **errp)
>  return;
>  }
>  
> +vmcoreinfo_gdb_helper = VMCOREINFO(dev);
>  qemu_register_reset(vmcoreinfo_handle_reset, dev);
>  }
>  
> 

I guess we don't build QEMU with link-time optimization at the moment.

With link-time optimization, I think gcc might reasonably optimize away
the assignment to "vmcoreinfo_gdb_helper", and "vmcoreinfo_gdb_helper"
itself. This is why I suggested "volatile":

static VMCoreInfoState * volatile vmcoreinfo_gdb_helper;

Do you think volatile is only superfluous, or do you actively dislike it
for some reason?

Thanks,
Laszlo
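The straddle check requested above is a one-line change to the range test. A sketch (plain data structures and a tuple return stand in for the gdb inferior API; not the actual script code):

```python
def phys_memory_read(blocks, addr, size):
    # Guest-physical to host-address translation with the full-range
    # check: addr + size, not just addr, must fall within the block, so
    # a read can never silently straddle a phys-block boundary.
    for block in blocks:
        if (block["target_start"] <= addr
                and addr + size <= block["target_end"]):
            haddr = block["host_addr"] + (addr - block["target_start"])
            return (haddr, size)  # stand-in for inferior.read_memory()
    return None

# One 4 KiB guest block mapped at a made-up host address.
blocks = [{"target_start": 0x0, "target_end": 0x1000,
           "host_addr": 0x7f0000000000}]
```

With only the original `addr`-based check, a 16-byte read starting at 0xff8 would be dispatched to this block and run past its end; with the full-range check it is rejected instead.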



Re: [Qemu-devel] [PATCH v2 5/7] kdump: add vmcoreinfo ELF note

2017-07-06 Thread Laszlo Ersek
On 07/06/17 20:09, Dave Anderson wrote:
> 
> 
> - Original Message -
>> Hi
>>
>> On Thu, Jul 6, 2017 at 7:13 PM, Laszlo Ersek  wrote:
>>> On 07/06/17 12:16, Marc-André Lureau wrote:
>>>> kdump header provides offset and size of the vmcoreinfo ELF note,
>>>> append it if available.
>>>>
>>>> Signed-off-by: Marc-André Lureau 
>>>> ---
>>>>  dump.c | 20 
>>>>  1 file changed, 20 insertions(+)
>>>>
>>>> diff --git a/dump.c b/dump.c
>>>> index f699198204..dd416ad271 100644
>>>> --- a/dump.c
>>>> +++ b/dump.c
>>>> @@ -839,6 +839,16 @@ static void create_header32(DumpState *s, Error
>>>> **errp)
>>>>  kh->dump_level = cpu_to_dump32(s, DUMP_LEVEL);
>>>>
>>>>  offset_note = DISKDUMP_HEADER_BLOCKS * block_size + size;
>>>> +if (s->vmcoreinfo) {
>>>> +uint64_t hsize, name_size, size_vmcoreinfo_desc,
>>>> offset_vmcoreinfo;
>>>> +
>>>> +get_note_sizes(s, s->vmcoreinfo, &hsize, &name_size,
>>>> &size_vmcoreinfo_desc);
>>>> +offset_vmcoreinfo = offset_note + s->note_size -
>>>> s->vmcoreinfo_size +
>>>> +(DIV_ROUND_UP(hsize, 4) + DIV_ROUND_UP(name_size, 4)) * 4;
>>>> +kh->offset_vmcoreinfo = cpu_to_dump64(s, offset_vmcoreinfo);
>>>> +kh->size_vmcoreinfo = cpu_to_dump32(s, size_vmcoreinfo_desc);
>>>> +}
>>>> +
>>>>  kh->offset_note = cpu_to_dump64(s, offset_note);
>>>>  kh->note_size = cpu_to_dump32(s, s->note_size);
>>>>
>>>> @@ -939,6 +949,16 @@ static void create_header64(DumpState *s, Error
>>>> **errp)
>>>>  kh->dump_level = cpu_to_dump32(s, DUMP_LEVEL);
>>>>
>>>>  offset_note = DISKDUMP_HEADER_BLOCKS * block_size + size;
>>>> +if (s->vmcoreinfo) {
>>>> +uint64_t hsize, name_size, size_vmcoreinfo_desc,
>>>> offset_vmcoreinfo;
>>>> +
>>>> +get_note_sizes(s, s->vmcoreinfo, &hsize, &name_size,
>>>> &size_vmcoreinfo_desc);
>>>> +offset_vmcoreinfo = offset_note + s->note_size -
>>>> s->vmcoreinfo_size +
>>>> +(DIV_ROUND_UP(hsize, 4) + DIV_ROUND_UP(name_size, 4)) * 4;
>>>> +kh->offset_vmcoreinfo = cpu_to_dump64(s, offset_vmcoreinfo);
>>>> +kh->size_vmcoreinfo = cpu_to_dump64(s, size_vmcoreinfo_desc);
>>>> +}
>>>> +
>>>>  kh->offset_note = cpu_to_dump64(s, offset_note);
>>>>  kh->note_size = cpu_to_dump64(s, s->note_size);
>>>>
>>>>
>>>
>>> I continue to think that this patch is unnecessary, but if you insist,
> 
>>> it does look OK to me.
>>>
>>> Reviewed-by: Laszlo Ersek 
>>
>> Without it, crash doesn't read the vmcoreinfo PT_NOTE. And for some
>> reason, the phys_base in the header wasn't enough (to be
>> double-checked).
>>
>> Any comment Dave about crash handling of vmcoreinfo in kdump files?
> 
> It just reads the kdump_sub_header's offset_vmcoreinfo and size_vmcoreinfo
> fields to gather the vmcoreinfo data into a local buffer of memory, and
> scans the strings for whatever it's looking for. 
> 
> With respect to phys_base, the only thing that might be of consequence
> is this fairly recent change that's currently only in the github repo,
> queued for crash-7.2.0:
> 
>   commit a4a538caca140a8e948bbdae2be311168db7a1eb
>   Author: Dave Anderson 
>   Date:   Tue May 2 16:51:53 2017 -0400
> 
> Fix for Linux 4.10 and later kdump dumpfiles, or kernels that have
> backported commit 401721ecd1dcb0a428aa5d6832ee05ffbdbffbbe, titled
> "kexec: export the value of phys_base instead of symbol address".
> Without the patch, if the x86_64 "phys_base" value in the VMCOREINFO
> note is a negative decimal number, the crash session fails during
> session intialization with a "page excluded" or "seek error" when
> reading "page_offset_base".
> (ander...@redhat.com)
> 
> Also, crash-7.1.9 was the first version that started looking in the 
> vmcoreinfo data for phys_base instead of in the kdump_sub_header.

OK, if crash (or earlier versions of crash) need this QEMU patch, then
I'm fine with it -- my R-b stands.

Thanks
Laszlo



Re: [Qemu-devel] [PATCH v3 1/4] ACPI: Add APEI GHES Table Generation support

2017-07-07 Thread Laszlo Ersek
On 07/07/17 10:32, gengdongjiu wrote:
> Hi Laszlo,
>sorry for my late response.
> 
> On 2017/6/3 20:01, Laszlo Ersek wrote:
>> On 05/22/17 16:23, Laszlo Ersek wrote:
>>> Keeping some context:
>>>
>>> On 05/12/17 23:00, Laszlo Ersek wrote:
>>>> On 04/30/17 07:35, Dongjiu Geng wrote:
>>
>>> (68) In the code below, you are not taking an "OVMF header probe
>>> suppressor" into account.
>>>
>>> But, we have already planned to replace that quirk with a separate,
>>> dedicated allocation hint or command, so I'm not going to describe what
>>> an "OVMF header probe suppressor" is; instead, I'll describe the
>>> replacement for it.
>>>
>>> [...]
>>
>> So, the NOACPI allocation hint is a no-go at the moment, based on the
>> discussion in the following threads:
>>
>> http://mid.mail-archive.com/20170601112241.2580-1-ard.biesheuvel@linaro.org
>>
>> http://mid.mail-archive.com/c76b36de-ebf9-c662-d454-0a95b43901e8@redhat.com
>>
>> Therefore the header probe suppression remains necessary.
>>
>> In this case, it is not hard to do, you just have to reorder the
>> following two ADD_POINTER additions a bit:
>  Ok, it is no problem.
> 
>>
>>>>> +
>>>>> +bios_linker_loader_add_pointer(linker, GHES_ERRORS_FW_CFG_FILE,
>>>>> +sizeof(uint64_t) * i, sizeof(uint64_t),
>>>>> +GHES_ERRORS_FW_CFG_FILE,
>>>>> +MAX_ERROR_SOURCE_COUNT_V6 * 
>>>>> sizeof(uint64_t) +
>>>>> +i * MAX_RAW_DATA_LENGTH);
>>
>> This one should be moved out to a separate loop, after the current loop.
>>
>>>>> +bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
>>>>> +address_registers_offset
>>>>> ++ i * sizeof(AcpiGenericHardwareErrorSource),
>>>>> +sizeof(uint32_t), GHES_ERRORS_FW_CFG_FILE,
>>>>> +i * sizeof(uint64_t));
>>
>> This one should be kept in the first (i.e., current) loop.
>>
>> The idea is, when you first point the HEST/GHES_n entries in
>> ACPI_BUILD_TABLE_FILE to the "address registers" in
>> GHES_ERRORS_FW_CFG_FILE, all those address registers will still be
>> zero-filled. This will fail the ACPI table header probe in
>> OvmfPkg/AcpiPlatformDxe, which is what we want.
>>
>> After this is done, the address registers in GHES_ERRORS_FW_CFG_FILE
>> should be pointed to the error status data blocks in the same fw_cfg
>> blob, in a separate loop. (Those error status data blocks will again be
>> zero-filled, so no ACPI table headers will be mistakenly recognized in
>> them.)
> I understand your idea. But I have a question:
> how about we exchange the two functions' places, as shown below?
> Then it still meets our needs, and the change is easy.
> For every loop:
> (1)when patch address in ACPI_BUILD_TABLE_FILE to the "address registers", 
> the address register is zero-filed.
> (2)when patch address in GHES_ERRORS_FW_CFG_FILE to the error status data 
> blocks, the error status data block is still zero-filed.
> 
> for (i = 0; i < GHES_ACPI_HEST_NOTIFY_RESERVED; i++) {
> .
> bios_linker_loader_add_pointer(linker, ACPI_BUILD_TABLE_FILE,
> address_registers_offset
> + i * sizeof(AcpiGenericHardwareErrorSource),
> sizeof(uint32_t), GHES_ERRORS_FW_CFG_FILE,
> i * sizeof(uint64_t));
> 
> 
> bios_linker_loader_add_pointer(linker, GHES_ERRORS_FW_CFG_FILE,
> sizeof(uint64_t) * i, sizeof(uint64_t),
> GHES_ERRORS_FW_CFG_FILE,
> MAX_ERROR_SOURCE_COUNT_V6 * sizeof(uint64_t) +
> i * MAX_RAW_DATA_LENGTH);
>   .
> 
>  }

Your suggestion seems to do the same, but there is a subtle difference.

When the firmware scans the targets of the ADD_POINTER commands for byte
sequences that "look like" an ACPI table header, in order to suppress
that probe, we should keep 36 bytes (i.e., the size of the ACPI table
header structure) zeroed at the target location.

* In the patch, as posted, this was not the case, because it first
filled in the address register inside the GHES_ERRORS_FW_CFG

Re: [Qemu-devel] [PATCH v2 2/7] acpi: add vmcoreinfo device

2017-07-07 Thread Laszlo Ersek
Marc-André,

sorry about this, but I have another late comment (I should have pointed
this out in v1). Regarding:

On 07/06/17 12:16, Marc-André Lureau wrote:

> +#define VMCOREINFO_OFFSET   40   /* allow space for
> +  * OVMF SDT Header Probe Suppressor
> +  */

and

> +Method (ADDR, 0, NotSerialized)
> +{
> +Local0 = Package (0x02) {}
> +Local0 [Zero] = (VCIA + 0x28)
> +Local0 [One] = Zero
> +Return (Local0)
> +}

and

> +In order to implement an OVMF "SDT Header Probe Suppressor", the contents of
> +the vmcoreinfo blob has 40 bytes of padding:
> +
> ++-------------------------------------+
> +| SSDT with OEM Table ID = VMCOREINFO |
> ++-------------------------------------+
> +| ...                                 |   TOP OF PAGE
> +| VCIA dword object ------------------|-> +---------------------------+
> +| ...                                 |   | fw-allocated array for    |
> +| _STA method referring to VCIA       |   | "etc/vmcoreinfo"          |
> +| ...                                 |   +---------------------------+
> +| ADDR method referring to VCIA       |   |  0: OVMF SDT Header probe |
> +| ...                                 |   |     suppressor            |
> ++-------------------------------------+   | 40: uint32 version field  |
> +                                          | 44: info contents         |
> +                                          |                           |
> +                                          +---------------------------+
> +                                          END OF PAGE

Please define the VMCOREINFO_OFFSET macro like this:

  #define VMCOREINFO_OFFSET sizeof(AcpiTableHeader)

and then please refresh the documentation as well:
- replace decimal "40" with "36",
- replace 0x28 in the dumped SSDT with 0x24.

Namely, in order to suppress the OVMF ACPI Table Header Probe -- "SDT"
simply means "System Description Table" --, it's enough to add
zero-padding that *precisely* covers such a header.

Given that we have a typedef for this header in QEMU, we should use it
in the definition of VMCOREINFO_OFFSET.

Now, the reason why VMGENID uses 40 instead comes from another
requirement: the VMGENID GUID has to be aligned at 8 bytes. (See
requirement "R1a" in "docs/specs/vmgenid.txt".) Therefore the directly
necessary 36 bytes of padding are rounded up to 40. See again
"docs/specs/vmgenid.txt":

+----------------------------------+
| SSDT with OEM Table ID = VMGENID |
+----------------------------------+
| ...                              |   TOP OF PAGE
| VGIA dword object ---------------|-> +---------------------------+
| ...                              |   | fw-allocated array for    |
| _STA method referring to VGIA    |   | "etc/vmgenid_guid"        |
| ...                              |   +---------------------------+
| ADDR method referring to VGIA    |   |  0: OVMF SDT Header probe |
| ...                              |   |     suppressor            |
+----------------------------------+   | 36: padding for 8-byte    |
                                       |     alignment             |
                                       | 40: GUID                  |
                                       | 56: padding to page size  |
                                       +---------------------------+
                                       END OF PAGE

At offset 36 of "etc/vmgenid_guid", it says "padding for 8-byte alignment".

For VMCOREINFO, you don't need this extra alignment to 8 bytes, before
the "version" field is listed. So please make the VMCOREINFO_OFFSET
reflect what's actually necessary.

The rest of the code doesn't have to be modified (including your
experimental guest kernel driver); changing VMCOREINFO_OFFSET will
update all necessary locations automatically, including the generated
AML. Only the docs have to be synced manually.

... In fact, there *is* yet another location to update: the test case. I
suggest to include "hw/acpi/vmcoreinfo.h" in "tests/vmcoreinfo-test.c",
rather than open-code VMCOREINFO_OFFSET there. ... Oh wait: in the test
case you don't even use VMCOREINFO_OFFSET for anything. So just delete
the macro definition from that patch.

Thank you (and sorry about the churn),
Laszlo



Re: [Qemu-devel] [PATCH v2 2/7] acpi: add vmcoreinfo device

2017-07-07 Thread Laszlo Ersek
On 07/07/17 15:13, Laszlo Ersek wrote:
> Marc-André,
> 
> sorry about this, but I have another late comment (I should have pointed
> this out in v1). Regarding:
> 
> On 07/06/17 12:16, Marc-André Lureau wrote:
> 
>> +#define VMCOREINFO_OFFSET   40   /* allow space for
>> +  * OVMF SDT Header Probe Suppressor
>> +  */
> 
> and
> 
>> +Method (ADDR, 0, NotSerialized)
>> +{
>> +Local0 = Package (0x02) {}
>> +Local0 [Zero] = (VCIA + 0x28)
>> +Local0 [One] = Zero
>> +Return (Local0)
>> +}
> 
> and
> 
>> +In order to implement an OVMF "SDT Header Probe Suppressor", the contents of
>> +the vmcoreinfo blob has 40 bytes of padding:
>> +
>> ++-------------------------------------+
>> +| SSDT with OEM Table ID = VMCOREINFO |
>> ++-------------------------------------+
>> +| ...                                 |   TOP OF PAGE
>> +| VCIA dword object ------------------|-> +---------------------------+
>> +| ...                                 |   | fw-allocated array for    |
>> +| _STA method referring to VCIA       |   | "etc/vmcoreinfo"          |
>> +| ...                                 |   +---------------------------+
>> +| ADDR method referring to VCIA       |   |  0: OVMF SDT Header probe |
>> +| ...                                 |   |     suppressor            |
>> ++-------------------------------------+   | 40: uint32 version field  |
>> +                                          | 44: info contents         |
>> +                                          |                           |
>> +                                          +---------------------------+
>> +                                          END OF PAGE
> 
> Please define the VMCOREINFO_OFFSET macro like this:
> 
>   #define VMCOREINFO_OFFSET sizeof(AcpiTableHeader)
> 
> and then please refresh the documentation as well:
> - replace decimal "40" with "36",

also replace decimal "44", the offset of "info contents", with "40"...
you get the idea.

Thanks
Laszlo

> - replace 0x28 in the dumped SSDT with 0x24.
> 
> Namely, in order to suppress the OVMF ACPI Table Header Probe -- "SDT"
> simply means "System Description Table" --, it's enough to add
> zero-padding that *precisely* covers such a header.
> 
> Given that we have a typedef for this header in QEMU, we should use it
> in the definition of VMCOREINFO_OFFSET.
> 
> Now, the reason why VMGENID uses 40 instead comes from another
> requirement: the VMGENID GUID has to be aligned at 8 bytes. (See
> requirement "R1a" in "docs/specs/vmgenid.txt".) Therefore the directly
> necessary 36 bytes of padding are rounded up to 40. See again
> "docs/specs/vmgenid.txt":
> 
> +----------------------------------+
> | SSDT with OEM Table ID = VMGENID |
> +----------------------------------+
> | ...                              |   TOP OF PAGE
> | VGIA dword object ---------------|-> +---------------------------+
> | ...                              |   | fw-allocated array for    |
> | _STA method referring to VGIA    |   | "etc/vmgenid_guid"        |
> | ...                              |   +---------------------------+
> | ADDR method referring to VGIA    |   |  0: OVMF SDT Header probe |
> | ...                              |   |     suppressor            |
> +----------------------------------+   | 36: padding for 8-byte    |
>                                        |     alignment             |
>                                        | 40: GUID                  |
>                                        | 56: padding to page size  |
>                                        +---------------------------+
>                                        END OF PAGE
> 
> At offset 36 of "etc/vmgenid_guid", it says "padding for 8-byte alignment".
> 
> For VMCOREINFO, you don't need this extra alignment to 8 bytes, before
> the "version" field is listed. So please make the VMCOREINFO_OFFSET
> reflect what's actually necessary.
> 
> The rest of the code doesn't have to be modified (including your
> experimental guest kernel driver); changing VMCOREINFO_OFFSET will
> update all necessary locations automatically, including the generated
> AML. Only the docs have to be synced manually.
> 
> ... In fact, there *is* yet another location to update: the test case. I
> suggest to include "hw/acpi/vmcoreinfo.h" in "tests/vmcoreinfo-test.c",
> rather than open-code VMCOREINFO_OFFSET there. ... Oh wait: in the test
> case you don't even use VMCOREINFO_OFFSET for anything. So just delete
> the macro definition from that patch.
> 
> Thank you (and sorry about the churn),
> Laszlo
> 




Re: [Qemu-devel] [PATCHv7 5/6] fw_cfg: move qdev_init_nofail() from fw_cfg_init1() to callers

2017-07-07 Thread Laszlo Ersek
On 07/07/17 21:54, Eduardo Habkost wrote:
> On Fri, Jul 07, 2017 at 04:30:29PM -0300, Eduardo Habkost wrote:
>> On Fri, Jul 07, 2017 at 08:18:20PM +0200, Igor Mammedov wrote:
>>> On Fri, 7 Jul 2017 12:09:56 -0300
>>> Eduardo Habkost  wrote:
>>>
 On Fri, Jul 07, 2017 at 04:58:17PM +0200, Igor Mammedov wrote:
> On Fri, 7 Jul 2017 10:13:00 -0300
> "Eduardo Habkost"  wrote:
 [...]
>> I don't disagree with adding the assert(), but it looks like
>> making fw_cfg_find() return NULL if there are multiple devices
>> can be useful for realize.
>>
>> In this case, it looks like Mark is relying on that in
>> fw_cfg_common_realize(): if multiple devices are created,
>> fw_cfg_find() will return NULL, and realize will fail.  This
>> sounds like a more graceful way to handle multiple-device
>> creation than crashing on fw_cfg_find().  This is the solution
>> used by find_vmgenid_dev()/vmgenid_realize(), BTW.
>
> I suspect that find_vmgenid_dev() works by luck as it could be
> placed only as /machine/peripheral-anon/foo1 or /machine/peripheral/foo2
>   object_resolve_partial_path() : machine
>object_resolve_partial_path() : peripheral-anon => foo1
>object_resolve_partial_path() : peripheral => foo2
>if (found /* foo2 */) {
>if (obj /* foo1 */) {
>return NULL;

 I don't think this is luck: object_resolve_partial_path() is
 explicitly documented to always return NULL if multiple matches
 are found, and I don't see any bug in its implementation that
 would break that functionality.
>>>
>>> Maybe I'm reading it wrong, but consider following:
>>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg460692.html
>>>
>>> it looks to me that using ambiguous argument is necessary for
>>> duplicate detection to work correctly.
>>
>> Oh, good catch, I think I see the bug now.  We need to write a
>> test case and fix that.
> 
> I could reproduce it with the following test case.
> 
> Signed-off-by: Eduardo Habkost 
> ---
> diff --git a/tests/check-qom-proplist.c b/tests/check-qom-proplist.c
> index 8e432e9..c320cff 100644
> --- a/tests/check-qom-proplist.c
> +++ b/tests/check-qom-proplist.c
> @@ -568,6 +568,29 @@ static void test_dummy_delchild(void)
>  object_unparent(OBJECT(dev));
>  }
>  
> +static void test_qom_partial_path(void)
> +{
> +Object *root = object_get_objects_root();
> +Object *o   = object_new(TYPE_DUMMY);
> +Object *o1_1 = object_new(TYPE_DUMMY);
> +Object *o1_2 = object_new(TYPE_DUMMY);
> +Object *o2   = object_new(TYPE_DUMMY);
> +
> +object_property_add_child(root, "o", o, &error_abort);
> +object_property_add_child(o, "o1", o1_1, &error_abort);
> +object_property_add_child(o, "o2", o1_2, &error_abort);
> +object_property_add_child(root, "o2", o2, &error_abort);
> +
> +g_assert(!object_resolve_path_type("", TYPE_DUMMY, NULL));
> +g_assert(!object_resolve_path("o2", NULL));
> +g_assert(object_resolve_path("o1", NULL) == o1_1);
> +
> +object_unref(o);
> +object_unref(o2);
> +object_unref(o1_1);
> +object_unref(o1_2);
> +}
> +
>  int main(int argc, char **argv)
>  {
>  g_test_init(&argc, &argv, NULL);
> @@ -585,6 +608,7 @@ int main(int argc, char **argv)
>  g_test_add_func("/qom/proplist/getenum", test_dummy_getenum);
>  g_test_add_func("/qom/proplist/iterator", test_dummy_iterator);
>  g_test_add_func("/qom/proplist/delchild", test_dummy_delchild);
> +g_test_add_func("/qom/path/resolve/partial", test_qom_partial_path);
>  
>  return g_test_run();
>  }
> 

Meta comment: variable names like "o" and "l" are terrible (they look
like 0 and 1, dependent on your font). I suggest "obj".

Thanks
Laszlo



Re: [Qemu-devel] [PATCH v4 1/3] ACPI: Add new ACPI structures and macros

2017-07-10 Thread Laszlo Ersek
On 07/10/17 15:54, Eric Blake wrote:
> On 07/10/2017 04:25 AM, Dongjiu Geng wrote:
>> (1) Add related APEI/HEST table structures and  macros, these
>> definition refer to ACPI 6.1 and UEFI 2.6 spec.
> 
> Your mail is missing In-Reply-To: and References: headers, which makes
> it appear as a new top-level thread for each patch.  You'll want to
> figure out why you didn't properly link back to your 0/3 cover letter
> with Message-ID: <1499678736-5244-1-git-send-email-gengdong...@huawei.com>
> 

Yes, the git-send-email options to use are:

  --thread --no-chain-reply-to

Best to set them permanently in one's qemu clone,

$ git config --bool sendemail.thread   true
$ git config --bool sendemail.chainreplyto false

This is called "shallow threading" (see git-send-email(1)).

Thanks
Laszlo



Re: [Qemu-devel] [PATCH v2 6/7] scripts/dump-guest-memory.py: add vmcoreinfo

2017-07-11 Thread Laszlo Ersek
On 07/11/17 12:04, Marc-André Lureau wrote:
> Hi
>
> - Original Message -
>> On 07/06/17 12:16, Marc-André Lureau wrote:
>>> Add vmcoreinfo ELF note if vmcoreinfo device is ready.
>>>
>>> To help the python script, add a little global vmcoreinfo_gdb
>>> structure, that is populated with vmcoreinfo_gdb_update().
>>>
>>> Signed-off-by: Marc-André Lureau 

>>> @@ -181,6 +183,7 @@ static void vmcoreinfo_realize(DeviceState *dev, Error
>>> **errp)
>>>  return;
>>>  }
>>>
>>> +vmcoreinfo_gdb_helper = VMCOREINFO(dev);
>>>  qemu_register_reset(vmcoreinfo_handle_reset, dev);
>>>  }
>>>
>>>
>>
>> I guess we don't build QEMU with link-time optimization at the
>> moment.
>>
>> With link-time optimization, I think gcc might reasonably optimize
>> away the assignment to "vmcoreinfo_gdb_helper", and
>> "vmcoreinfo_gdb_helper" itself. This is why I suggested "volatile":
>>
>> static VMCoreInfoState * volatile vmcoreinfo_gdb_helper;
>>
>> Do you think volatile is only superfluous, or do you actively dislike
>> it for some reason?
>
> Yeah, I am not convinced volatile is the best way, but nor is static.
>
> Let's export it?

Volatile guarantees that the assignment will take place according to the
behavior of the "abstract C machine" described by ISO C. From ISO C99,

  6.7.3 Type qualifiers

  6 An object that has volatile-qualified type may be modified in ways
unknown to the implementation or have other unknown side effects.
Therefore any expression referring to such an object shall be
evaluated strictly according to the rules of the abstract machine,
as described in 5.1.2.3. Furthermore, at every sequence point the
value last stored in the object shall agree with that prescribed by
the abstract machine, except as modified by the unknown factors
mentioned previously. [116] What constitutes an access to an object
that has volatile-qualified type is implementation-defined.

  Footnote 116: A volatile declaration may be used to describe an object
corresponding to a memory-mapped input/output port or an
object accessed by an asynchronously interrupting
function. Actions on objects so declared shall not be
"optimized out" by an implementation or reordered except
as permitted by the rules for evaluating expressions.

So basically if you can find the symbol somehow, it will always have the
right value, due to the volatile qualifier, regardless of the
optimizations the compiler did.

Internal vs. external linkage ("static" vs. "extern") is a different
question; it might affect whether you find the symbol at all. IME,
symbols for objects with internal linkage are preserved, if: (a) you
don't strip the final binary, and (b) gcc doesn't optimize the variable
away.

With link-time / full program optimization, gcc is entitled to optimize
away variables with external linkage too, so "extern" in itself is not
bulletproof. Volatile is more important.

Given volatile, I'm sort of neutral on extern vs. static. However, if
you make the variable extern, that makes abuse from other source files
easier. (The variable should only be looked at with gdb.) This is also
why I originally suggested to limit the scope of even the static
variable to function scope -- this way abuse would completely be
prevented (nothing else could access the variable even from the same
translation unit), and gdb would still find the variable (IME).

Thanks
Laszlo



Re: [Qemu-devel] [PATCH resend v4 0/3] Generate APEI GHES table and dynamically record CPER

2017-07-11 Thread Laszlo Ersek
Hi Dongjiu,

On 07/11/17 08:46, Dongjiu Geng wrote:
> [...]

So my followup is off-topic, but I'd like to point out that the patch /
email threading in this series is still incorrect.

These are the "sent" timestamps on the messages:

Date: Tue, 11 Jul 2017 14:46:19 +0800
Date: Tue, 11 Jul 2017 14:46:31 +0800 (+12 seconds)
Date: Tue, 11 Jul 2017 14:46:41 +0800 (+10 seconds)
Date: Tue, 11 Jul 2017 14:47:00 +0800 (+19 seconds)

This tells me that you are mailing out the patches one by one. That's
not how most people post their patches.

(Side remark: while git-send-email can still get the threading right
with individual posting, for that you would have to provide the first
email's Message-Id individually to all the subsequent commands, and the
emails show that this didn't happen.)

Instead, you should invoke git-send-email with all the messages *at
once*. Then git-send-email can set up the threading automatically. From
git-send-email(1):

> GIT-SEND-EMAIL(1)Git ManualGIT-SEND-EMAIL(1)
>
> NAME
>git-send-email - Send a collection of patches as emails
>
> SYNOPSIS
>git send-email [options] ...
>git send-email --dump-aliases
>
> DESCRIPTION
>Takes the patches given on the command line and emails them
>out. Patches can be specified as files, directories (which
>will send all files in the directory), or directly as a
>revision list. In the last case, any format accepted by git-
>format-patch(1) can be passed to git send-email.

So send the patches with

  git send-email *.patch

or put all the patches into a temporary directory, and run

  git send-email patch-dir/

Thanks,
Laszlo



Re: [Qemu-devel] [PATCH v2 6/7] scripts/dump-guest-memory.py: add vmcoreinfo

2017-07-11 Thread Laszlo Ersek
On 07/11/17 15:35, Marc-André Lureau wrote:
> Hi
> 
> - Original Message -
>> On 07/11/17 12:04, Marc-André Lureau wrote:
>>> Hi
>>>
>>> - Original Message -
 On 07/06/17 12:16, Marc-André Lureau wrote:
> Add vmcoreinfo ELF note if vmcoreinfo device is ready.
>
> To help the python script, add a little global vmcoreinfo_gdb
> structure, that is populated with vmcoreinfo_gdb_update().
>
> Signed-off-by: Marc-André Lureau 
>>
> @@ -181,6 +183,7 @@ static void vmcoreinfo_realize(DeviceState *dev,
> Error
> **errp)
>  return;
>  }
>
> +vmcoreinfo_gdb_helper = VMCOREINFO(dev);
>  qemu_register_reset(vmcoreinfo_handle_reset, dev);
>  }
>
>

 I guess we don't build QEMU with link-time optimization at the
 moment.

 With link-time optimization, I think gcc might reasonably optimize
 away the assignment to "vmcoreinfo_gdb_helper", and
 "vmcoreinfo_gdb_helper" itself. This is why I suggested "volatile":

 static VMCoreInfoState * volatile vmcoreinfo_gdb_helper;

 Do you think volatile is only superfluous, or do you actively dislike
 it for some reason?
>>>
>>> Yeah, I am not convinced volatile is the best way, but nor is static.
>>>
>>> Let's export it?
>>
>> Volatile guarantees that the assignment will take place according to the
>> behavior of the "abstract C machine" described by ISO C. From ISO C99,
>>
>>   6.7.3 Type qualifiers
>>
>>   6 An object that has volatile-qualified type may be modified in ways
>> unknown to the implementation or have other unknown side effects.
>> Therefore any expression referring to such an object shall be
>> evaluated strictly according to the rules of the abstract machine,
>> as described in 5.1.2.3. Furthermore, at every sequence point the
>> value last stored in the object shall agree with that prescribed by
>> the abstract machine, except as modified by the unknown factors
>> mentioned previously. [116] What constitutes an access to an object
>> that has volatile-qualified type is implementation-defined.
>>
>>   Footnote 116: A volatile declaration may be used to describe an object
>> corresponding to a memory-mapped input/output port or an
>> object accessed by an asynchronously interrupting
>> function. Actions on objects so declared shall not be
>> "optimized out" by an implementation or reordered except
>> as permitted by the rules for evaluating expressions.
>>
>> So basically if you can find the symbol somehow, it will always have the
>> right value, due to the volatile qualifier, regardless of the
>> optimizations the compiler did.
>>
>> Internal vs. external linkage ("static" vs. "extern") is a different
>> question; it might affect whether you find the symbol at all. IME,
>> symbols for objects with internal linkage are preserved, if: (a) you
>> don't strip the final binary, and (b) gcc doesn't optimize the variable
>> away.
>>
>> With link-time / full program optimization, gcc is entitled to optimize
>> away variables with external linkage too, so "extern" in itself is not
>> bulletproof. Volatile is more important.
> 
> Ok, you seem confident about this sort of things, I am not, I'll follow you :)
>>
> 
>> Given volatile, I'm sort of neutral on extern vs. static. However, if
>> you make the variable extern, that makes abuse from other source files
>> easier. (The variable should only be looked at with gdb.) This is also
>> why I originally suggested to limit the scope of even the static
>> variable to function scope -- this way abuse would completely be
>> prevented (nothing else could access the variable even from the same
>> translation unit), and gdb would still find the variable (IME).
> 
> How do you access a static inside a function? Gdb can look it up but 
> complains that it's not in current context.
> 

According to ,
the magic incantation is

  function::variable

If this doesn't work (in case the realize function isn't on the stack at
all), then I agree the variable should be made file scope.

Thanks
Laszlo



Re: [Qemu-devel] [PATCH v2 6/7] scripts/dump-guest-memory.py: add vmcoreinfo

2017-07-11 Thread Laszlo Ersek
On 07/11/17 15:58, Marc-André Lureau wrote:

>>> How do you access a static inside a function? Gdb can look it up but
>>> complains that it's not in current context.
>>>
>>
>> According to ,
>> the magic incantation is
>>
>>   function::variable
>>
> 
> Thanks that works! I'll update the patch

Great!

> 
>> If this doesn't work (in case the realize function isn't on the stack at
>> all), then I agree the variable should be made file scope.
>>
> 
> Why would it matter if realize is on the stack? It's likely not :)

I agree the realize function will likely not be on the stack, and I also
agree that it should not matter. However, the gdb docs seemed a bit
unclear to me regarding this, so I wasn't 100% sure without trying.

Thanks
Laszlo



Re: [Qemu-devel] [PATCH v2] specs: Describe the TPM support in QEMU

2017-07-11 Thread Laszlo Ersek
On 07/11/17 16:31, Stefan Berger wrote:
> This patch adds a description of the current TPM support in QEMU
> to the specs.
> 
> Several public specs are referenced via their landing page on the
> trustedcomputinggroup.org website.
> 
> Signed-off-by: Stefan Berger 
> 
> ---
> 
> v1->v2:
>   - fixed typos
>   - added command line for starting an x86_64 VM with TPM passthrough device
>   - added command lines for checks inside the VM
> ---
>  docs/specs/tpm.txt | 124 
> +
>  1 file changed, 124 insertions(+)
>  create mode 100644 docs/specs/tpm.txt

Awesome, thank you very much!

I think I noticed one typo in new text:

> +#> dmesg | grep TCPA
> +[0.00] ACPI: TCP 0x03FFD191C 32 (v02 BOCHS  \
> +BXPCTCPA 001 BXPC 0001)

I think the prefix here should be "ACPI: TCPA"; the letter "A" probably
fell victim to wrapping the line nicely.

Not sure which maintainer will pick up the patch, but I think they can
fix up this typo on their end (assuming no other reviewer asks for v3).

Reviewed-by: Laszlo Ersek 

Thank you again, Stefan!
Laszlo



Re: [Qemu-devel] [PATCH v3 3/7] tests: add simple vmcoreinfo test

2017-07-11 Thread Laszlo Ersek
On 07/11/17 12:30, Marc-André Lureau wrote:
> This test is based off vmgenid test from Ben Warren
> . It simply checks the vmcoreinfo ACPI device
> is present and that the memory region associated can be read.
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  tests/vmcoreinfo-test.c | 127 
> 
>  tests/Makefile.include  |   2 +
>  2 files changed, 129 insertions(+)
>  create mode 100644 tests/vmcoreinfo-test.c

Reviewed-by: Laszlo Ersek 



Re: [Qemu-devel] [PATCH v3 4/7] dump: add vmcoreinfo ELF note

2017-07-11 Thread Laszlo Ersek
> +uint64_t note_head_sz;
> +uint64_t name_sz;
> +uint64_t desc_sz;
> +
> +if (s->dump_info.d_class == ELFCLASS64) {
> +const Elf64_Nhdr *hdr = note;
> +note_head_sz = sizeof(Elf64_Nhdr);
> +name_sz = tswap64(hdr->n_namesz);
> +desc_sz = tswap64(hdr->n_descsz);
> +} else {
> +const Elf32_Nhdr *hdr = note;
> +note_head_sz = sizeof(Elf32_Nhdr);
> +name_sz = tswap32(hdr->n_namesz);
> +desc_sz = tswap32(hdr->n_descsz);
> +}
> +
> +if (note_head_size) {
> +*note_head_size = note_head_sz;
> +}
> +if (name_size) {
> +*name_size = name_sz;
> +}
> +if (desc_size) {
> +*desc_size = desc_sz;
> +}
> +}
> +
>  /* write common header, sub header and elf note to vmcore */
>  static void create_header32(DumpState *s, Error **errp)
>  {
> @@ -1488,10 +1554,40 @@ static int64_t dump_calculate_size(DumpState *s)
>  return total;
>  }
>  
> +static void vmcoreinfo_update_phys_base(DumpState *s)
> +{
> +uint64_t size, note_head_size, name_size, phys_base;
> +char **lines;
> +uint8_t *vmci;
> +size_t i;
> +
> +get_note_sizes(s, s->vmcoreinfo, &note_head_size, &name_size, &size);
> +note_head_size = ROUND_UP(note_head_size, 4);
> +name_size = ROUND_UP(name_size, 4);
> +vmci = s->vmcoreinfo + note_head_size + name_size;
> +*(vmci + size) = '\0';
> +
> +lines = g_strsplit((char *)vmci, "\n", -1);
> +for (i = 0; lines[i]; i++) {
> +if (g_str_has_prefix(lines[i], "NUMBER(phys_base)=")) {
> +if (qemu_strtou64(lines[i] + 18, NULL, 16,
> +  &phys_base) < 0) {
> +error_report("warning: Failed to read NUMBER(phys_base)=");

good change, adding "warning:" :)

> +} else {
> +s->dump_info.phys_base = phys_base;
> +}
> +break;
> +}
> +}
> +
> +g_strfreev(lines);
> +}
> +
>  static void dump_init(DumpState *s, int fd, bool has_format,
>DumpGuestMemoryFormat format, bool paging, bool 
> has_filter,
>int64_t begin, int64_t length, Error **errp)
>  {
> +Object *vmcoreinfo_dev = find_vmcoreinfo_dev();
>  CPUState *cpu;
>  int nr_cpus;
>  Error *err = NULL;
> @@ -1563,6 +1659,44 @@ static void dump_init(DumpState *s, int fd, bool 
> has_format,
>  goto cleanup;
>  }
>  
> +/*
> + * the goal of this block is to (a) update the previously guessed
> + * phys_base, (b) copy the vmcoreinfo note out of the guest. And
> + * that failure to do so is not fatal for dumping.
> + */

The words "And that" are not necessary here, they were only part of my
v2 commentary. Admittedly, my commentary should have been better formulated.

No need to repost because of this, the meaning is not harmed in any way.

> +if (vmcoreinfo_dev) {
> +uint64_t addr, note_head_size, name_size, desc_size;
> +uint32_t size;
> +
> +note_head_size = s->dump_info.d_class == ELFCLASS32 ?
> +sizeof(Elf32_Nhdr) : sizeof(Elf64_Nhdr);
> +
> +if (!vmcoreinfo_get(VMCOREINFO(vmcoreinfo_dev),
> +&addr, &size, &err)) {
> +error_report_err(err);
> +err = NULL;
> +} else if (size < note_head_size || size > MAX_VMCOREINFO_SIZE) {
> +error_report("warning: vmcoreinfo size is invalid: %" PRIu32, 
> size);
> +} else {
> +s->vmcoreinfo = g_malloc(size + 1); /* +1 for adding \0 */
> +cpu_physical_memory_read(addr, s->vmcoreinfo, size);
> +
> +get_note_sizes(s, s->vmcoreinfo, NULL, &name_size, &desc_size);
> +s->vmcoreinfo_size = ELF_NOTE_SIZE(note_head_size, name_size,
> +   desc_size);
> +if (name_size > MAX_VMCOREINFO_SIZE ||
> +desc_size > MAX_VMCOREINFO_SIZE ||
> +s->vmcoreinfo_size > size) {

Nice. ELF_NOTE_SIZE() is safe to use with those uint64_t variables. Even
if the result overflows (and therefore s->vmcoreinfo_size is
mathematically invalid), it is all well-defined in C, and we catch the
problem later, with the new individual checks. It's not a problem that
"s->vmcoreinfo_size" carries an invalid value temporarily.

This way you managed to reuse the same error handling block.

> +error_report("warning: Invalid vmcoreinfo header");
> +g_free(s->vmcoreinfo);
> +s->vmcoreinfo = NULL;
> +} else {
> +vmcoreinfo_update_phys_base(s);
> +s->note_size += s->vmcoreinfo_size;
> +}
> +}
> +}
> +
>  /* get memory mapping */
>  if (paging) {
>  qemu_get_guest_memory_mapping(&s->list, &s->guest_phys_blocks, &err);
> 

Reviewed-by: Laszlo Ersek 

Thanks
Laszlo



Re: [Qemu-devel] [PATCH v3 6/7] scripts/dump-guest-memory.py: add vmcoreinfo

2017-07-11 Thread Laszlo Ersek
On 07/11/17 12:30, Marc-André Lureau wrote:
> Add vmcoreinfo ELF note if vmcoreinfo device is ready.
> 
> To help the python script, add a little global vmcoreinfo_gdb
> structure, that is populated with vmcoreinfo_gdb_update().
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  scripts/dump-guest-memory.py | 46 
> 
>  hw/acpi/vmcoreinfo.c |  3 +++
>  2 files changed, 49 insertions(+)

... I've gotten a bit confused here, but I think this is what happened:
at 12:04 CEST today you commented on the "volatile thing"; at 12:30 CEST
you posted this v3 series, and I only followed up on your v2 comment at
15:25 CEST. So it's no surprise that whatever we discussed there can't
be seen in this patch.

So... IIUC our discussion there, you're going to post a v4 for this,
with a function-scoped, and internal-linkage, "vmcoreinfo_gdb_helper"
variable, also qualifying it "volatile", and accessing it with the
"function::variable" pattern from the python script. Is that about right?

One more comment below (actually two, but for one location):

> diff --git a/scripts/dump-guest-memory.py b/scripts/dump-guest-memory.py
> index f7c6635f15..80730658ae 100644
> --- a/scripts/dump-guest-memory.py
> +++ b/scripts/dump-guest-memory.py
> @@ -14,6 +14,7 @@ the COPYING file in the top-level directory.
>  """
>  
>  import ctypes
> +import struct
>  
>  UINTPTR_T = gdb.lookup_type("uintptr_t")
>  
> @@ -120,6 +121,22 @@ class ELF(object):
>  self.segments[0].p_filesz += ctypes.sizeof(note)
>  self.segments[0].p_memsz += ctypes.sizeof(note)
>  
> +
> +def add_vmcoreinfo_note(self, vmcoreinfo):
> +"""Adds a vmcoreinfo note to the ELF dump."""
> +# compute the header size, and copy that many bytes from the note
> +header = get_arch_note(self.endianness, 0, 0)
> +ctypes.memmove(ctypes.pointer(header),
> +   vmcoreinfo, ctypes.sizeof(header))
> +# now get the full note
> +note = get_arch_note(self.endianness,
> + header.n_namesz - 1, header.n_descsz)
> +ctypes.memmove(ctypes.pointer(note), vmcoreinfo, ctypes.sizeof(note))
> +
> +self.notes.append(note)
> +self.segments[0].p_filesz += ctypes.sizeof(note)
> +self.segments[0].p_memsz += ctypes.sizeof(note)
> +
>  def add_segment(self, p_type, p_paddr, p_size):
>  """Adds a segment to the elf."""
>  
> @@ -505,6 +522,34 @@ shape and this command should mostly work."""
>  cur += chunk_size
>  left -= chunk_size
>  
> +def phys_memory_read(self, addr, size):
> +qemu_core = gdb.inferiors()[0]
> +for block in self.guest_phys_blocks:
> +if block["target_start"] <= addr < block["target_end"] \
> +   and addr + size < block["target_end"]:

Thanks for touching this up, but now I have two more new comments :)

First (and sorry about putting my request unclearly in the v2 review), I
think we need the following, and only the following checks here:
- "addr" against block["target_start"],
- "addr + size" against block["target_end"].

Second, if you are comparing limits of the same kind (that is, inclusive
vs. inclusive, and exclusive vs. exclusive), then equality is valid and
should be accepted. Therefore,

  block["target_start"] <= addr

is correct (exact match is valid), but

  addr + size < block["target_end"]

is incorrect (too strict), because "addr + size" is an exclusive limit
-- same as block["target_end"] -- so equality should again be accepted:

  addr + size <= block["target_end"]

If you clean these up, you can add my

Acked-by: Laszlo Ersek 

but I would still like a real Pythonista to review this patch. Adding
Janosch.

Janosch -- can you please help review this patch?

Thanks,
Laszlo


> +haddr = block["host_addr"] + (addr - block["target_start"])
> +return qemu_core.read_memory(haddr, size)
> +return None
> +
> +def add_vmcoreinfo(self):
> +if not gdb.parse_and_eval("vmcoreinfo_gdb_helper"):
> +return
> +
> +addr = gdb.parse_and_eval("vmcoreinfo_gdb_helper.vmcoreinfo_addr_le")
> +addr = bytes([addr[i] for i in range(4)])
> +addr = struct.unpack(" +
> +mem = self.phys_memory_read(addr, 16)
> +if not mem:
> +return
> +(version, addr, size) =

Re: [Qemu-devel] [PATCH v3 0/7] KASLR kernel dump support

2017-07-11 Thread Laszlo Ersek
On 07/11/17 12:30, Marc-André Lureau wrote:

> v3: from Laszlo review
> - change vmcoreinfo offset to 36
> - reset err to null after report
> - use PRIu32
> - change name_size and desc_size against MAX_VMCOREINFO_SIZE
> - python code simplification
> - check boundaries of blocks in phys_memory_read()
> - fix some vmgi vs vmci names
> - add more comments in code
> - fix comment indentation
> - add r-b tags

I compared patches #1, #2, #5, #7 too, against v2 -- there have been
some justified changes (wherever appropriate); my earlier R-b / A-b tags
stand.

Thanks
Laszlo



Re: [Qemu-devel] [RFC PATCH v2 0/4] Allow RedHat PCI bridges reserve more buses than necessary during init

2017-07-25 Thread Laszlo Ersek
On 07/23/17 00:11, Aleksandr Bezzubikov wrote:
> Now PCI bridges get a bus range number on system init, based on
> currently plugged devices. That's why when one wants to hotplug
> another bridge, it needs its child bus, which the parent is unable to
> provide (speaking about a virtual device). The suggested workaround is
> to have vendor-specific capability in Red Hat PCI bridges that
> contains number of additional bus to reserve on BIOS PCI init. So this
> capability is intended only for pure QEMU->SeaBIOS usage.
>
> Considering all aforesaid, this series is directly connected with
> QEMU RFC series (v2) "Generic PCIE-PCI Bridge".
>
> Although the new PCI capability is supposed to contain various limits
> along with bus number to reserve, now only its full layout is
> proposed, but only bus_reserve field is used in QEMU and BIOS. Limits
> usage is still a subject for implementation, as for now the main goal of
> this series is to provide necessary support from the firmware side for
> PCIE-PCI bridge hotplug.
>
> Changes v1->v2:
> 1. New #define for Red Hat vendor added (addresses Konrad's comment).
> 2. Refactored pci_find_capability function (addresses Marcel's
>comment).
> 3. Capability reworked:
>   - data type added;
>   - reserve space in a structure for IO, memory and
> prefetchable memory limits.
>
>
> Aleksandr Bezzubikov (4):
>   pci: refactor pci_find_capapibilty to get bdf as the first argument
> instead of the whole pci_device
>   pci: add RedHat vendor ID
>   pci: add QEMU-specific PCI capability structure
>   pci: enable RedHat PCI bridges to reserve additional buses on PCI
> init
>
>  src/fw/pciinit.c| 18 ++
>  src/hw/pci_cap.h| 23 +++
>  src/hw/pci_ids.h|  2 ++
>  src/hw/pcidevice.c  | 12 ++--
>  src/hw/pcidevice.h  |  2 +-
>  src/hw/virtio-pci.c |  4 ++--
>  6 files changed, 48 insertions(+), 13 deletions(-)
>  create mode 100644 src/hw/pci_cap.h
>

Coming back from PTO, it's hard for me to follow up on all the comments
that have been made across the v1 and v2 of this RFC series, so I'll
just provide a brain dump here:

(1) Mentioned by Michael: documentation. That's the most important part.
I haven't seen the QEMU patches, so perhaps they already include
documentation. If not, please start this work with adding a detailed
description to QEMU's docs/ or docs/specs/.

There are a number of preexistent documents that might be related, just
search docs/ for filenames with "pci" in them.


(2) Bus range reservation, and hotplugging bridges. What's the
motivation? Our recommendations in "docs/pcie.txt" suggest flat
hierarchies.

If this use case is really necessary, I think it should be covered in
"docs/pcie.txt". In particular it has a consequence for PXB as well
(search "pcie.txt" for "bus_nr") -- if users employ extra root buses,
then the bus number partitions that they specify must account for any
bridges that they plan to hot-plug (and for the bus range reservations
on the cold-plugged bridges behind those extra root buses).


(3) Regarding the contents and the format of the capability structure, I
wrote up my thoughts earlier in

  https://bugzilla.redhat.com/show_bug.cgi?id=1434747#c8

Let me quote it here for ease of commenting:

> (In reply to Gerd Hoffmann from comment #7)
> > So, now that the generic ports are there we can go on figure how to
> > handle this best.  I still think the best way to communicate window
> > size hints would be to use a vendor specific pci capability (instead
> > of setting the desired size on reset).  The information will always
> > be available then and we don't run into initialization order issues.
>
> This seems good to me -- I can't promise 100% without actually trying,
> but I think I should be able to parse the capability list in config
> space for this hint, in the GetResourcePadding() callback.
>
> I propose that we try to handle this issue "holistically", together
> with bug 1434740. We need a method that provides controls for both IO
> and MMIO:
>
> - For IO, we need a mechanism that can prevent *both* firmware *and*
>   Linux from reserving IO for PCI Express ports. I think Marcel's
>   approach in bug 1344299 is sufficient, i.e., set the IO base/limit
>   registers of the bridge to 0 for disabling IO support. And, if not
>   disabled, just go with the default 4KB IO reservation (for both PCI
>   Express ports and legacy PCI bridges, as the latter is documented in
>   the guidelines).
>
> - For MMIO, the vendor specific capability structure should work
>   something like this:
> - if the capability is missing, reserve 2MB, 32-bit,
>   non-prefetchable,
>
> - otherwise, the capability structure should consist of 3 fields
>   (reservation sizes):
> - uint32_t non_prefetchable_32,
> - uint32_t prefetchable_32,
> - uint64_t prefetchable_64,
>
> - of prefetchable_32 and prefetchable_64, at most one may be
>   nonzero (they ar

Re: [Qemu-devel] Commit 77af8a2b95b79699de650965d5228772743efe84 breaks Windows 2000 support

2017-07-25 Thread Laszlo Ersek
On 07/21/17 20:29, Phil Dennis-Jordan wrote:
> On Fri, Jul 21, 2017 at 2:34 PM, Igor Mammedov  wrote:
>> On Fri, 21 Jul 2017 10:23:38 +0100
>> "Daniel P. Berrange"  wrote:
>>
>>> On Fri, Jul 21, 2017 at 11:06:36AM +0200, Igor Mammedov wrote:
 On Thu, 20 Jul 2017 21:29:33 +0200
 Phil Dennis-Jordan  wrote:

> On Thu, Jul 20, 2017 at 6:40 PM, Programmingkid
>  wrote:
>> I noticed that Windows 2000 does not boot up in QEMU recently. After 
>> bisecting the issue I found the offending commit:
 w2k is very ancient (and long time EOLed), I can't even download it from 
 msdn to test
 (oldest available is XP)

 do we really care about it?
>>>
>>> From a Red Hat POV, we don't care about it, because we're only targeting
>>> modern OS in RHEL, but from a QEMU community POV ability to run pretty
>>> much any guest OS you care to find is definitely in scope.
>> As far as someone is willing to maintain it and test it regularly,
>> otherwise it will break someday anyway.
>> (I'm not really willing to do it as I don't have access to w2k and
>> interested in reducing maintainable code, but maybe someone would
>> like to step up, feel free to post patch to amend acpi maintaners)
>>
>> currently option 1 looks like the most compatible approach
>> but there is no way to predict if it will break some other OS
>> and it is not trivial to implement and maintain.
>>
>> CCing Laszlo, to get his opinion if option 1 is viable from
>> old/new OVMF standpoint (is it possible in 2.10 time frame?).
> 
> I've not done a deep investigation yet, but I've put together a really
> quick prototype for the split RSDT/XSDT with 2 FADTs. I tested my
> existing WinXP x86, Win10 x64, and Ubuntu 16.04 x86-64 test images
> with SeaBIOS and they all worked.
> 
> With OVMF, neither Windows 10 nor macOS would boot with this change -
> I don't currently know if that's OVMF's fault or if my prototype is
> broken. I don't have time right now to dig deeper into this, but
> hopefully I can look at it on Monday and also dig out a Win2000 disc
> then as well and test with that.

Thanks for the CC.

With edk2 you cannot install one FADT that is pointed-to by an RSDT
entry and another FADT that is pointed-to by an XSDT entry.

- The two FADT versions are at separate places in guest memory, so the
multiply-pointed-to table handling that we collaborated on last time
does not apply (justifiedly -- these are separate tables, not a single
table being the target of multiple pointers).

- This means that both FADT versions will be passed to
InstallAcpiTable(). The implementation in
"MdeModulePkg/Universal/Acpi/AcpiTableDxe/AcpiTableProtocol.c" handles
both RSDT and XSDT entries automatically (there's no way for OVMF to say
"I want this one linked into RSDT, and that one linked into XSDT"), so
the firstly installed FADT is linked into both root tables, and the
secondly installed FADT is rejected with EFI_ACCESS_DENIED.

- The failure in turn causes OVMF to roll back all the ACPI
linker/loader processing done thus far, and to fall back to the built-in
(very ancient) default tables. It's not surprising that modern OSes
don't boot with those tables.

So approach (1) cannot work with UEFI.

I could imagine approach (2) like this:

- continue with unversioned firmware
- introduce a master switch called "ancient ACPI" vs. "modern ACPI"
- tie this knob to machine types
- factor the "ancient ACPI" stuff out to a separate set of source files
- assign additional maintainers (like Igor suggests) to the "ancient
  ACPI" source files.

I agree with Igor 100% that the "ancient ACPI" stuff has to be
maintained by people that *actually use* OSes that require ancient ACPI.
In commit 77af8a2b95b7 ("hw/i386: Use Rev3 FADT (ACPI 2.0) instead of
Rev1 to improve guest OS support.", 2017-03-15), Phil wrote

  "No regressions became apparent in tests with a range of Windows
   (XP-10)"

In theory, w2k falls within that range. In practice, it is impossible to
test *all* Windows versions against ACPI generator changes, even if you
try to be thorough (which Phil was). One might not even *know about*
"all" Windows versions. So people using w2k and similar should
co-maintain the ACPI stuff and report back with testing on the fly;
otherwise regressions are impossible to avoid.

(Continuous integration covering *all* Windows versions is impossible
for obvious reasons ($$$).)

Thanks
Laszlo

> 
> The prototype patch is at https://github.com/pmj/qemu/tree/xsdt right
> now for anyone curious, or with more time on their hands to test it
> with Win2K and figure out why it's not working with OVMF. I'll try to
> do a proper RFC patch submission on Monday once I have a better handle
> on what's going on.
> 
> I don't have any strong release policy opinions - I'll leave that to
> those with more experience. I'd be disappointed though if we had to
> entirely revert the Rev3 FADT patch for 2.10.
> 
> Ouch. I reckon we have 2 options for fixing this:
>
> 1. 

Re: [Qemu-devel] Commit 77af8a2b95b79699de650965d5228772743efe84 breaks Windows 2000 support

2017-07-26 Thread Laszlo Ersek
Digressing:

On 07/26/17 10:53, Paolo Bonzini wrote:
> On 25/07/2017 23:25, Phil Dennis-Jordan wrote:
>> Thanks for this, Paolo. Very interesting idea.
>>
>> I couldn't get things working initially, but with a few fixups on the
>> SeaBIOS side I can boot both legacy and modern OSes. See comments
>> inline below for details on changes required.
>>
>> Successfully booted (only a brief test):
>> - Windows 2000
>> - Windows XP (32 bit)
>> - Windows 7 (32 bit)
>> - Windows 10 (64 bit, SeaBIOS)
>> - Windows 10 (64 bit, OVMF)
>> - macOS 10.12 (patched OVMF)
>
> Thanks Phil!  You unwittingly tested the compatibility path on all
> these OSes, since my QEMU patch forgot to setup rsdp->length,
> rsdp->revision and the extended checksum.  However, I've now tested
> Windows XP, Linux w/SeaBIOS, Linux w/patched SeaBIOS and Linux w/OVMF.
>
> I've now found out that edk2 contains similar logic.  It uses a PCD (a
> compile-time flag essentially) to choose between ACPI >= 2.0 tables or
> ACPI 1.0-compatible tables.  In the latter case, edk2 takes care of
> producing a v1 FADT if needed (similar to this patch) and linking the
> RSDT to it; otherwise it keeps whatever FADT was provided by platform
> code and produces an XSDT.

Not exactly; the PCD controls whether the EFI_ACPI_TABLE_PROTOCOL will
expose an RSDT, an XSDT, or both (with matching contents). The FADT
always comes from the specific edk2 platform (i.e., OVMF client code),
and it is not translated in any way, regardless of the PCD value.

From "MdeModulePkg/MdeModulePkg.dec":

>   ## Indicates which ACPI versions are targeted by the ACPI tables exposed to 
> the OS
>   #  These values are aligned with the definitions in 
> MdePkg/Include/Protocol/AcpiSystemDescriptionTable.h
>   #   BIT 1 - EFI_ACPI_TABLE_VERSION_1_0B.
>   #   BIT 2 - EFI_ACPI_TABLE_VERSION_2_0.
>   #   BIT 3 - EFI_ACPI_TABLE_VERSION_3_0.
>   #   BIT 4 - EFI_ACPI_TABLE_VERSION_4_0.
>   #   BIT 5 - EFI_ACPI_TABLE_VERSION_5_0.
>   # @Prompt Exposed ACPI table versions.
>   
> gEfiMdeModulePkgTokenSpaceGuid.PcdAcpiExposedTableVersions|0x3E|UINT32|0x0001004c

The expectation is that the specific edk2 platform overrides this PCD at
build time (if necessary), and then goes on (at boot time) to install
ACPI tables -- using EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() -- that
actually match the PCD setting.

From the "MdeModulePkg/Universal/Acpi/AcpiTableDxe/" driver's POV (that
is, from the EFI_ACPI_TABLE_PROTOCOL implementation's POV), the platform
controls *both* the PCD and the actually installed tables like the FADT,
so EFI_ACPI_TABLE_PROTOCOL expects the platform to make these
consistent.

The tiny little problem is that the PCD is a build-time flag, but QEMU
provides the FADT (and friends) at boot time, dynamically, in a format
that is essentially opaque to OVMF. So OVMF is sticking with the default
PCD (see above), resulting in both RSDT and XSDT root tables, regardless
of the contents of the FADT and friends.

A somewhat (but not too much) similar situation is with the SMBIOS
tables. The tables are composed / exported by QEMU over fw_cfg, and OVMF
/ AAVMF have to set some version-like PCDs that match the content:
- PcdSmbiosDocRev
- PcdSmbiosVersion

We do some ugly hacks in OVMF to ensure that these PCDs are set "in
time", before the generic "MdeModulePkg/Universal/SmbiosDxe" --
providing EFI_SMBIOS_PROTOCOL -- starts up and consumes the PCDs.
Namely, we have "OvmfPkg/Library/SmbiosVersionLib" which sets these PCDs
based on fw_cfg, and we link this library via NULL class resolution into
"MdeModulePkg/Universal/SmbiosDxe". So the PCDs will be set up just
before EFI_SMBIOS_PROTOCOL is initialized and provided. In turn,
"OvmfPkg/SmbiosPlatformDxe", which actually calls
EFI_SMBIOS_PROTOCOL.Add() on the tables provided by QEMU, has a depex on
EFI_SMBIOS_PROTOCOL -- first, this depex ensures that
EFI_SMBIOS_PROTOCOL can be used by "OvmfPkg/SmbiosPlatformDxe", but
second, the depex *also* ensures that the PCDs will have been set
correctly by the time "OvmfPkg/SmbiosPlatformDxe" calls
EFI_SMBIOS_PROTOCOL.Add() for the first time.

You might ask why we don't do the same in the ACPI case (i.e., for
PcdAcpiExposedTableVersions). It's due to the following differences:

- (less importantly,) "MdeModulePkg.dec" allows platforms to pick
  "dynamic" for PcdSmbiosDocRev and PcdSmbiosVersion, not just "fixed at
  build". IOW, MdeModulePkg already expects platforms to set the SMBIOS
  version PCDs dynamically, if those platforms can ensure the setting
  occurs "early enough".

- (more importantly,) the information needed by OVMF, for setting the
  SMBIOS version PCDs in "OvmfPkg/Library/SmbiosVersionLib", is readily
  available for parsing from the separate, dedicated fw_cfg file called
  "etc/smbios/smbios-anchor". In fact, OVMF doesn't use this file for
  anything else than grabbing the versions for the PCDs. The actual
  "anchor" table (the smbios entry point) is produced by the
  EFI_SMBIOS_PROTOCOL implementati

Re: [Qemu-devel] [RFC PATCH v2 0/4] Allow RedHat PCI bridges reserve more buses than necessary during init

2017-07-26 Thread Laszlo Ersek
On 07/26/17 08:48, Marcel Apfelbaum wrote:
> On 25/07/2017 18:46, Laszlo Ersek wrote:

[snip]

>> (2) Bus range reservation, and hotplugging bridges. What's the
>> motivation? Our recommendations in "docs/pcie.txt" suggest flat
>> hierarchies.
>>
> 
> It remains flat. You have one single PCIE-PCI bridge plugged
> into a PCIe Root Port, no deep nesting.
> 
> The reason is to be able to support legacy PCI devices without
> "committing" with a DMI-PCI bridge in advance. (Keep Q35 without
> legacy hw.)
> 
> The only way to support PCI devices in Q35 is to have them cold-plugged
> into the pcie.0 bus, which is good, but not enough for expanding the
> Q35 usability in order to make it eventually the default
> QEMU x86 machine (I know this is another discussion and I am in
> minority, at least for now).
> 
> The plan is:
> Start Q35 machine as usual, but one of the PCIe Root Ports includes
> hints for firmware needed to support legacy PCI devices. (IO Ports range,
> extra bus,...)
> 
> Once a pci device is needed you have 2 options:
> 1. Plug a PCIe-PCI bridge into a PCIe Root Port and the PCI device
>in the bridge.
> 2. Hotplug a PCIe-PCI bridge into a PCIe Root Port and then hotplug
>a PCI device into the bridge.

Thank you for the explanation, it makes the intent a lot clearer.

However, what does the hot-pluggability of the PCIe-PCI bridge buy us?
In other words, what does it buy us when we do not add the PCIe-PCI
bridge immediately at guest startup, as an integrated device?

Why is it a problem to "commit" in advance? I understand that we might
not like the DMI-PCI bridge (due to it being legacy), but what speaks
against cold-plugging the PCIe-PCI bridge either as an integrated device
in pcie.0 (assuming that is permitted), or cold-plugging the PCIe-PCI
bridge in a similarly cold-plugged PCIe root port?

I mean, in the cold-plugged case, you use up two bus numbers at the
most, one for the root port, and another for the PCIe-PCI bridge. In the
hot-plugged case, you have to start with the cold-plugged root port just
the same (so that you can communicate the bus number reservation *at
all*), and then reserve (= use up in advance) the bus number, the IO
space, and the MMIO space(s). I don't see the difference; hot-plugging
the PCIe-PCI bridge (= not committing in advance) doesn't seem to save
any resources.

I guess I would see a difference if we reserved more than one bus number
in the hotplug case, namely in order to support recursive hotplug under
the PCIe-PCI bridge. But, you confirmed that we intend to keep the flat
hierarchy (ie the exercise is only for enabling legacy PCI endpoints,
not for recursive hotplug).  The PCIe-PCI bridge isn't a device that
does anything at all on its own, so why not just coldplug it? Its
resources have to be reserved in advance anyway.

So, thus far I would say "just cold-plug the PCIe-PCI bridge at startup,
possibly even make it an integrated device, and then you don't need to
reserve bus numbers (and other apertures)".

Where am I wrong?

[snip]

>> (4) Whether the reservation size should be absolute or relative (raised
>> by Gerd). IIUC, Gerd suggests that the absolute aperture size should be
>> specified (as a minimum), not the increment / reservation for hotplug
>> purposes.
>>
>> The Platform Initialization Specification, v1.6, downloadable at
>> <http://www.uefi.org/specs>, writes the following under
>>
>>EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding()
>>
>> in whose implementation I will have to parse the values from the
>> capability structure, and return the appropriate representation to the
>> platform-independent PciBusDxe driver (i.e., the enumeration /
>> allocation agent):
>>
>>> The padding is returned in the form of ACPI (2.0 & 3.0) resource
>>> descriptors. The exact definition of each of the fields is the same as
>>> in the
>>> EFI_PCI_HOST_BRIDGE_RESOURCE_ALLOCATION_PROTOCOL.SubmitResources()
>>> function. See the section 10.8 for the definition of this function.
>>>
>>> The PCI bus driver is responsible for adding this resource request to
>>> the resource requests by the physical PCI devices. If Attributes is
>>> EfiPaddingPciBus, the padding takes effect at the PCI bus level. If
>>> Attributes is EfiPaddingPciRootBridge, the required padding takes
>>> effect at the root bridge level. For details, see the definition of
>>> EFI_HPC_PADDING_ATTRIBUTES in "Related Definitions" below.
>>
>> Emphasis on "*adding* this resource request to the resource requests by
>> the physical PCI devices".
>>
>> However... After checking

Re: [Qemu-devel] [RFC PATCH v2 0/4] Allow RedHat PCI bridges reserve more buses than necessary during init

2017-07-26 Thread Laszlo Ersek
On 07/26/17 18:22, Marcel Apfelbaum wrote:
> On 26/07/2017 18:20, Laszlo Ersek wrote:

[snip]

>> However, what does the hot-pluggability of the PCIe-PCI bridge buy us?
>> In other words, what does it buy us when we do not add the PCIe-PCI
>> bridge immediately at guest startup, as an integrated device?
>>
>> Why is it a problem to "commit" in advance? I understand that we might
>> not like the DMI-PCI bridge (due to it being legacy), but what speaks
>> against cold-plugging the PCIe-PCI bridge either as an integrated device
>> in pcie.0 (assuming that is permitted), or cold-plugging the PCIe-PCI
>> bridge in a similarly cold-plugged PCIe root port?
>>
> 
> We want to keep Q35 clean, and for most cases we don't want any
> legacy PCI stuff if not especially required.
> 
>> I mean, in the cold-plugged case, you use up two bus numbers at the
>> most, one for the root port, and another for the PCIe-PCI bridge. In the
>> hot-plugged case, you have to start with the cold-plugged root port just
>> the same (so that you can communicate the bus number reservation *at
>> all*), and then reserve (= use up in advance) the bus number, the IO
>> space, and the MMIO space(s). I don't see the difference; hot-plugging
>> the PCIe-PCI bridge (= not committing in advance) doesn't seem to save
>> any resources.
>>
> 
> It's not about resources, more about the usage model.
> 
>> I guess I would see a difference if we reserved more than one bus number
>> in the hotplug case, namely in order to support recursive hotplug under
>> the PCIe-PCI bridge. But, you confirmed that we intend to keep the flat
>> hierarchy (ie the exercise is only for enabling legacy PCI endpoints,
>> not for recursive hotplug).  The PCIe-PCI bridge isn't a device that
>> does anything at all on its own, so why not just coldplug it? Its
>> resources have to be reserved in advance anyway.
>>
> 
> Even if we prefer flat hierarchies, we should allow a sane nested
> bridges configuration, so we will sometimes reserve more than one.
> 
>> So, thus far I would say "just cold-plug the PCIe-PCI bridge at startup,
>> possibly even make it an integrated device, and then you don't need to
>> reserve bus numbers (and other apertures)".
>>
>> Where am I wrong?
>>
> 
> Nothing wrong, I am just looking for feature parity Q35 vs PC.
> Users may want to continue using [nested] PCI bridges, and
> we want the Q35 machine to be used by more users in order
> to make it reliable faster, while keeping it clean by default.
> 
> We had a discussion on this matter on last year KVM forum
> and the hot-pluggable PCIe-PCI bridge was the general consensus.

OK. I don't want to question or go back on that consensus now; I'd just
like to point out that all that you describe (nested bridges, and
enabling legacy PCI with PCIe-PCI bridges, *on demand*) is still
possible with cold-plugging.

I.e., the default setup of Q35 does not need to include legacy PCI
bridges. It's just that the pre-launch configuration effort for a Q35
user to *reserve* resources for legacy PCI is the exact same as the
pre-launch configuration effort to *actually cold-plug* the bridge.

[snip]

>>>> The PI spec says,
>>>>
>>>>> [...] For all the root HPCs and the nonroot HPCs, call
>>>>> EFI_PCI_HOT_PLUG_INIT_PROTOCOL.GetResourcePadding() to obtain the
>>>>> amount of overallocation and add that amount to the requests from the
>>>>> physical devices. Reprogram the bus numbers by taking into account the
>>>>> bus resource padding information. [...]
>>>>
>>>> However, according to my interpretation of the source code, PciBusDxe
>>>> does not consider bus number padding for non-root HPCs (which are "all"
>>>> HPCs on QEMU).
>>>>
>>>
>>> Theoretically speaking, it is possible to change the behavior, right?
>>
>> Not just theoretically; in the past I have changed PciBusDxe -- it
>> wouldn't identify QEMU's hotplug controllers (root port, downstream port
>> etc) appropriately, and I managed to get some patches in. It's just that
>> the less we understand the current code and the more intrusive/extensive
>> the change is, the harder it is to sell the *idea*. PciBusDxe is
>> platform-independent and shipped on many a physical system too.
>>
> 
> Understood, but from your explanation it sounds like the existing
> callback sites (hooks) are enough.

That's the problem: they don't appear to, if you consider bus number
reservations. The existing callback sites seem fine regarding IO and
MMIO, but the only callback site that honors bus number reservation is
limited to "root" (in the previously defined sense) hotplug controllers.

So this is something that will need investigation, and my most recent
queries into the "hotplug preparation" parts of PciBusDxe indicate that
those parts are quite... "forgotten". :) I guess this might be because
on physical systems the level of PCI(e) hotpluggery that we plan to do
is likely unheard of :)

Thanks!
Laszlo



Re: [Qemu-devel] [SeaBIOS] [RFC PATCH v2 4/6] hw/pci: introduce bridge-only vendor-specific capability to provide some hints to firmware

2017-07-26 Thread Laszlo Ersek
On 07/26/17 23:54, Alexander Bezzubikov wrote:
> 2017-07-26 22:43 GMT+03:00 Michael S. Tsirkin :
>> On Sun, Jul 23, 2017 at 01:15:41AM +0300, Aleksandr Bezzubikov wrote:

>>> +PCIBridgeQemuCap cap;
>>
>> This leaks info to guest. You want to init all fields here:
>>
>> cap = {
>>  .len = 
>> };
> 
> I surely can do this for len field, but as Laszlo proposed
> we can use mutually exclusive fields,
> e.g. pref_32 and pref_64, the only way I have left
> is to use ternary operator (if we surely need this
> big initializer). Keeping some if's would look better,
> I think.

I think it's fine to use "if"s in order to set up the structure
partially / gradually, but then please clear the structure up-front:


  PCIBridgeQemuCap cap = { 0 };

(In general "{ 0 }" is the best initializer ever, because it can
zero-init a variable of *any* type at all. Gcc might complain about the
inexact depth of {} nesting of course, but it's nonetheless valid C.)

Or else add a memset-to-zero.

Or else, do just

  PCIBridgeQemuCap cap = { .len = ... };

which will zero-fill every other field. ("[...] all subobjects that are
not initialized explicitly shall be initialized implicitly the same as
objects that have static storage duration").

Thanks
Laszlo



Re: [Qemu-devel] [SeaBIOS] [RFC PATCH v2 4/6] hw/pci: introduce bridge-only vendor-specific capability to provide some hints to firmware

2017-07-27 Thread Laszlo Ersek
On 07/27/17 11:39, Marcel Apfelbaum wrote:
> On 27/07/2017 2:28, Michael S. Tsirkin wrote:
>> On Thu, Jul 27, 2017 at 12:54:07AM +0300, Alexander Bezzubikov wrote:
>>> 2017-07-26 22:43 GMT+03:00 Michael S. Tsirkin :
 On Sun, Jul 23, 2017 at 01:15:41AM +0300, Aleksandr Bezzubikov wrote:
> On PCI init PCI bridges may need some
> extra info about bus number to reserve, IO, memory and
> prefetchable memory limits. QEMU can provide this
> with special

 with a special

> vendor-specific PCI capability.
>
> Sizes of limits match ones from
> PCI Type 1 Configuration Space Header,
> number of buses to reserve occupies only 1 byte
> since it is the size of Subordinate Bus Number register.
>
> Signed-off-by: Aleksandr Bezzubikov 
> ---
>   hw/pci/pci_bridge.c | 27 +++
>   include/hw/pci/pci_bridge.h | 18 ++
>   2 files changed, 45 insertions(+)
>
> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
> index 720119b..8ec6c2c 100644
> --- a/hw/pci/pci_bridge.c
> +++ b/hw/pci/pci_bridge.c
> @@ -408,6 +408,33 @@ void pci_bridge_map_irq(PCIBridge *br, const
> char* bus_name,
>   br->bus_name = bus_name;
>   }
>
> +
> +int pci_bridge_help_cap_init(PCIDevice *dev, int cap_offset,

 help? should be qemu_cap_init?

> +  uint8_t bus_reserve, uint32_t io_limit,
> +  uint16_t mem_limit, uint64_t
> pref_limit,
> +  Error **errp)
> +{
> +size_t cap_len = sizeof(PCIBridgeQemuCap);
> +PCIBridgeQemuCap cap;

 This leaks info to guest. You want to init all fields here:

 cap = {
   .len = 
 };
>>>
>>> I surely can do this for len field, but as Laszlo proposed
>>> we can use mutually exclusive fields,
>>> e.g. pref_32 and pref_64, the only way I have left
>>> is to use ternary operator (if we surely need this
>>> big initializer). Keeping some if's would look better,
>>> I think.
>>>

> +
> +cap.len = cap_len;
> +cap.bus_res = bus_reserve;
> +cap.io_lim = io_limit & 0xFF;
> +cap.io_lim_upper = io_limit >> 8 & 0xFFFF;
> +cap.mem_lim = mem_limit;
> +cap.pref_lim = pref_limit & 0xFFFF;
> +cap.pref_lim_upper = pref_limit >> 16 & 0xFFFFFFFF;

 Please use pci_set_word etc or cpu_to_leXX.

>>>
>>> Since now we've decided to avoid separating fields into lower +
>>> upper parts,
>>> this bitmask along with pci_set_word are no longer needed.
>>>
 I think it's easiest to replace struct with a set of macros then
 pci_set_word does the work for you.

>>>
>>> I don't really want to use macros here because structure
>>> show us the whole capability layout and this can
>>> decrease documenting efforts. More than that,
>>> memcpy usage is very convenient here, and I wouldn't like
>>> to lose it.
>>>

> +
> +int offset = pci_add_capability(dev, PCI_CAP_ID_VNDR,
> +cap_offset, cap_len, errp);
> +if (offset < 0) {
> +return offset;
> +}
> +
> +memcpy(dev->config + offset + 2, (char *)&cap + 2, cap_len - 2);

 +2 is yacky. See how virtio does it:

  memcpy(dev->config + offset + PCI_CAP_FLAGS, &cap->cap_len,
 cap->cap_len - PCI_CAP_FLAGS);


>>>
>>> OK.
>>>
> +return 0;
> +}
> +
>   static const TypeInfo pci_bridge_type_info = {
>   .name = TYPE_PCI_BRIDGE,
>   .parent = TYPE_PCI_DEVICE,
> diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
> index ff7cbaa..c9f642c 100644
> --- a/include/hw/pci/pci_bridge.h
> +++ b/include/hw/pci/pci_bridge.h
> @@ -67,4 +67,22 @@ void pci_bridge_map_irq(PCIBridge *br, const
> char* bus_name,
>   #define  PCI_BRIDGE_CTL_DISCARD_STATUS   0x400   /* Discard
> timer status */
>   #define  PCI_BRIDGE_CTL_DISCARD_SERR 0x800   /* Discard timer
> SERR# enable */
>
> +typedef struct PCIBridgeQemuCap {
> +uint8_t id; /* Standard PCI capability header field */
> +uint8_t next;   /* Standard PCI capability header field */
> +uint8_t len;/* Standard PCI vendor-specific capability
> header field */
> +uint8_t bus_res;
> +uint32_t pref_lim_upper;

 Big endian? Ugh.

>>>
>>> Agreed, and this's gonna to disappear with
>>> the new layout.
>>>
> +uint16_t pref_lim;
> +uint16_t mem_lim;

 I'd say we need 64 bit for memory.

>>>
>>> Why? Non-prefetchable MEMORY_LIMIT register is 16 bits long.
>>
>> Hmm ok, but e.g. for io there are bridges that have extra registers
>> to specify non-standard non-aligned registers.
>>
> +uint16_t io_lim_upper;
> +uint8_t io_lim;
> +uint8_t padding

Re: [Qemu-devel] [SeaBIOS] Commit 77af8a2b95b79699de650965d5228772743efe84 breaks Windows 2000 support

2017-07-27 Thread Laszlo Ersek
On 07/27/17 16:59, Kevin O'Connor wrote:
> On Wed, Jul 26, 2017 at 04:21:23PM -0400, Paolo Bonzini wrote:

>>> C - We'd be introducing "shared ownership" of the acpi tables.  Some
>>> of the tables would be produced by QEMU and some of them by
>>> SeaBIOS.  Explaining when and why to future developers would be
>>> a challenge.
>>
>> The advantage is that the same shared ownership is already present in
>> OVMF.  The RSDP/RSDT/XSDT are entirely created by the firmware in
>> OVMF. (The rev1 FADT isn't but that's just missing code; the table
>> manager in general would be ready for that).  In any case this
>> doesn't seem like something that cannot be solved by code comments.
>
> I'd argue that the shared ownership in the EDK2 code was a poor design
> choice.

The reason we can't just exclude the reference implementation of
EFI_ACPI_TABLE_PROTOCOL from OVMF whole-sale, and reimplement the ACPI
linker/loader from scratch, is that some other (independent) edk2
modules will want to use EFI_ACPI_TABLE_PROTOCOL for installing their
own (one-off) tables, such as IBFT, BGRT and so on, *in addition to*
QEMU's. Given that these ACPI tables mostly do *not* describe hardware
(but software features and/or configuration), it's hard to claim that
they should also be generated by QEMU.

Therefore the dual origin for ACPI tables looks unavoidable in UEFI;
it's just that there should be a much more flexible connection from
QEMU's linker/loader to the installed ACPI tables than
EFI_ACPI_TABLE_PROTOCOL.

Basically this is a fight over ownership. Each of QEMU's ACPI
linker/loader and EFI_ACPI_TABLE_PROTOCOL thinks that it fully owns the
root of the table tree. :(

> Case in point - we're only having this conversation because of its
> limitations - SeaBIOS is capable of deploying the acpi tables in the
> proposed layout without any code changes today.

Yes.

But let's not forget that SeaBIOS is capable of delegating the full
low-level construction of the table tree to QEMU because no independent
/ 3rd party BIOS-level code wants to install its own tables (again,
IBFT, BGRT, ...). This is not true of UEFI, where the guiding principle
of the standardized interfaces is to enable cooperation between
independent, binary-only modules. (So, for example, if you shove a new
PCI add-on card in your motherboard, the UEFI driver in that oprom could
install a separate ACPI table, by looking up and calling
EFI_ACPI_TABLE_PROTOCOL.)

> I'm not against changing SeaBIOS, but it's a priority for me that we
> continue to make it possible to deploy future ACPI table changes (no
> matter how quirky) in a way that does not require future SeaBIOS
> releases.

It's a good goal.

I apologize for forgetting the context, but what exactly was the
argument against:

- splitting modern ACPI generation from ancient ACPI generation (so that
we can assign separate maintainers to ancient vs. modern),

- restricting ancient ACPI generation to old machine types?

Thanks,
Laszlo



Re: [Qemu-devel] [qemu PATCH for 2.10] i386: acpi: provide an XSDT instead of an RSDT

2017-07-28 Thread Laszlo Ersek
On 07/27/17 22:40, Kevin O'Connor wrote:
> On Wed, Jul 26, 2017 at 11:31:36AM +0200, Paolo Bonzini wrote:
>> The tables that QEMU provides are not ACPI 1.0 compatible since commit
>> 77af8a2b95 ("hw/i386: Use Rev3 FADT (ACPI 2.0) instead of Rev1 to improve
>> guest OS support.", 2017-05-03).  This is visible with Windows 2000,
>> which refuses to parse the rev3 FADT and fails to boot.
>>
>> The recommended solution in this case is to build two FADTs, v1 being
>> pointed to by the RSDT and v3 by the XSDT.  However, we leave this task
>> to the firmware.  This patch simply switches the RSDT to the XSDT, which
>> is valid for all ACPI 2.0-friendly operating systems and also leaves
>> SeaBIOS the freedom to build an RSDT that points to the compatibility
>> FADT.
> 
> Another possible solution to this issue would be for QEMU to instruct
> the firmware to build both rev1 and rev3 FADTs, but be clear which
> links are for legacy purposes only.  This could be done with a new
> ADD_LEGACY_POINTER linker loader command.  Existing firmwares should
> ignore the new ADD_LEGACY_POINTER command and new versions of SeaBIOS
> could be extended to honor it.

I confirm OVMF ignores (skips) unknown commands.

But, so I can understand better, can you please explain what the effect
of these patches would be? IIUC, some pointer updates would not be
performed in OVMF (and old SeaBIOS) that would take place in new
SeaBIOS. What pointers are these exactly (where do they live and what do
they point at)?

- RSDT[0] would point to FADTv1, RSDT[n] (n>=1) would point to the rest
of the tables, and OVMF wouldn't set (or follow) any of these pointers,
- XSDT[0] would point to FADTv3, XSDT[n] (n>=1) would point to the rest
of the tables, and both SeaBIOS and OVMF would see these pointers,
- RSDP.RSDT would point to the RSDT, and OVMF would not see (or follow)
this pointer,
- RSDP.XSDT would point to the XSDT, and both SeaBIOS and OVMF would see
this pointer.

Is this a correct interpretation? If so, I think it would work for OVMF.

First, OVMF would not patch RSDP.RSDT, nor RSDT[n] (n>=0).

Second, in the 2nd phase processing of pointers, OVMF would not follow
the RSDP.RSDT link for identifying the pointed-to RSDT for installation
(which is not really relevant, since OVMF skips the installation of the
RSDT anyway, when it recognizes it).

Third, none of the RSDT[n] links would be followed for identifying other
tables for installation, meaning that neither FADTv1 nor the other
(commonly used) tables would be identified / installed via the RSDT. The
XSDT would work as it does now: FADTv3 plus the rest of the tables would
be installed from it.

Fourth, what about the links within the FADTv1 (to the FACS and DSDT)?
AFAICS in build_fadt1(), those pointers continue to be patched with the
non-legacy ADD_POINTER command. This is not necessarily a problem if
FADTv1.FACS and FADTv3.FACS point to the exact same address (similarly
if FADTv1.DSDT and FADTv3.DSDT point to the exact same address), because
OVMF already has a kind of memoization against installing the exact same
pointed-to table twice (e.g., when FADTv3.DSDT and FADTv3.X_DSDT refer
to the same address). Still, for completeness, maybe the FADTv1.FACS and
FADTv1.DSDT pointers should also be patched with the new
ADD_LEGACY_POINTER command, in build_fadt1().

Basically, once you split a pointer between the RSDT "tree" and the XSDT
"tree", all the pointers to ACPI data tables in that table-subtree
(recursively) should be patched accordingly (all legacy or all
non-legacy). Pointers to other things than ACPI data tables need no
special handling (as their identification / probing is already
suppressed with suitable zero prefixes).

Thanks!
Laszlo

> 
> I proto-typed it (but haven't done significant testing).  Admittedly,
> it is a pretty ugly hack.
> 
> -Kevin
> 
> 
> == SeaBIOS patch ===
> 
> --- a/src/fw/romfile_loader.c
> +++ b/src/fw/romfile_loader.c
> @@ -234,6 +234,7 @@ int romfile_loader_execute(const char *name)
>  case ROMFILE_LOADER_COMMAND_ALLOCATE:
>  romfile_loader_allocate(entry, files);
>  break;
> +case ROMFILE_LOADER_COMMAND_ADD_LEGACY_POINTER:
>  case ROMFILE_LOADER_COMMAND_ADD_POINTER:
>  romfile_loader_add_pointer(entry, files);
>  break;
> diff --git a/src/fw/romfile_loader.h b/src/fw/romfile_loader.h
> index fcd4ab2..4e266e8 100644
> --- a/src/fw/romfile_loader.h
> +++ b/src/fw/romfile_loader.h
> @@ -77,6 +77,7 @@ enum {
>  ROMFILE_LOADER_COMMAND_ADD_POINTER= 0x2,
>  ROMFILE_LOADER_COMMAND_ADD_CHECKSUM   = 0x3,
>  ROMFILE_LOADER_COMMAND_WRITE_POINTER  = 0x4,
> +ROMFILE_LOADER_COMMAND_ADD_LEGACY_POINTER = 0x5,
>  };
>  
>  enum {
> 
> 
> == QEMU patch ===
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 36a6cc4..eed1a2c 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1576,7

Re: [Qemu-devel] [PATCH v4 2/8] acpi: add vmcoreinfo device

2017-07-28 Thread Laszlo Ersek
On 07/28/17 16:52, Marc-André Lureau wrote:
> Hi Dave
> 
> On Wed, Jul 26, 2017 at 10:21 AM, Michael S. Tsirkin  wrote:
>> On Sat, Jul 15, 2017 at 01:47:50AM +0200, Marc-André Lureau wrote:

 There's more info scattered in other places.

 Why do you get to document it? Because you are the one exposing it
 across the hypervisor/vm boundary where it will need to be
 understood by people/tools not running within guest.

 So "just read the script in qemu source" is not how an interface
 should be documented.
>>>
>>> I don't understand the issue, it's a kernel ELF note that qemu passes
>>> for dump/crash tools in the dump headers/sections.
>>
>> The way it looks to me, this patchset is exposing an internal kernel
>> detail and making it part of ABI maybe it already is, my point was 1.
>> should we get a confirmation from upstream it's not going to change? 2.
>> if it's ABI let's document what do we expect to be there.
> 
> 
> Could you help explain the expectations and stability guarantees of
> vmcoreinfo ELF note ?
> 
> I am a bit stuck here, after all, vmcoreinfo is mostly used by crash
> so I thought you could help.
> 
> The only thing qemu does with it is try to get NUMBER(phys_base)=
> field to update the phys_base used in the various dump headers. (this
> could be dropped, and qemu ignoring the note content, if the debug
> tools take vmcoreinfo values  with higher priority than other header
> fields)

I agree; if "crash" guarantees that the vmcoreinfo note will override
whatever phys_base value QEMU may have guessed otherwise (from other
places) and written to some dedicated phys_base header fields, then in
QEMU we don't have to propagate phys_base from the vmcoreinfo note to
said other fields -- we can treat the vmcoreinfo note entirely opaquely.

Thanks
Laszlo

> 
>> But again since there's not a whole lot of documentation here
>> that you provided, I might be misunderstanding completely.
> 
> Because there isn't much available in the kernel either, except
> Documentation/ABI/testing/sysfs-kernel-vmcoreinfo.
> 
> 




Re: [Qemu-devel] [SeaBIOS] [RFC PATCH v2 4/6] hw/pci: introduce bridge-only vendor-specific capability to provide some hints to firmware

2017-07-31 Thread Laszlo Ersek
On 07/29/17 01:15, Michael S. Tsirkin wrote:
> On Thu, Jul 27, 2017 at 03:58:58PM +0200, Laszlo Ersek wrote:
>> On 07/27/17 11:39, Marcel Apfelbaum wrote:
>>> On 27/07/2017 2:28, Michael S. Tsirkin wrote:
>>>> On Thu, Jul 27, 2017 at 12:54:07AM +0300, Alexander Bezzubikov wrote:
>>>>> 2017-07-26 22:43 GMT+03:00 Michael S. Tsirkin :
>>>>>> On Sun, Jul 23, 2017 at 01:15:41AM +0300, Aleksandr Bezzubikov wrote:
>>>>>>> On PCI init PCI bridges may need some
>>>>>>> extra info about bus number to reserve, IO, memory and
>>>>>>> prefetchable memory limits. QEMU can provide this
>>>>>>> with special
>>>>>>
>>>>>> with a special
>>>>>>
>>>>>>> vendor-specific PCI capability.
>>>>>>>
>>>>>>> Sizes of limits match ones from
>>>>>>> PCI Type 1 Configuration Space Header,
>>>>>>> number of buses to reserve occupies only 1 byte
>>>>>>> since it is the size of Subordinate Bus Number register.
>>>>>>>
>>>>>>> Signed-off-by: Aleksandr Bezzubikov 
>>>>>>> ---
>>>>>>>   hw/pci/pci_bridge.c | 27 +++
>>>>>>>   include/hw/pci/pci_bridge.h | 18 ++
>>>>>>>   2 files changed, 45 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
>>>>>>> index 720119b..8ec6c2c 100644
>>>>>>> --- a/hw/pci/pci_bridge.c
>>>>>>> +++ b/hw/pci/pci_bridge.c
>>>>>>> @@ -408,6 +408,33 @@ void pci_bridge_map_irq(PCIBridge *br, const
>>>>>>> char* bus_name,
>>>>>>>   br->bus_name = bus_name;
>>>>>>>   }
>>>>>>>
>>>>>>> +
>>>>>>> +int pci_bridge_help_cap_init(PCIDevice *dev, int cap_offset,
>>>>>>
>>>>>> help? should be qemu_cap_init?
>>>>>>
>>>>>>> +  uint8_t bus_reserve, uint32_t io_limit,
>>>>>>> +  uint16_t mem_limit, uint64_t
>>>>>>> pref_limit,
>>>>>>> +  Error **errp)
>>>>>>> +{
>>>>>>> +size_t cap_len = sizeof(PCIBridgeQemuCap);
>>>>>>> +PCIBridgeQemuCap cap;
>>>>>>
>>>>>> This leaks info to guest. You want to init all fields here:
>>>>>>
>>>>>> cap = {
>>>>>>   .len = 
>>>>>> };
>>>>>
>>>>> I surely can do this for len field, but as Laszlo proposed
>>>>> we can use mutually exclusive fields,
>>>>> e.g. pref_32 and pref_64, the only way I have left
>>>>> is to use ternary operator (if we surely need this
>>>>> big initializer). Keeping some if's would look better,
>>>>> I think.
>>>>>
>>>>>>
>>>>>>> +
>>>>>>> +cap.len = cap_len;
>>>>>>> +cap.bus_res = bus_reserve;
>>>>>>> +cap.io_lim = io_limit & 0xFF;
>>>>>>> +cap.io_lim_upper = io_limit >> 8 & 0x;
>>>>>>> +cap.mem_lim = mem_limit;
>>>>>>> +cap.pref_lim = pref_limit & 0x;
>>>>>>> +cap.pref_lim_upper = pref_limit >> 16 & 0x;
>>>>>>
>>>>>> Please use pci_set_word etc or cpu_to_leXX.
>>>>>>
>>>>>
>>>>> Since now we've decided to avoid fields separation into  +
>>>>> ,
>>>>> this bitmask along with pci_set_word are no longer needed.
>>>>>
>>>>>> I think it's easiest to replace struct with a set of macros then
>>>>>> pci_set_word does the work for you.
>>>>>>
>>>>>
>>>>> I don't really want to use macros here because structure
>>>>> show us the whole capability layout and this can
>>>>> decrease documenting efforts. More than that,
>>>>> memcpy usage is very convenient here, and I wouldn't like
>>>>> to lose it.
>>>>>
>>>>>

Re: [Qemu-devel] [SeaBIOS] [RFC PATCH v2 4/6] hw/pci: introduce bridge-only vendor-specific capability to provide some hints to firmware

2017-07-31 Thread Laszlo Ersek
On 07/31/17 20:55, Michael S. Tsirkin wrote:
> On Mon, Jul 31, 2017 at 08:16:49PM +0200, Laszlo Ersek wrote:
>> OK. If the proposed solution with the r/o mem base/limit registers is
>> rooted in the spec (and I think it indeed must be; apparently this would
>> be the same as what we're already planning for IO disablement), then
>> that's a strong argument for PciBusDxe to accommodate this probing in
>> the platform hook.
>>
>> Thanks
>> Laszlo
> 
> Do you mean making base/limit read-only?

Yes, I do. (Perhaps writing "r/o" was too terse.)

Thanks
Laszlo




Re: [Qemu-devel] [PATCH v3 5/5] docs: update documentation considering PCIE-PCI bridge

2017-08-01 Thread Laszlo Ersek
(Whenever my comments conflict with Michael's or Marcel's, I defer to them.)

On 07/29/17 01:37, Aleksandr Bezzubikov wrote:
> Signed-off-by: Aleksandr Bezzubikov 
> ---
>  docs/pcie.txt|  46 ++
>  docs/pcie_pci_bridge.txt | 121 
> +++
>  2 files changed, 147 insertions(+), 20 deletions(-)
>  create mode 100644 docs/pcie_pci_bridge.txt
> 
> diff --git a/docs/pcie.txt b/docs/pcie.txt
> index 5bada24..338b50e 100644
> --- a/docs/pcie.txt
> +++ b/docs/pcie.txt
> @@ -46,7 +46,7 @@ Place only the following kinds of devices directly on the 
> Root Complex:
>  (2) PCI Express Root Ports (ioh3420), for starting exclusively PCI 
> Express
>  hierarchies.
>  
> -(3) DMI-PCI Bridges (i82801b11-bridge), for starting legacy PCI
> +(3) PCIE-PCI Bridge (pcie-pci-bridge), for starting legacy PCI
>  hierarchies.
>  
>  (4) Extra Root Complexes (pxb-pcie), if multiple PCI Express Root Buses

When reviewing previous patches modifying / adding this file, I
requested that we spell out "PCI Express" every single time. I'd like to
see the same in this patch, if possible.

> @@ -55,18 +55,18 @@ Place only the following kinds of devices directly on the 
> Root Complex:
> pcie.0 bus
> 
> 
>  |||  |
> -   ---   --   --   --
> -   | PCI Dev |   | PCIe Root Port |   | DMI-PCI Bridge |   |  pxb-pcie  |
> -   ---   --   --   --
> +   ---   --   ---   --
> +   | PCI Dev |   | PCIe Root Port |   | PCIE-PCI Bridge |   |  pxb-pcie  |
> +   ---   --   ---   --
>  
>  2.1.1 To plug a device into pcie.0 as a Root Complex Integrated Endpoint use:
>-device [,bus=pcie.0]
>  2.1.2 To expose a new PCI Express Root Bus use:
>-device pxb-pcie,id=pcie.1,bus_nr=x[,numa_node=y][,addr=z]
> -  Only PCI Express Root Ports and DMI-PCI bridges can be connected
> +  Only PCI Express Root Ports, PCIE-PCI bridges and DMI-PCI bridges can 
> be connected

It would be nice if we could keep the flowing text wrapped to 80 chars.

Also, here you add the "PCI Express-PCI" bridge to the list of allowed
controllers (and you keep DMI-PCI as permitted), but above DMI was
replaced. I think these should be made consistent -- we should make up
our minds if we continue to recommend the DMI-PCI bridge or not. If not,
then we should eradicate all traces of it. If we want to keep it at
least for compatibility, then it should remain as fully documented as it
is now.

>to the pcie.1 bus:
>-device 
> ioh3420,id=root_port1[,bus=pcie.1][,chassis=x][,slot=y][,addr=z]  
>\
> -  -device i82801b11-bridge,id=dmi_pci_bridge1,bus=pcie.1
> +  -device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
>  
>  
>  2.2 PCI Express only hierarchy
> @@ -130,21 +130,25 @@ Notes:
>  Legacy PCI devices can be plugged into pcie.0 as Integrated Endpoints,
>  but, as mentioned in section 5, doing so means the legacy PCI
>  device in question will be incapable of hot-unplugging.
> -Besides that use DMI-PCI Bridges (i82801b11-bridge) in combination
> +Besides that use PCIE-PCI Bridges (pcie-pci-bridge) in combination
>  with PCI-PCI Bridges (pci-bridge) to start PCI hierarchies.
> +Instead of the PCIE-PCI Bridge DMI-PCI one can be used,
> +but it doens't support hot-plug, is not crossplatform and since that

s/doens't/doesn't/

s/since that/therefore it/

> +is obsolete and deprecated. Use the PCIE-PCI Bridge if you're not 
> +absolutely sure you need the DMI-PCI Bridge.
>  
> -Prefer flat hierarchies. For most scenarios a single DMI-PCI Bridge
> +Prefer flat hierarchies. For most scenarios a single PCIE-PCI Bridge
>  (having 32 slots) and several PCI-PCI Bridges attached to it
>  (each supporting also 32 slots) will support hundreds of legacy devices.
> -The recommendation is to populate one PCI-PCI Bridge under the DMI-PCI Bridge
> +The recommendation is to populate one PCI-PCI Bridge under the PCIE-PCI 
> Bridge
>  until is full and then plug a new PCI-PCI Bridge...
>  
> pcie.0 bus
> --
>  ||
> -   ---   --
> -   | PCI Dev |   | DMI-PCI BRIDGE |
> -   ----
> +   ---   ---
> +   | PCI Dev |   | PCIE-PCI BRIDGE |
> +   -----
> ||
>----
>| PCI-PCI Bridge || PCI-PCI Bridge |   ..

Re: [Qemu-devel] [PATCH v3 5/5] docs: update documentation considering PCIE-PCI bridge

2017-08-01 Thread Laszlo Ersek
On 08/01/17 23:39, Michael S. Tsirkin wrote:
> On Wed, Aug 02, 2017 at 12:33:12AM +0300, Alexander Bezzubikov wrote:
>> 2017-08-01 23:31 GMT+03:00 Laszlo Ersek :
>>> (Whenever my comments conflict with Michael's or Marcel's, I defer to them.)
>>>
>>> On 07/29/17 01:37, Aleksandr Bezzubikov wrote:
>>>> Signed-off-by: Aleksandr Bezzubikov 
>>>> ---
>>>>  docs/pcie.txt|  46 ++
>>>>  docs/pcie_pci_bridge.txt | 121 
>>>> +++
>>>>  2 files changed, 147 insertions(+), 20 deletions(-)
>>>>  create mode 100644 docs/pcie_pci_bridge.txt
>>>>
>>>> diff --git a/docs/pcie.txt b/docs/pcie.txt
>>>> index 5bada24..338b50e 100644
>>>> --- a/docs/pcie.txt
>>>> +++ b/docs/pcie.txt
>>>> @@ -46,7 +46,7 @@ Place only the following kinds of devices directly on 
>>>> the Root Complex:
>>>>  (2) PCI Express Root Ports (ioh3420), for starting exclusively PCI 
>>>> Express
>>>>  hierarchies.
>>>>
>>>> -(3) DMI-PCI Bridges (i82801b11-bridge), for starting legacy PCI
>>>> +(3) PCIE-PCI Bridge (pcie-pci-bridge), for starting legacy PCI
>>>>  hierarchies.
>>>>
>>>>  (4) Extra Root Complexes (pxb-pcie), if multiple PCI Express Root 
>>>> Buses
>>>
>>> When reviewing previous patches modifying / adding this file, I
>>> requested that we spell out "PCI Express" every single time. I'd like to
>>> see the same in this patch, if possible.
>>
>> OK, I didn't know it.
>>
>>>
>>>> @@ -55,18 +55,18 @@ Place only the following kinds of devices directly on 
>>>> the Root Complex:
>>>> pcie.0 bus
>>>> 
>>>> 
>>>>  |||  |
>>>> -   ---   --   --   --
>>>> -   | PCI Dev |   | PCIe Root Port |   | DMI-PCI Bridge |   |  pxb-pcie  |
>>>> -   ---   --   --   --
>>>> +   ---   --   ---   --
>>>> +   | PCI Dev |   | PCIe Root Port |   | PCIE-PCI Bridge |   |  pxb-pcie  |
>>>> +   ---   --   ---   --
>>>>
>>>>  2.1.1 To plug a device into pcie.0 as a Root Complex Integrated Endpoint 
>>>> use:
>>>>-device [,bus=pcie.0]
>>>>  2.1.2 To expose a new PCI Express Root Bus use:
>>>>-device pxb-pcie,id=pcie.1,bus_nr=x[,numa_node=y][,addr=z]
>>>> -  Only PCI Express Root Ports and DMI-PCI bridges can be connected
>>>> +  Only PCI Express Root Ports, PCIE-PCI bridges and DMI-PCI bridges 
>>>> can be connected
>>>
>>> It would be nice if we could keep the flowing text wrapped to 80 chars.
>>>
>>> Also, here you add the "PCI Express-PCI" bridge to the list of allowed
>>> controllers (and you keep DMI-PCI as permitted), but above DMI was
>>> replaced. I think these should be made consistent -- we should make up
>>> our minds if we continue to recommend the DMI-PCI bridge or not. If not,
>>> then we should eradicate all traces of it. If we want to keep it at
>>> least for compatibility, then it should remain as fully documented as it
>>> is now.
>>
>> Now I'm beginning to think that we shouldn't keep the DMI-PCI bridge
>> even for compatibility and may want to use a new PCIE-PCI bridge
>> everywhere (of course, except some cases when users are
>> sure they need exactly DMI-PCI bridge for some reason)
> 
> Can dmi-pci support shpc? why doesn't it? For compatibility?

I don't know why, but the fact that it doesn't is the reason libvirt
settled on auto-creating a dmi-pci bridge and a pci-pci bridge under
that for Q35. The reasoning was (IIRC Laine's words correctly) that the
dmi-pci bridge cannot receive hotplugged devices, while the pci-pci
bridge cannot be connected to the root complex. So both were needed.

Thanks
Laszlo



Re: [Qemu-devel] [PATCH v3 5/5] docs: update documentation considering PCIE-PCI bridge

2017-08-02 Thread Laszlo Ersek
On 08/02/17 15:47, Michael S. Tsirkin wrote:
> On Wed, Aug 02, 2017 at 12:23:46AM +0200, Laszlo Ersek wrote:
>> On 08/01/17 23:39, Michael S. Tsirkin wrote:
>>> On Wed, Aug 02, 2017 at 12:33:12AM +0300, Alexander Bezzubikov wrote:
>>>> 2017-08-01 23:31 GMT+03:00 Laszlo Ersek :
>>>>> (Whenever my comments conflict with Michael's or Marcel's, I defer to 
>>>>> them.)
>>>>>
>>>>> On 07/29/17 01:37, Aleksandr Bezzubikov wrote:
>>>>>> Signed-off-by: Aleksandr Bezzubikov 
>>>>>> ---
>>>>>>  docs/pcie.txt|  46 ++
>>>>>>  docs/pcie_pci_bridge.txt | 121 
>>>>>> +++
>>>>>>  2 files changed, 147 insertions(+), 20 deletions(-)
>>>>>>  create mode 100644 docs/pcie_pci_bridge.txt
>>>>>>
>>>>>> diff --git a/docs/pcie.txt b/docs/pcie.txt
>>>>>> index 5bada24..338b50e 100644
>>>>>> --- a/docs/pcie.txt
>>>>>> +++ b/docs/pcie.txt
>>>>>> @@ -46,7 +46,7 @@ Place only the following kinds of devices directly on 
>>>>>> the Root Complex:
>>>>>>  (2) PCI Express Root Ports (ioh3420), for starting exclusively PCI 
>>>>>> Express
>>>>>>  hierarchies.
>>>>>>
>>>>>> -(3) DMI-PCI Bridges (i82801b11-bridge), for starting legacy PCI
>>>>>> +(3) PCIE-PCI Bridge (pcie-pci-bridge), for starting legacy PCI
>>>>>>  hierarchies.
>>>>>>
>>>>>>  (4) Extra Root Complexes (pxb-pcie), if multiple PCI Express Root 
>>>>>> Buses
>>>>>
>>>>> When reviewing previous patches modifying / adding this file, I
>>>>> requested that we spell out "PCI Express" every single time. I'd like to
>>>>> see the same in this patch, if possible.
>>>>
>>>> OK, I didn't know it.
>>>>
>>>>>
>>>>>> @@ -55,18 +55,18 @@ Place only the following kinds of devices directly 
>>>>>> on the Root Complex:
>>>>>> pcie.0 bus
>>>>>> 
>>>>>> 
>>>>>>  |||  |
>>>>>> -   ---   --   --   
>>>>>> --
>>>>>> -   | PCI Dev |   | PCIe Root Port |   | DMI-PCI Bridge |   |  pxb-pcie  
>>>>>> |
>>>>>> -   ---   --   --   
>>>>>> --
>>>>>> +   ---   --   ---   
>>>>>> --
>>>>>> +   | PCI Dev |   | PCIe Root Port |   | PCIE-PCI Bridge |   |  pxb-pcie 
>>>>>>  |
>>>>>> +   ---   --   ---   
>>>>>> --
>>>>>>
>>>>>>  2.1.1 To plug a device into pcie.0 as a Root Complex Integrated 
>>>>>> Endpoint use:
>>>>>>-device [,bus=pcie.0]
>>>>>>  2.1.2 To expose a new PCI Express Root Bus use:
>>>>>>-device pxb-pcie,id=pcie.1,bus_nr=x[,numa_node=y][,addr=z]
>>>>>> -  Only PCI Express Root Ports and DMI-PCI bridges can be connected
>>>>>> +  Only PCI Express Root Ports, PCIE-PCI bridges and DMI-PCI bridges 
>>>>>> can be connected
>>>>>
>>>>> It would be nice if we could keep the flowing text wrapped to 80 chars.
>>>>>
>>>>> Also, here you add the "PCI Express-PCI" bridge to the list of allowed
>>>>> controllers (and you keep DMI-PCI as permitted), but above DMI was
>>>>> replaced. I think these should be made consistent -- we should make up
>>>>> our minds if we continue to recommend the DMI-PCI bridge or not. If not,
>>>>> then we should eradicate all traces of it. If we want to keep it at
>>>>> least for compatibility, then it should remain as fully documented as it
>>>>> is now.
>>>>
>>>> Now I'm beginning to think that we shouldn't keep the DMI-PCI bridge
>>>> even for compatibility and may want to use a new PCIE-PCI bridge
>>>> everywhere (of course, except some cases when users are
>>>> sure they need exactly DMI-PCI bridge for some reason)
>>>
>>> Can dmi-pci support shpc? why doesn't it? For compatibility?
>>
>> I don't know why, but the fact that it doesn't is the reason libvirt
>> settled on auto-creating a dmi-pci bridge and a pci-pci bridge under
>> that for Q35. The reasoning was (IIRC Laine's words correctly) that the
>> dmi-pci bridge cannot receive hotplugged devices, while the pci-pci
>> bridge cannot be connected to the root complex. So both were needed.
>>
>> Thanks
>> Laszlo
> 
> OK. Is it true that dmi-pci + pci-pci under it will allow hotplug
> on Q35 if we just flip the bit in _OSC?

Marcel, what say you?... :)



Re: [Qemu-devel] [PATCH] docs/pcie.txt: Replace ioh3420 with pcie-root-port

2017-08-02 Thread Laszlo Ersek
On 08/02/17 17:51, Marcel Apfelbaum wrote:
> Do not mention ioh3420 in the "how to" doc.
> The device still works and can be used by already
> existing setups, but no need to be mentioned.
> 
> Suggested-by: Andrew Jones 
> Signed-off-by: Marcel Apfelbaum 
> ---
>  docs/pcie.txt | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/docs/pcie.txt b/docs/pcie.txt
> index 5bada24..f990033 100644
> --- a/docs/pcie.txt
> +++ b/docs/pcie.txt
> @@ -43,8 +43,8 @@ Place only the following kinds of devices directly on the 
> Root Complex:
>  strangely when PCI Express devices are integrated
>  with the Root Complex.
>  
> -(2) PCI Express Root Ports (ioh3420), for starting exclusively PCI 
> Express
> -hierarchies.
> +(2) PCI Express Root Ports (pcie-root-port), for starting exclusively
> +PCI Express hierarchies.
>  
>  (3) DMI-PCI Bridges (i82801b11-bridge), for starting legacy PCI
>  hierarchies.
> @@ -65,7 +65,7 @@ Place only the following kinds of devices directly on the 
> Root Complex:
>-device pxb-pcie,id=pcie.1,bus_nr=x[,numa_node=y][,addr=z]
>Only PCI Express Root Ports and DMI-PCI bridges can be connected
>to the pcie.1 bus:
> -  -device 
> ioh3420,id=root_port1[,bus=pcie.1][,chassis=x][,slot=y][,addr=z]  
>\
> +  -device 
> pcie-root-port,id=root_port1[,bus=pcie.1][,chassis=x][,slot=y][,addr=z]   
>   \
>-device i82801b11-bridge,id=dmi_pci_bridge1,bus=pcie.1
>  
>  
> @@ -107,14 +107,14 @@ Plug only PCI Express devices into PCI Express Ports.
>   
>  
>  2.2.1 Plugging a PCI Express device into a PCI Express Root Port:
> -  -device 
> ioh3420,id=root_port1,chassis=x,slot=y[,bus=pcie.0][,addr=z]  \
> +  -device 
> pcie-root-port,id=root_port1,chassis=x,slot=y[,bus=pcie.0][,addr=z]  \
>-device ,bus=root_port1
>  2.2.2 Using multi-function PCI Express Root Ports:
> -  -device 
> ioh3420,id=root_port1,multifunction=on,chassis=x,addr=z.0[,slot=y][,bus=pcie.0]
>  \
> -  -device 
> ioh3420,id=root_port2,chassis=x1,addr=z.1[,slot=y1][,bus=pcie.0] \
> -  -device 
> ioh3420,id=root_port3,chassis=x2,addr=z.2[,slot=y2][,bus=pcie.0] \
> +  -device 
> pcie-root-port,id=root_port1,multifunction=on,chassis=x,addr=z.0[,slot=y][,bus=pcie.0]
>  \
> +  -device 
> pcie-root-port,id=root_port2,chassis=x1,addr=z.1[,slot=y1][,bus=pcie.0] \
> +  -device 
> pcie-root-port,id=root_port3,chassis=x2,addr=z.2[,slot=y2][,bus=pcie.0] \
>  2.2.3 Plugging a PCI Express device into a Switch:
> -  -device ioh3420,id=root_port1,chassis=x,slot=y[,bus=pcie.0][,addr=z]  \
> +  -device 
> pcie-root-port,id=root_port1,chassis=x,slot=y[,bus=pcie.0][,addr=z]  \
>-device x3130-upstream,id=upstream_port1,bus=root_port1[,addr=x]   
>\
>-device 
> xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=x1,slot=y1[,addr=z1]]
>  \
>-device ,bus=downstream_port1
> 

I trust that you found all occurrences of "ioh3420" in the doc :)

Reviewed-by: Laszlo Ersek 

Thanks
Laszlo



Re: [Qemu-devel] [PATCH 3/3] memory-mapping: skip non-volatile memory regions in GuestPhysBlockList

2018-11-05 Thread Laszlo Ersek
On 10/29/18 10:50, Paolo Bonzini wrote:
> On 03/10/2018 13:44, Marc-André Lureau wrote:
>> diff --git a/memory_mapping.c b/memory_mapping.c
>> index 775466f3a8..724dd0b417 100644
>> --- a/memory_mapping.c
>> +++ b/memory_mapping.c
>> @@ -206,7 +206,8 @@ static void guest_phys_blocks_region_add(MemoryListener 
>> *listener,
>>  
>>  /* we only care about RAM */
>>  if (!memory_region_is_ram(section->mr) ||
>> -memory_region_is_ram_device(section->mr)) {
>> +memory_region_is_ram_device(section->mr) ||
>> +memory_region_is_nonvolatile(section->mr)) {
>>  return;
>>  }
>>  
> 
> We should also have
> 
> diff --git a/scripts/dump-guest-memory.py b/scripts/dump-guest-memory.py
> index 5a857cebcf..dd180b531c 100644
> --- a/scripts/dump-guest-memory.py
> +++ b/scripts/dump-guest-memory.py
> @@ -417,7 +417,9 @@ def get_guest_phys_blocks():
>  memory_region = flat_range["mr"].dereference()
> 
>  # we only care about RAM
> -if not memory_region["ram"]:
> +if not memory_region["ram"] \
> +   or memory_region["ram_device"] \
> +   or memory_region["nonvolatile"]:
>  continue
> 
>  section_size = int128_get64(flat_range["addr"]["size"])
> 
> here.  I queued the patches and will post this soon as a separate patch.

Thanks. I keep forgetting that this logic is duplicated.

Laszlo



Re: [Qemu-devel] [PULL 09/10] scripts/dump-guest-memory: Synchronize with guest_phys_blocks_region_add

2018-11-05 Thread Laszlo Ersek
On 10/30/18 20:50, Paolo Bonzini wrote:
> Recent patches have removed ram_device and nonvolatile RAM
> from dump-guest-memory's output.  Do the same for dumps
> that are extracted from a QEMU core file.
> 
> Reviewed-by: Marc-André Lureau 
> Signed-off-by: Paolo Bonzini 
> ---
>  scripts/dump-guest-memory.py | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/scripts/dump-guest-memory.py b/scripts/dump-guest-memory.py
> index 5a857ce..f04697b 100644
> --- a/scripts/dump-guest-memory.py
> +++ b/scripts/dump-guest-memory.py
> @@ -417,7 +417,9 @@ def get_guest_phys_blocks():
>  memory_region = flat_range["mr"].dereference()
>  
>  # we only care about RAM
> -if not memory_region["ram"]:
> +if not memory_region["ram"] \
> +   or memory_region["ram_device"] \
> +   or memory_region["nonvolatile"]:
>  continue
>  
>  section_size = int128_get64(flat_range["addr"]["size"])
> 

Sorry about the late comment, I've been away.

The line continuation style in the python script is inconsistent. When I
wrote the original version, my understanding was that the "Pythonic" way
to break up lines was to open a new parenthesized subexpression. This
way the logical "or" operator could be left at the end of the line. See
e.g. in the "get_guest_phys_blocks" method.

https://www.python.org/dev/peps/pep-0008/#maximum-line-length

> The preferred way of wrapping long lines is by using Python's implied
> line continuation inside parentheses, brackets and braces. Long lines
> can be broken over multiple lines by wrapping expressions in
> parentheses. These should be used in preference to using a backslash
> for line continuation.

However, several trailing backslashes have been added since, and I've
totally failed to catch them. I guess at this point either style should
be acceptable, in this script.

Reviewed-by: Laszlo Ersek 

Thanks
Laszlo



Re: [Qemu-devel] [PULL 08/10] memory-mapping: skip non-volatile memory regions in GuestPhysBlockList

2018-11-05 Thread Laszlo Ersek
On 10/30/18 20:50, Paolo Bonzini wrote:
> From: Marc-André Lureau 
> 
> GuestPhysBlockList is currently used to produce dumps. Given the size
> and the typical usage of NVDIMM for storage, they are not a good idea
> to have in the dumps. We may want to have an extra dump option to
> include them. For now, skip non-volatile regions.
> 
> The TCG memory clear function is going to use the GuestPhysBlockList
> as well, and will thus skip NVDIMM for similar reasons.
> 
> Cc: ler...@redhat.com
> Signed-off-by: Marc-André Lureau 
> Message-Id: <20181003114454.5662-4-marcandre.lur...@redhat.com>
> Signed-off-by: Paolo Bonzini 
> ---
>  memory_mapping.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/memory_mapping.c b/memory_mapping.c
> index 775466f..724dd0b 100644
> --- a/memory_mapping.c
> +++ b/memory_mapping.c
> @@ -206,7 +206,8 @@ static void guest_phys_blocks_region_add(MemoryListener 
> *listener,
>  
>  /* we only care about RAM */
>  if (!memory_region_is_ram(section->mr) ||
> -memory_region_is_ram_device(section->mr)) {
> +memory_region_is_ram_device(section->mr) ||
> +memory_region_is_nonvolatile(section->mr)) {
>  return;
>  }
>  
> 

This patch misses my R-b, and (in chronological order) DavidH's, from:

http://mid.mail-archive.com/9fa8a684-8d5d-1644-3aee-86a196d31f8d@redhat.com
http://mid.mail-archive.com/79e58e5c-4d78-e93d-ebe8-4b1bb65752fe@redhat.com

Thanks
Laszlo



Re: [Qemu-devel] [PULL 00/33] pci, pc, virtio: fixes, features

2018-11-06 Thread Laszlo Ersek
On 11/06/18 13:39, Peter Maydell wrote:
> On 6 November 2018 at 11:20, Peter Maydell  wrote:
>> On 6 November 2018 at 11:07, Michael S. Tsirkin  wrote:
>>> On Tue, Nov 06, 2018 at 09:18:49AM +0100, Thomas Huth wrote:
>>>> On 2018-11-05 19:14, Michael S. Tsirkin wrote:
>>>>> The following changes since commit 
>>>>> b2f7a038bb4c4fc5ce6b8486e8513dfd97665e2a:
>>>>>
>>>>>   Merge remote-tracking branch 'remotes/rth/tags/pull-softfloat-20181104' 
>>>>> into staging (2018-11-05 10:32:49 +)
>>>>>
>>>>> are available in the Git repository at:
>>>>>
>>>>>   git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream
>>>>>
>>>>> for you to fetch changes up to 6196df5c8e6688c1c3f06f73442820066335337c:
>>>>>
>>>>>   vhost-scsi: prevent using uninitialized vqs (2018-11-05 12:59:35 -0500)
>>>>>
>>>>> 
>>>>> pci, pc, virtio: fixes, features
>>>>>
>>>>> AMD IOMMU VAPIC support + fixes all over the place.
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin 
>>>>>
>>>>> 
>>>>> Gerd Hoffmann (1):
>>>>>   pci-testdev: add optional memory bar
>>>>>
>>>>> Laszlo Ersek (4):
>>>>>   MAINTAINERS: list "tests/acpi-test-data" files in ACPI/SMBIOS 
>>>>> section
>>>> [...]
>>>>>  tests/{acpi-test-data => data/acpi}/pc/APIC| Bin
>>>>>  tests/{acpi-test-data => data/acpi}/pc/APIC.cphp   | Bin
>>>>>  .../{acpi-test-data => data/acpi}/pc/APIC.dimmpxm  | Bin
>>>>>  tests/{acpi-test-data => data/acpi}/pc/DSDT| Bin
>>>>
>>>> So patch 1 moves tests/acpi-test-data/ to tests/data/acpi/ and patch 20
>>>> adds an entry for tests/acpi-test-data/ ? Does not make much sense to me
>>>> ... I think patch 20 needs to be adapted now.
>>>>
>>>>  Thomas
>>>
>>> Oh right, MAINTAINERS needs to be fixed. Can be done with a patch on top
>>> though.
>>
>> Yeah, given the timing for rc0 I'll just apply this version of
>> the pullreq, and we can fix up MAINTAINERS afterwards.
> 
> ...applied.

Thanks!
Laszlo



Re: [Qemu-devel] [PATCH v4 3/3] hw/vfio/display: add ramfb support

2018-06-14 Thread Laszlo Ersek
On 06/14/18 00:36, Gerd Hoffmann wrote:
> On Wed, Jun 13, 2018 at 01:50:47PM -0600, Alex Williamson wrote:

>> I suppose in the UEFI case runtime services can be used to continue
>> writing this display,
> 
> Yes.

Small clarification for the wording -- "UEFI runtime services" do not
include anything display- or graphics-related. Instead, the OS kernel
may inherit the framebuffer properties (base address, size, pixel
format), and continue accessing the framebuffer directly.
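As a toy illustration of that hand-off (names and numbers are made up, not actual UEFI interfaces): once the OS has inherited the base address, stride, and pixel size, direct framebuffer access is plain address arithmetic:

```python
def pixel_address(fb_base, stride_bytes, bytes_per_pixel, x, y):
    # A linear framebuffer is fully described by its base address,
    # per-scanline stride and pixel format; with those inherited from
    # the firmware, the OS can draw by writing directly to computed
    # offsets -- no runtime service involved.
    return fb_base + y * stride_bytes + x * bytes_per_pixel
```

For a hypothetical 1024x768, 32-bit framebuffer at 0x8000_0000, pixel (10, 2) lives at `pixel_address(0x8000_0000, 1024 * 4, 4, 10, 2)`.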

Thanks
Laszlo



Re: [Qemu-devel] [PATCH 9/9] hw/arm/virt: Add virt-3.0 machine type

2018-06-14 Thread Laszlo Ersek
Hi Eric,

On 06/14/18 08:27, Auger Eric wrote:
> Hi Laszlo,
> 
> On 06/13/2018 11:05 PM, Laszlo Ersek wrote:
>> On 06/13/18 10:48, Eric Auger wrote:
>>
>>> PATCH: merge of ECAM and VCPU extension
>>> - Laszlo reviewed the ECAM changes but I dropped his R-b
>>>   due to the squash
>>
>> Was there any particular reason why the previous patch set (with only
>> the ECAM enlargement) couldn't be merged first? To be honest I'm not
>> super happy when my R-b is dropped for non-technical reasons; it seems
>> like wasted work for both of us.
>>
>> Obviously if there's a technical dependency or some other reason why
>> committing the ECAM enlargement in separation would be *wrong*, that's
>> different. Even in that case, wouldn't it be possible to keep the
>> initial virt-3.0 machtype addition as I reviewed it, and then add the
>> rest in an incremental patch?
> 
> Sorry about that. My fear was about migration. We would have had 2 virt
> 3.0 machine models not supporting the same features. While bisecting
> migration we could have had the source using the high mem ECAM and the
> destination not supporting it. So I preferred to avoid this trouble by
> merging the 2 features in one patch. However I may have kept your R-b
> restricting its scope to the ECAM stuff.

to my understanding, it is normal to *gradually* add new properties
during the development cycle, to the new machine type of the upcoming
QEMU release. To my understanding, it's not expected that migration work
between development snapshots built from git. What matters is that two
official releases, specifying the same machine type, enable the user to
migrate a guest between them (in forward direction).

In every release, so many new features are introduced that it's
impossible to introduce the new machine type with all the compat knobs
added at once. Instead, the new machine type is introduced when the
first feature that requires a compat knob is added to git. All other
such features extend the compat knobs gradually, during the development
cycle. Until the new official release is made (which contains all the
compat knobs for all the new features), the new machine type simply
doesn't exist, as far as the public is concerned, so it cannot partake
in migration either.

This is my understanding anyway.

Thanks!
Laszlo



Re: [Qemu-devel] [PATCH v5 4/4] Add ramfb MAINTAINERS entry

2018-06-15 Thread Laszlo Ersek
On 06/13/18 14:29, Gerd Hoffmann wrote:
> Signed-off-by: Gerd Hoffmann 
> ---
>  MAINTAINERS | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 8a94517e9e..2401028766 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1331,6 +1331,12 @@ F: hw/display/bochs-display.c
>  F: include/hw/display/vga.h
>  F: include/hw/display/bochs-vbe.h
>  
> +ramfb
> +M: Gerd Hoffmann 
> +S: Maintained
> +F: hw/display/ramfb*.c
> +F: include/hw/display/ramfb.h
> +
>  virtio-gpu
>  M: Gerd Hoffmann 
>  S: Maintained
> 

I wonder if we should have a separate "display" section in this text
file, or some such.

Reviewed-by: Laszlo Ersek 

Thanks
Laszlo



Re: [Qemu-devel] [PATCH v5 2/4] hw/display: add standalone ramfb device

2018-06-15 Thread Laszlo Ersek
On 06/13/18 14:29, Gerd Hoffmann wrote:
> Signed-off-by: Gerd Hoffmann 
> ---
>  include/hw/display/ramfb.h|  3 +++
>  hw/arm/sysbus-fdt.c   |  7 +
>  hw/arm/virt.c |  2 ++
>  hw/display/ramfb-standalone.c | 62 
> +++
>  hw/i386/pc_piix.c |  2 ++
>  hw/i386/pc_q35.c  |  2 ++
>  hw/display/Makefile.objs  |  1 +
>  7 files changed, 79 insertions(+)
>  create mode 100644 hw/display/ramfb-standalone.c

Tested-by: Laszlo Ersek 




Re: [Qemu-devel] [PATCH v5 1/4] hw/display: add ramfb, a simple boot framebuffer living in guest ram

2018-06-15 Thread Laszlo Ersek
On 06/13/18 14:29, Gerd Hoffmann wrote:
> The boot framebuffer is expected to be configured by the firmware, so it
> uses fw_cfg as interface.  Initialization goes as follows:
> 
>   (1) Check whether etc/ramfb is present.
>   (2) Allocate framebuffer from RAM.
>   (3) Fill struct RAMFBCfg, write it to etc/ramfb.
> 
> Done.  You can write stuff to the framebuffer now, and it should appear
> automagically on the screen.
> 
> Note that this isn't very efficient because it does a full display
> update on each refresh.  No dirty tracking.  Dirty tracking would have
> to be active for the whole ram slot, so that wouldn't be very efficient
> either.  For a boot display which is active for a short time only this
> isn't a big deal.  As permanent guest display something better should be
> used (if possible).
> 
> This is the ramfb core code.  Some windup is needed for display devices
> which want to have a ramfb boot display.
> 
> Signed-off-by: Gerd Hoffmann 
> ---
>  include/hw/display/ramfb.h |  9 +
>  hw/display/ramfb.c | 95 
> ++
>  hw/display/Makefile.objs   |  2 +
>  3 files changed, 106 insertions(+)
>  create mode 100644 include/hw/display/ramfb.h
>  create mode 100644 hw/display/ramfb.c

I tested this on KVM, built from Gerd's sirius/ramfb branch @ 778450b87275. 
Works fine with Gerd's corresponding edk2 QemuRamfbDxe driver (already merged):

  [edk2] [PATCH v3 0/4] Add QemuRamfbDxe driver
  http://mid.mail-archive.com/20180613072936.12480-1-kraxel@redhat.com

It also works fine with Ard's efifb patches:

  [PATCH 0/2] efi: add support for cacheable efifb mappings
  http://mid.mail-archive.com/20180615104818.23013-1-ard.biesheuvel@linaro.org

So, for this patch (not the series):

Tested-by: Laszlo Ersek 

Thanks!
Laszlo
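For reference, steps (1)-(3) above boil down to filling a small guest-visible structure and writing it to the etc/ramfb fw_cfg file. Here is a sketch of that structure, assuming the field order and big-endian encoding of hw/display/ramfb.c (worth double-checking against the QEMU source; 0x34325258 in the test is assumed to be the DRM fourcc 'XR24'/XRGB8888):

```python
import struct

# addr, fourcc, flags, width, height, stride -- all big-endian
RAMFB_CFG = struct.Struct(">QIIIII")

def pack_ramfb_cfg(addr, fourcc, flags, width, height, stride):
    # Produce the 28-byte blob the firmware would write to etc/ramfb.
    return RAMFB_CFG.pack(addr, fourcc, flags, width, height, stride)
```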



Re: [Qemu-devel] [PATCH v2 05/11] hw/arm/virt: GICv3 DT node with one or two redistributor regions

2018-06-19 Thread Laszlo Ersek
Hi Eric,

sorry about the late followup. I have one question (mainly for Ard):

On 06/15/18 16:28, Eric Auger wrote:
> This patch allows the creation of a GICv3 node with 1 or 2
> redistributor regions depending on the number of smp_cpus.
> The second redistributor region is located just after the
> existing RAM region, at 256GB, and contains up to 512 vcpus.
>
> Please refer to kernel documentation for further node details:
> Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.txt
>
> Signed-off-by: Eric Auger 
> Reviewed-by: Andrew Jones 
>
> ---
> v1 (virt3.0) -> v2
> - Added Drew's R-b
>
> v2 -> v3:
> - VIRT_GIC_REDIST2 is now 64MB large, ie. 512 redistributor capacity
> - virt_gicv3_redist_region_count does not test kvm_irqchip_in_kernel
>   anymore
> ---
>  hw/arm/virt.c | 29 -
>  include/hw/arm/virt.h | 14 ++
>  2 files changed, 38 insertions(+), 5 deletions(-)
>
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 2885d18..d9f72eb 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -148,6 +148,8 @@ static const MemMapEntry a15memmap[] = {
>  [VIRT_PCIE_PIO] =   { 0x3eff, 0x0001 },
>  [VIRT_PCIE_ECAM] =  { 0x3f00, 0x0100 },
>  [VIRT_MEM] ={ 0x4000, RAMLIMIT_BYTES },
> +/* Additional 64 MB redist region (can contain up to 512 redistributors) 
> */
> +[VIRT_GIC_REDIST2] ={ 0x40ULL, 0x400 },
>  /* Second PCIe window, 512GB wide at the 512GB boundary */
>  [VIRT_PCIE_MMIO_HIGH] =   { 0x80ULL, 0x80ULL },
>  };
> @@ -401,13 +403,30 @@ static void fdt_add_gic_node(VirtMachineState *vms)
>  qemu_fdt_setprop_cell(vms->fdt, "/intc", "#size-cells", 0x2);
>  qemu_fdt_setprop(vms->fdt, "/intc", "ranges", NULL, 0);
>  if (vms->gic_version == 3) {
> +int nb_redist_regions = virt_gicv3_redist_region_count(vms);
> +
>  qemu_fdt_setprop_string(vms->fdt, "/intc", "compatible",
>  "arm,gic-v3");
> -qemu_fdt_setprop_sized_cells(vms->fdt, "/intc", "reg",
> - 2, vms->memmap[VIRT_GIC_DIST].base,
> - 2, vms->memmap[VIRT_GIC_DIST].size,
> - 2, vms->memmap[VIRT_GIC_REDIST].base,
> - 2, vms->memmap[VIRT_GIC_REDIST].size);
> +
> +qemu_fdt_setprop_cell(vms->fdt, "/intc",
> +  "#redistributor-regions", nb_redist_regions);
> +
> +if (nb_redist_regions == 1) {
> +qemu_fdt_setprop_sized_cells(vms->fdt, "/intc", "reg",
> + 2, vms->memmap[VIRT_GIC_DIST].base,
> + 2, vms->memmap[VIRT_GIC_DIST].size,
> + 2, 
> vms->memmap[VIRT_GIC_REDIST].base,
> + 2, 
> vms->memmap[VIRT_GIC_REDIST].size);
> +} else {
> +qemu_fdt_setprop_sized_cells(vms->fdt, "/intc", "reg",
> + 2, vms->memmap[VIRT_GIC_DIST].base,
> + 2, vms->memmap[VIRT_GIC_DIST].size,
> + 2, 
> vms->memmap[VIRT_GIC_REDIST].base,
> + 2, 
> vms->memmap[VIRT_GIC_REDIST].size,
> + 2, 
> vms->memmap[VIRT_GIC_REDIST2].base,
> + 2, 
> vms->memmap[VIRT_GIC_REDIST2].size);
> +}
> +
>  if (vms->virt) {
>  qemu_fdt_setprop_cells(vms->fdt, "/intc", "interrupts",
> GIC_FDT_IRQ_TYPE_PPI, 
> ARCH_GICV3_MAINT_IRQ,

In edk2, we have the following code in
"ArmVirtPkg/Library/ArmVirtGicArchLib/ArmVirtGicArchLib.c":

  switch (GicRevision) {

  case 3:
//
// The GIC v3 DT binding describes a series of at least 3 physical (base
// addresses, size) pairs: the distributor interface (GICD), at least one
// redistributor region (GICR) containing dedicated redistributor
// interfaces for all individual CPUs, and the CPU interface (GICC).
// Under virtualization, we assume that the first redistributor region
// listed covers the boot CPU. Also, our GICv3 driver only supports the
// system register CPU interface, so we can safely ignore the MMIO version
// which is listed after the sequence of redistributor interfaces.
// This means we are only interested in the first two memory regions
// supplied, and ignore everything else.
//
ASSERT (RegSize >= 32);

// RegProp[0..1] == { GICD base, GICD size }
DistBase = SwapBytes64 (Reg[0]);
ASSERT (DistBase < MAX_UINTN);

// RegProp[2..3] == { GICR base, GICR size }
RedistBase = SwapBytes64 (Reg[2]);
ASSERT (RedistBase < MAX_UINTN);

PcdStatus = PcdSet64S (PcdGicDistributo

Re: [Qemu-devel] [PATCH v2 05/11] hw/arm/virt: GICv3 DT node with one or two redistributor regions

2018-06-20 Thread Laszlo Ersek
On 06/20/18 09:10, Auger Eric wrote:
> Hi Laszlo,
> 
> On 06/19/2018 09:02 PM, Ard Biesheuvel wrote:
>> On 19 June 2018 at 20:53, Laszlo Ersek  wrote:
>>> Hi Eric,
>>>
>>> sorry about the late followup. I have one question (mainly for Ard):
>>>
>>> On 06/15/18 16:28, Eric Auger wrote:
>>>> This patch allows the creation of a GICv3 node with 1 or 2
>>>> redistributor regions depending on the number of smp_cpus.
>>>> The second redistributor region is located just after the
>>>> existing RAM region, at 256GB, and contains up to 512 vcpus.
>>>>
>>>> Please refer to kernel documentation for further node details:
>>>> Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.txt
>>>>
>>>> Signed-off-by: Eric Auger 
>>>> Reviewed-by: Andrew Jones 
>>>>
>>>> ---
>>>> v1 (virt3.0) -> v2
>>>> - Added Drew's R-b
>>>>
>>>> v2 -> v3:
>>>> - VIRT_GIC_REDIST2 is now 64MB large, ie. 512 redistributor capacity
>>>> - virt_gicv3_redist_region_count does not test kvm_irqchip_in_kernel
>>>>   anymore
>>>> ---
>>>>  hw/arm/virt.c | 29 -
>>>>  include/hw/arm/virt.h | 14 ++
>>>>  2 files changed, 38 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>>>> index 2885d18..d9f72eb 100644
>>>> --- a/hw/arm/virt.c
>>>> +++ b/hw/arm/virt.c
>>>> @@ -148,6 +148,8 @@ static const MemMapEntry a15memmap[] = {
>>>>  [VIRT_PCIE_PIO] =   { 0x3eff, 0x0001 },
>>>>  [VIRT_PCIE_ECAM] =  { 0x3f00, 0x0100 },
>>>>  [VIRT_MEM] ={ 0x4000, RAMLIMIT_BYTES },
>>>> +/* Additional 64 MB redist region (can contain up to 512 
>>>> redistributors) */
>>>> +[VIRT_GIC_REDIST2] ={ 0x40ULL, 0x400 },
>>>>  /* Second PCIe window, 512GB wide at the 512GB boundary */
>>>>  [VIRT_PCIE_MMIO_HIGH] =   { 0x80ULL, 0x80ULL },
>>>>  };
>>>> @@ -401,13 +403,30 @@ static void fdt_add_gic_node(VirtMachineState *vms)
>>>>  qemu_fdt_setprop_cell(vms->fdt, "/intc", "#size-cells", 0x2);
>>>>  qemu_fdt_setprop(vms->fdt, "/intc", "ranges", NULL, 0);
>>>>  if (vms->gic_version == 3) {
>>>> +int nb_redist_regions = virt_gicv3_redist_region_count(vms);
>>>> +
>>>>  qemu_fdt_setprop_string(vms->fdt, "/intc", "compatible",
>>>>  "arm,gic-v3");
>>>> -qemu_fdt_setprop_sized_cells(vms->fdt, "/intc", "reg",
>>>> - 2, vms->memmap[VIRT_GIC_DIST].base,
>>>> - 2, vms->memmap[VIRT_GIC_DIST].size,
>>>> - 2, vms->memmap[VIRT_GIC_REDIST].base,
>>>> - 2, 
>>>> vms->memmap[VIRT_GIC_REDIST].size);
>>>> +
>>>> +qemu_fdt_setprop_cell(vms->fdt, "/intc",
>>>> +  "#redistributor-regions", 
>>>> nb_redist_regions);
>>>> +
>>>> +if (nb_redist_regions == 1) {
>>>> +qemu_fdt_setprop_sized_cells(vms->fdt, "/intc", "reg",
>>>> + 2, 
>>>> vms->memmap[VIRT_GIC_DIST].base,
>>>> + 2, 
>>>> vms->memmap[VIRT_GIC_DIST].size,
>>>> + 2, 
>>>> vms->memmap[VIRT_GIC_REDIST].base,
>>>> + 2, 
>>>> vms->memmap[VIRT_GIC_REDIST].size);
>>>> +} else {
>>>> +qemu_fdt_setprop_sized_cells(vms->fdt, "/intc", "reg",
>>>> + 2, 
>>>> vms->memmap[VIRT_GIC_DIST].base,
>>>> + 2, 
>>>> vms->memmap[VIRT_GIC_DIST].size,
>>>> + 2, 
>>>> vms->memmap[VIRT_GIC_REDIST].base,
>>>> + 

Re: [Qemu-devel] [PATCH v3 3/4] acpi: build TPM Physical Presence interface

2018-06-20 Thread Laszlo Ersek
On 06/20/18 16:35, Marc-André Lureau wrote:
> Hi
> 
> On Wed, Jun 20, 2018 at 4:08 PM, Michael S. Tsirkin  wrote:
>> On Tue, May 15, 2018 at 02:14:32PM +0200, Marc-André Lureau wrote:
>>> From: Stefan Berger 
>>>
>>> The TPM Physical Presence interface consists of an ACPI part, a shared
>>> memory part, and code in the firmware. Users can send messages to the
>>> firmware by writing a code into the shared memory through invoking the
>>> ACPI code. When a reboot happens, the firmware looks for the code and
>>> acts on it by sending sequences of commands to the TPM.
>>>
>>> This patch adds the ACPI code. It is similar to the one in EDK2 but doesn't
>>> assume that SMIs are necessary to use. It uses a similar datastructure for
>>> the shared memory as EDK2 does so that EDK2 and SeaBIOS could both make use
>>> of it. I extended the shared memory data structure with an array of 256
>>> bytes, one for each code that could be implemented. The array contains
>>> flags describing the individual codes. This decouples the ACPI 
>>> implementation
>>> from the firmware implementation.
>>>
>>> The underlying TCG specification is accessible from the following page.
>>>
>>> https://trustedcomputinggroup.org/tcg-physical-presence-interface-specification/
>>>
>>> This patch implements version 1.30.
>>>
>>> Signed-off-by: Stefan Berger 
>>>
>>> ---
>>>
>>> v4 (Marc-André):
>>>  - replace 'DerefOf (FUNC [N])' with a function, to fix Windows ACPI
>>> handling.
>>>  - replace 'return Package (..) {} ' with scoped variables, to fix
>>>Windows ACPI handling.
>>>
>>> v3:
>>>  - add support for PPI to CRB
>>>  - split up OperationRegion TPPI into two parts, one containing
>>>the registers (TPP1) and the other one the flags (TPP2); switched
>>>the order of the flags versus registers in the code
>>>  - adapted ACPI code to small changes to the array of flags where
>>>previous flag 0 was removed and now shifting right wasn't always
>>>necessary anymore
>>>
>>> v2:
>>>  - get rid of FAIL variable; function 5 was using it and always
>>>returns 0; the value is related to the ACPI function call not
>>>a possible failure of the TPM function call.
>>>  - extend shared memory data structure with per-opcode entries
>>>holding flags and use those flags to determine what to return
>>>to caller
>>>  - implement interface version 1.3
>>> ---
>>>  include/hw/acpi/tpm.h |  21 +++
>>>  hw/i386/acpi-build.c  | 294 +-
>>>  2 files changed, 314 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/hw/acpi/tpm.h b/include/hw/acpi/tpm.h
>>> index f79d68a77a..fc53f08827 100644
>>> --- a/include/hw/acpi/tpm.h
>>> +++ b/include/hw/acpi/tpm.h
>>> @@ -196,4 +196,25 @@ REG32(CRB_DATA_BUFFER, 0x80)
>>>  #define TPM_PPI_VERSION_NONE0
>>>  #define TPM_PPI_VERSION_1_301
>>>
>>> +struct tpm_ppi {
>>
>> The name violate the coding style.
> 
> That's easy to change. Stefan could do it on commit if the rest of the
> patch is unchanged.
>>
>>
>>> +uint8_t  func[256];  /* 0x000: per TPM function implementation 
>>> flags;
>>> +   set by BIOS */
>>> +/* whether function is blocked by BIOS settings; bits 0, 1, 2 */
>>> +#define TPM_PPI_FUNC_NOT_IMPLEMENTED (0 << 0)
>>> +#define TPM_PPI_FUNC_BIOS_ONLY   (1 << 0)
>>> +#define TPM_PPI_FUNC_BLOCKED (2 << 0)
>>> +#define TPM_PPI_FUNC_ALLOWED_USR_REQ (3 << 0)
>>> +#define TPM_PPI_FUNC_ALLOWED_USR_NOT_REQ (4 << 0)
>>> +#define TPM_PPI_FUNC_MASK(7 << 0)
>>> +uint8_t ppin;/* 0x100 : set by BIOS */
>>
>> Are you sure it's right? Below ints will all end up misaligned ...
> 
> Hmm. Sadly, we didn't noticed when doing the edk2 part either. If we
> change it in qemu, we will have to change it in edk2 as well

I don't see why the misalignment is a problem. AIUI functionally it
shouldn't be an issue, and performance is not critical.

We did make sure the struct was packed in edk2 too.
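To make the point concrete, here is the quoted layout modeled with ctypes; with packing enabled the fields land exactly at the offsets in the patch's comments (0x100, 0x101, ...) even though every uint32 member after `ppin` is misaligned, which is harmless for a memory-mapped config blob like this:

```python
import ctypes

class TpmPpi(ctypes.LittleEndianStructure):
    _pack_ = 1  # the QEMU_PACKED / #pragma pack(1) equivalent
    _fields_ = [
        ("func",      ctypes.c_uint8 * 256),   # 0x000
        ("ppin",      ctypes.c_uint8),         # 0x100
        ("ppip",      ctypes.c_uint32),        # 0x101 (misaligned)
        ("pprp",      ctypes.c_uint32),        # 0x105
        ("pprq",      ctypes.c_uint32),        # 0x109
        ("pprm",      ctypes.c_uint32),        # 0x10d
        ("lppr",      ctypes.c_uint32),        # 0x111
        ("fret",      ctypes.c_uint32),        # 0x115
        ("res1",      ctypes.c_uint8 * 0x40),  # 0x119
        ("next_step", ctypes.c_uint8),         # 0x159
    ]
```

Without `_pack_ = 1`, the uint32 members would be padded up to 4-byte boundaries and none of the offsets in the patch's comments would hold.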

Thanks,
Laszlo

> 
>>> +uint32_t ppip;   /* 0x101 : set by ACPI; not used */
>>> +uint32_t pprp;   /* 0x105 : response from TPM; set by BIOS */
>>> +uint32_t pprq;   /* 0x109 : opcode; set by ACPI */
>>> +uint32_t pprm;   /* 0x10d : parameter for opcode; set by ACPI 
>>> */
>>> +uint32_t lppr;   /* 0x111 : last opcode; set by BIOS */
>>> +uint32_t fret;   /* 0x115 : set by ACPI; not used */
>>> +uint8_t res1[0x40];  /* 0x119 : reserved for future use */
>>> +uint8_t next_step;   /* 0x159 : next step after reboot; set by 
>>> BIOS */
>>> +} QEMU_PACKED;
>>> +
>>>  #endif /* HW_ACPI_TPM_H */
>>
>> Igor could you pls take a quick look at the rest?
>>
>> --
>> MST
>>
> 
> thanks
> 
> 




Re: [Qemu-devel] [PATCH v3 2/4] acpi: add fw_cfg file for TPM and PPI virtual memory device

2018-06-21 Thread Laszlo Ersek
On 06/21/18 12:10, Marc-André Lureau wrote:

> What do you think Laszlo?

Apologies, I'm currently lacking the bandwidth to even understand the
question. I'm tagging this message for later; it'll take a while before
I get to it.

Thanks
Laszlo



Re: [Qemu-devel] [PATCH 1/2] sysbus: always allow explicit_ofw_unit_address() to override address generation

2018-06-25 Thread Laszlo Ersek
Hi Mark,

On 06/23/18 10:50, Mark Cave-Ayland wrote:
> Some SysBusDevices either use sysbus_init_mmio() without sysbus_mmio_map() or
> the first MMIO memory region doesn't represent the bus address, causing an 
> invalid
> firmware device path to be generated.
> 
> SysBusDeviceClass does provide a virtual explicit_ofw_unit_address() method 
> that
> can be used to override this process, but it is only considered as a fallback
> option meaning that any existing MMIO memory regions still take priority 
> whilst
> determining the firmware device address.
> 
> As any class defining explicit_ofw_unit_address() has explicitly requested a
> specialised behaviour then it should be used in preference to the default
> implementation rather than being used as a fallback.

I disagree about the last paragraph, when put like this. I don't
disagree with the *goal* of the patch, however the original
justification for explicit_ofw_unit_address() was different.

It was meant as a fallback for distinguishing sysbus devices when those
sysbus devices had neither MMIO nor PIO resources. The issue wasn't that
MMIO/PIO-based identification was not "right", the issue was that unique
identification was impossible in the absence of such resources. Please
see commit 0b336b3b98d8 ("hw/core: explicit OFW unit address callback
for SysBusDeviceClass", 2015-06-23).

I don't have anything against repurposing explicit_ofw_unit_address()
like this -- as long as you check that it doesn't change behavior for
existing devices -- it's just that we shouldn't justify the new purpose
with the original intent. The original intent was different.

I suggest stating, "we can have explicit_ofw_unit_address() take
priority in a backwards-compatible manner, because no sysbus device
currently has both explicit_ofw_unit_address() and MMIO/PIO resources".

(Obviously checking the validity of this statement is up to you; I'm
just suggesting what I'd see as one more precise explanation.)
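The compatibility argument can be sanity-checked with a toy model of sysbus_get_fw_dev_path() (a Python stand-in for the C code above, with the lookup order as a parameter): reordering only changes the result for a device that has *both* an explicit OFW address and MMIO/PIO resources.

```python
def fw_dev_path(name, mmio=None, pio=None, explicit=None, explicit_first=False):
    # Toy model of sysbus_get_fw_dev_path(): try each address source in
    # order and return the first one the device actually has.
    order = (["explicit", "mmio", "pio"] if explicit_first
             else ["mmio", "pio", "explicit"])
    for kind in order:
        if kind == "explicit" and explicit is not None:
            return "%s@%s" % (name, explicit)
        if kind == "mmio" and mmio is not None:
            return "%s@%x" % (name, mmio)
        if kind == "pio" and pio is not None:
            return "%s@i%04x" % (name, pio)
    return name
```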

Thanks,
Laszlo

> 
> Signed-off-by: Mark Cave-Ayland 
> ---
>  hw/core/sysbus.c | 15 +++
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
> index ecfb0cfc0e..1ee0c162f4 100644
> --- a/hw/core/sysbus.c
> +++ b/hw/core/sysbus.c
> @@ -293,16 +293,8 @@ static char *sysbus_get_fw_dev_path(DeviceState *dev)
>  {
>  SysBusDevice *s = SYS_BUS_DEVICE(dev);
>  SysBusDeviceClass *sbc = SYS_BUS_DEVICE_GET_CLASS(s);
> -/* for the explicit unit address fallback case: */
>  char *addr, *fw_dev_path;
>  
> -if (s->num_mmio) {
> -return g_strdup_printf("%s@" TARGET_FMT_plx, qdev_fw_name(dev),
> -   s->mmio[0].addr);
> -}
> -if (s->num_pio) {
> -return g_strdup_printf("%s@i%04x", qdev_fw_name(dev), s->pio[0]);
> -}
>  if (sbc->explicit_ofw_unit_address) {
>  addr = sbc->explicit_ofw_unit_address(s);
>  if (addr) {
> @@ -311,6 +303,13 @@ static char *sysbus_get_fw_dev_path(DeviceState *dev)
>  return fw_dev_path;
>  }
>  }
> +if (s->num_mmio) {
> +return g_strdup_printf("%s@" TARGET_FMT_plx, qdev_fw_name(dev),
> +   s->mmio[0].addr);
> +}
> +if (s->num_pio) {
> +return g_strdup_printf("%s@i%04x", qdev_fw_name(dev), s->pio[0]);
> +}
>  return g_strdup(qdev_fw_name(dev));
>  }
>  
> 




Re: [Qemu-devel] [PATCH v3 2/4] acpi: add fw_cfg file for TPM and PPI virtual memory device

2018-06-25 Thread Laszlo Ersek
On 06/21/18 12:10, Marc-André Lureau wrote:
> Hi
> 
> On Thu, Jun 21, 2018 at 12:00 PM, Igor Mammedov  wrote:
>> On Tue, 15 May 2018 14:14:31 +0200
>> Marc-André Lureau  wrote:
>>
>>> From: Stefan Berger 
>>>
>>> To avoid having to hard code the base address of the PPI virtual
>>> memory device we introduce a fw_cfg file etc/tpm/config that holds the
>>> base address of the PPI device, the version of the PPI interface and
>>> the version of the attached TPM.
>> is it related to TPM_PPI_ADDR_BASE added in previous patch?
>>
>>>
>>> Signed-off-by: Stefan Berger 
>>> [ Marc-André: renamed to etc/tpm/config, made it static, document it ]
>>> Signed-off-by: Marc-André Lureau 
>>> ---
>>>  include/hw/acpi/tpm.h |  3 +++
>>>  hw/i386/acpi-build.c  | 17 +
>>>  docs/specs/tpm.txt| 20 
>>>  3 files changed, 40 insertions(+)
>>>
>>> diff --git a/include/hw/acpi/tpm.h b/include/hw/acpi/tpm.h
>>> index c082df7d1d..f79d68a77a 100644
>>> --- a/include/hw/acpi/tpm.h
>>> +++ b/include/hw/acpi/tpm.h
>>> @@ -193,4 +193,7 @@ REG32(CRB_DATA_BUFFER, 0x80)
>>>  #define TPM_PPI_ADDR_SIZE   0x400
>>>  #define TPM_PPI_ADDR_BASE   0xFED45000
>>>
>>> +#define TPM_PPI_VERSION_NONE0
>>> +#define TPM_PPI_VERSION_1_301
>>> +
>>>  #endif /* HW_ACPI_TPM_H */
>>> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
>>> index 9bc6d97ea1..f6d447f03a 100644
>>> --- a/hw/i386/acpi-build.c
>>> +++ b/hw/i386/acpi-build.c
>>> @@ -119,6 +119,12 @@ typedef struct AcpiBuildPciBusHotplugState {
>>>  bool pcihp_bridge_en;
>>>  } AcpiBuildPciBusHotplugState;
>>>
>>> +typedef struct FWCfgTPMConfig {
>>> +uint32_t tpmppi_address;
>>> +uint8_t tpm_version;
>>> +uint8_t tpmppi_version;
>>> +} QEMU_PACKED FWCfgTPMConfig;
>>> +
>>>  static void init_common_fadt_data(Object *o, AcpiFadtData *data)
>>>  {
>>>  uint32_t io = object_property_get_uint(o, ACPI_PM_PROP_PM_IO_BASE, 
>>> NULL);
>>> @@ -2873,6 +2879,7 @@ void acpi_setup(void)
>>>  AcpiBuildTables tables;
>>>  AcpiBuildState *build_state;
>>>  Object *vmgenid_dev;
>>> +static FWCfgTPMConfig tpm_config;
>>>
>>>  if (!pcms->fw_cfg) {
>>>  ACPI_BUILD_DPRINTF("No fw cfg. Bailing out.\n");
>>> @@ -2907,6 +2914,16 @@ void acpi_setup(void)
>>>  fw_cfg_add_file(pcms->fw_cfg, ACPI_BUILD_TPMLOG_FILE,
>>>  tables.tcpalog->data, acpi_data_len(tables.tcpalog));
>>>
>>> +if (tpm_find()) {
>>> +tpm_config = (FWCfgTPMConfig) {
>>> +.tpmppi_address = cpu_to_le32(TPM_PPI_ADDR_BASE),
>>> +.tpm_version = cpu_to_le32(tpm_get_version(tpm_find())),
>>> +.tpmppi_version = cpu_to_le32(TPM_PPI_VERSION_NONE)
>>> +};
>>> +fw_cfg_add_file(pcms->fw_cfg, "etc/tpm/config",
>>> +&tpm_config, sizeof tpm_config);
>>> +}
>> why it's in ACPI part of the code, shouldn't it be a part of device,
>> could TPM be used without ACPI at all (-noacpi CLI option)?
>>
>> Wouldn't adding fwcfg entry unconditionally break migration?
> 
> Because of unstable entry IDs? that could be problematic. (especially
> during boot time) What do you think Laszlo?
> 
> I guess we could have a "ppi" device property, that would imply having
> the etc/tpm/config fw_cfg entry. We would enable it by default in
> newer machine types (3.0?)

Can we perhaps draw a parallel with "-device vmcoreinfo" here? For that
device model, fw_cfg_add_file_callback() is called in the realize
function, vmcoreinfo_realize(). If libvirt generates the identical
cmdline on both ends of the migration, and uses the same machine type, I
think the fw_cfg selectors should end up the same on both sides.
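The selector-stability argument can be pictured with a toy model. It assumes -- as I believe modern QEMU does, though this is worth verifying -- that the fw_cfg file directory is kept sorted by name, so a file's selector depends only on the set of files present, not on registration order:

```python
FW_CFG_FILE_FIRST = 0x20  # illustrative base selector

def assign_selectors(filenames):
    # Selector = base + index into the name-sorted directory; two VMs
    # created with the same file set get identical selectors even if
    # their devices registered the files in different orders.
    return {name: FW_CFG_FILE_FIRST + i
            for i, name in enumerate(sorted(filenames))}
```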

Thanks
Laszlo

>>
>>> +
>>>  vmgenid_dev = find_vmgenid_dev();
>>>  if (vmgenid_dev) {
>>>  vmgenid_add_fw_cfg(VMGENID(vmgenid_dev), pcms->fw_cfg,
>>> diff --git a/docs/specs/tpm.txt b/docs/specs/tpm.txt
>>> index c230c4c93e..2ddb768084 100644
>>> --- a/docs/specs/tpm.txt
>>> +++ b/docs/specs/tpm.txt
>>> @@ -20,6 +20,26 @@ QEMU files related to TPM TIS interface:
>>>   - hw/tpm/tpm_tis.h
>>>
>>>
>>> += fw_cfg interface =
>>> +
>>> +The bios/firmware may use the "etc/tpm/config" fw_cfg entry for
>>> +configuring the guest appropriately.
>>> +
>>> +The entry of 6 bytes has the following content, in little-endian:
>>> +
>>> +#define TPM_VERSION_UNSPEC  0
>>> +#define TPM_VERSION_1_2 1
>>> +#define TPM_VERSION_2_0 2
>>> +
>>> +#define TPM_PPI_VERSION_NONE0
>>> +#define TPM_PPI_VERSION_1_301
>>> +
>>> +struct FWCfgTPMConfig {
>>> +uint32_t tpmppi_address; /* PPI memory location */
>>> +uint8_t tpm_version; /* TPM version */
>>> +uint8_t tpmppi_version;  /* PPI version */
>>> +};
>>> +
>>>  = ACPI Interface =
>>>
>>>  The TPM device is defined with ACPI ID "PNP0C31". QEMU builds a SSDT and 
>>> passes
>>
>>
> 
> 
> 




Re: [Qemu-devel] [PATCH v3 2/4] acpi: add fw_cfg file for TPM and PPI virtual memory device

2018-06-26 Thread Laszlo Ersek
On 06/26/18 12:38, Marc-André Lureau wrote:
> Hi
> 
> On Mon, Jun 25, 2018 at 5:20 PM, Laszlo Ersek  wrote:
>> On 06/21/18 12:10, Marc-André Lureau wrote:
>>> Hi
>>>
>>> On Thu, Jun 21, 2018 at 12:00 PM, Igor Mammedov  wrote:
>>>> On Tue, 15 May 2018 14:14:31 +0200
>>>> Marc-André Lureau  wrote:
>>>>
>>>>> From: Stefan Berger 
>>>>>
>>>>> To avoid having to hard code the base address of the PPI virtual
>>>>> memory device we introduce a fw_cfg file etc/tpm/config that holds the
>>>>> base address of the PPI device, the version of the PPI interface and
>>>>> the version of the attached TPM.
>>>> is it related to TPM_PPI_ADDR_BASE added in previous patch?
>>>>
>>>>>
>>>>> Signed-off-by: Stefan Berger 
>>>>> [ Marc-André: renamed to etc/tpm/config, made it static, document it ]
>>>>> Signed-off-by: Marc-André Lureau 
>>>>> ---
>>>>>  include/hw/acpi/tpm.h |  3 +++
>>>>>  hw/i386/acpi-build.c  | 17 +
>>>>>  docs/specs/tpm.txt| 20 
>>>>>  3 files changed, 40 insertions(+)
>>>>>
>>>>> diff --git a/include/hw/acpi/tpm.h b/include/hw/acpi/tpm.h
>>>>> index c082df7d1d..f79d68a77a 100644
>>>>> --- a/include/hw/acpi/tpm.h
>>>>> +++ b/include/hw/acpi/tpm.h
>>>>> @@ -193,4 +193,7 @@ REG32(CRB_DATA_BUFFER, 0x80)
>>>>>  #define TPM_PPI_ADDR_SIZE   0x400
>>>>>  #define TPM_PPI_ADDR_BASE   0xFED45000
>>>>>
>>>>> +#define TPM_PPI_VERSION_NONE0
>>>>> +#define TPM_PPI_VERSION_1_301
>>>>> +
>>>>>  #endif /* HW_ACPI_TPM_H */
>>>>> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
>>>>> index 9bc6d97ea1..f6d447f03a 100644
>>>>> --- a/hw/i386/acpi-build.c
>>>>> +++ b/hw/i386/acpi-build.c
>>>>> @@ -119,6 +119,12 @@ typedef struct AcpiBuildPciBusHotplugState {
>>>>>  bool pcihp_bridge_en;
>>>>>  } AcpiBuildPciBusHotplugState;
>>>>>
>>>>> +typedef struct FWCfgTPMConfig {
>>>>> +uint32_t tpmppi_address;
>>>>> +uint8_t tpm_version;
>>>>> +uint8_t tpmppi_version;
>>>>> +} QEMU_PACKED FWCfgTPMConfig;
>>>>> +
>>>>>  static void init_common_fadt_data(Object *o, AcpiFadtData *data)
>>>>>  {
>>>>>  uint32_t io = object_property_get_uint(o, ACPI_PM_PROP_PM_IO_BASE, 
>>>>> NULL);
>>>>> @@ -2873,6 +2879,7 @@ void acpi_setup(void)
>>>>>  AcpiBuildTables tables;
>>>>>  AcpiBuildState *build_state;
>>>>>  Object *vmgenid_dev;
>>>>> +static FWCfgTPMConfig tpm_config;
>>>>>
>>>>>  if (!pcms->fw_cfg) {
>>>>>  ACPI_BUILD_DPRINTF("No fw cfg. Bailing out.\n");
>>>>> @@ -2907,6 +2914,16 @@ void acpi_setup(void)
>>>>>  fw_cfg_add_file(pcms->fw_cfg, ACPI_BUILD_TPMLOG_FILE,
>>>>>  tables.tcpalog->data, acpi_data_len(tables.tcpalog));
>>>>>
>>>>> +if (tpm_find()) {
>>>>> +tpm_config = (FWCfgTPMConfig) {
>>>>> +.tpmppi_address = cpu_to_le32(TPM_PPI_ADDR_BASE),
>>>>> +.tpm_version = cpu_to_le32(tpm_get_version(tpm_find())),
>>>>> +.tpmppi_version = cpu_to_le32(TPM_PPI_VERSION_NONE)
>>>>> +};
>>>>> +fw_cfg_add_file(pcms->fw_cfg, "etc/tpm/config",
>>>>> +&tpm_config, sizeof tpm_config);
>>>>> +}
>>>> why is it in the ACPI part of the code? Shouldn't it be part of the
>>>> device? Could TPM be used without ACPI at all (-no-acpi CLI option)?
>>>>
>>>> Wouldn't adding the fw_cfg entry unconditionally break migration?
>>>
>>> Because of unstable entry IDs? That could be problematic (especially
>>> during boot time). What do you think, Laszlo?
>>>
>>> I guess we could have a "ppi" device property, that would imply having
>>> the etc/tpm/config fw_cfg entry. We would enable it by default in
>>> newer machine types (3.0?)
>>
>> Can we perhaps draw a parallel wit

Re: [Qemu-devel] [PATCH v3 4/4] tpm: add a fake ACPI memory clear interface

2018-06-26 Thread Laszlo Ersek
On 06/26/18 14:34, Igor Mammedov wrote:
> On Tue, 26 Jun 2018 11:22:26 +0200
> Marc-André Lureau  wrote:
> 
>> On Thu, Jun 21, 2018 at 4:33 PM, Igor Mammedov  wrote:
>>> On Thu, 21 Jun 2018 15:24:44 +0200
>>> Marc-André Lureau  wrote:
>>>  
 Hi

 On Thu, Jun 21, 2018 at 3:02 PM, Igor Mammedov  
 wrote:  
> On Tue, 15 May 2018 14:14:33 +0200
> Marc-André Lureau  wrote:
>  
>> This allows passing the last failing test from the Windows HLK TPM 2.0
>> TCG PPI 1.3 tests.
>>
>> The interface is described in the "TCG Platform Reset Attack
>> Mitigation Specification", chapter 6 "ACPI _DSM Function". Whether or
>> not we should have a real implementation remains an open question to me. 
>>  
> might it cause security issues?  

 Good question. If the guest assumes success of this operation perhaps.
 I'll check the spec.
  
> What are the implications of faking it, and how hard is it to implement the
> thing per spec?  

 Laszlo answered that in "[Qemu-devel] investigating TPM for
 OVMF-on-QEMU"  2f2b) TCG Memory Clear Interface  
>>> I get that it's optional, but we probably shouldn't advertise/fake a
>>> feature if it's not supported.  
>>
>> As said in the commit message, the objective was to pass the Windows
>> HLK test. If we don't want to advertise a fake interface, I am fine
>> dropping this patch. We'll have to revisit with Laszlo the work needed
>> in the firmware to support it.
> I think it would be safer to drop this patch.

This is BTW a feature that's very difficult for OVMF to implement, but
(I think) near trivial for QEMU to implement. The feature is about
clearing all of the guest RAM to zero at reboot.

For the firmware, it's difficult to solve, because in the 32-bit PEI
phase, we don't map DRAM beyond 4GB, so we can't re-set memory to zero
via normal addressing. (For physical platforms, this is different,
because their PEI phases have PEI modules that initialize the memory
controller(s), so they have platform-specific means to clear RAM.) For
QEMU, on the other hand, the feature "shouldn't be hard (TM)", just
implement a reset handler that clears all RAMBlocks on the host side (or
some such :) ).

Thanks,
Laszlo

> 
> 
>>>  
  
>
>  
>> Signed-off-by: Marc-André Lureau 
>> ---
>>  hw/i386/acpi-build.c | 9 +
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
>> index 95be4f0710..392a1e50bd 100644
>> --- a/hw/i386/acpi-build.c
>> +++ b/hw/i386/acpi-build.c
>> @@ -2072,6 +2072,15 @@ build_tpm_ppi(Aml *dev)
>>  aml_append(ifctx, aml_return(aml_buffer(1, zerobyte)));
>>  }
>>  aml_append(method, ifctx);
>> +
>> +   /* dummy MOR Memory Clear for the sake of WLK PPI test */
>> +ifctx = aml_if(
>> +aml_equal(aml_arg(0),
>> +  
>> aml_touuid("376054ED-CC13-4675-901C-4756D7F2D45D")));
>> +{
>> +aml_append(ifctx, aml_return(aml_int(0)));
>> +}
>> +aml_append(method, ifctx);
>>  }
>>  aml_append(dev, method);
>>  }  
>
>  


  
>>>  
>>
>>
>>
> 




Re: [Qemu-devel] [PATCH v5 2/4] tpm: implement virtual memory device for TPM PPI

2018-06-27 Thread Laszlo Ersek
On 06/27/18 17:05, Igor Mammedov wrote:
> On Wed, 27 Jun 2018 10:36:52 -0400
> Stefan Berger  wrote:
> 
>> On 06/27/2018 10:19 AM, Igor Mammedov wrote:
>>> On Wed, 27 Jun 2018 08:53:28 -0400
>>> Stefan Berger  wrote:
>>>  
 On 06/27/2018 07:44 AM, Igor Mammedov wrote:  
> On Tue, 26 Jun 2018 14:23:41 +0200
> Marc-André Lureau  wrote:
> 
>> From: Stefan Berger 
>>
>> Implement a virtual memory device for the TPM Physical Presence 
>> interface.
>> The memory is located at 0xFED45000 and used by ACPI to send messages to 
>> the
>> firmware (BIOS) and by the firmware to provide parameters for each one of
>> the supported codes.
>>
>> This device should be used by all TPM interfaces on x86 and can be added
>> by calling tpm_ppi_init_io().
>>
>> Signed-off-by: Stefan Berger 
>> Signed-off-by: Marc-André Lureau 
>>
>> ---
>>
>> v4 (Marc-André):
>>- pass TPM_PPI_ADDR_BASE as argument to tpm_ppi_init_io()
>>- only enable PPI if property is set
>>
>> v3 (Marc-André):
>>- merge CRB support
>>- use trace events instead of DEBUG printf
>>- headers inclusion simplification
>>
>> v2:
>>- moved to byte access since an infrequently used device;
>>  this simplifies code
>>- increase size of device to 0x400
>>- move device to 0xfffef000 since SeaBIOS has some code at 0xffff0000:
>>  'On the emulators, the bios at 0xf0000 is also at 0xffff0000'
>> ---
>>hw/tpm/tpm_ppi.h  | 27 
>>include/hw/acpi/tpm.h |  6 +
>>hw/tpm/tpm_crb.c  |  7 ++
>>hw/tpm/tpm_ppi.c  | 57 +++
>>hw/tpm/tpm_tis.c  |  7 ++
>>hw/tpm/Makefile.objs  |  2 +-
>>hw/tpm/trace-events   |  4 +++
>>7 files changed, 109 insertions(+), 1 deletion(-)
>>create mode 100644 hw/tpm/tpm_ppi.h
>>create mode 100644 hw/tpm/tpm_ppi.c
>>
>> diff --git a/hw/tpm/tpm_ppi.h b/hw/tpm/tpm_ppi.h
>> new file mode 100644
>> index 00..ac7ad47238
>> --- /dev/null
>> +++ b/hw/tpm/tpm_ppi.h
>> @@ -0,0 +1,27 @@
>> +/*
>> + * TPM Physical Presence Interface
>> + *
>> + * Copyright (C) 2018 IBM Corporation
>> + *
>> + * Authors:
>> + *  Stefan Berger
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or 
>> later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +#ifndef TPM_TPM_PPI_H
>> +#define TPM_TPM_PPI_H
>> +
>> +#include "hw/acpi/tpm.h"
>> +#include "exec/address-spaces.h"
>> +
>> +typedef struct TPMPPI {
>> +MemoryRegion mmio;
>> +
>> +uint8_t ram[TPM_PPI_ADDR_SIZE];
>> +} TPMPPI;  
> I probably miss something obvious here,
> 1st:
> commit message says that the memory region is supposed to be an interface
> between FW and OSPM (i.e. totally guest internal thingy).
> So question is:
> why do we register memory region at all?  
 One reason for the device itself was being able to debug the interaction
 of the guest with ACPI though I had additional instrumentation for that
 showing register contents.
 We need it to have some memory in the region where we place it. I
 suppose a memory_region_init_ram() would provide migration support
 automatically but cannot be used on memory where we have
 MemoryRegionOps. So we could drop most parts of the device and only run
 memory_region_init_ram() ?  
>>> if QEMU doesn't need to touch it ever, you could do even better.  
>>
>> QEMU does indirectly touch it in 4/4 where we define the 
>> OperationRegion()s and need to know where they are located in memory. We 
>> could read the base address that is now TPM_PPI_ADDR_BASE from a hard 
>> coded memory location
> that's done for you by bios_linker_loader_add_pointer() when
> ACPI tables are installed by FW.
> 
>> and pass it into OperationRegion(), but I doubt we 
>> would want that.
>>
>> +aml_append(dev,
>> +   aml_operation_region("TPP1", AML_SYSTEM_MEMORY,
>> +aml_int(TPM_PPI_ADDR_BASE), 0x100));
>>
>>
>> +aml_append(dev,
>> +   aml_operation_region("TPP2", AML_SYSTEM_MEMORY,
>> +aml_int(TPM_PPI_ADDR_BASE + 0x100),
>> +0x5A));
> that's possible, usually it works as dynamic memory region where region
> lives within scope of a method.
> 
> but scratch it.
> As Andre pointed out reserved memory should stay at the same place
> across reboots and might be needed before ACPI tables are installed,
> which probably is impossible.
> 
> CCing Laszlo just in case if I'm wrong.

Stability of reserved memory areas is only guaranteed across S3 resume.
Through a normal reboot, all DRAM is consi

Re: [Qemu-devel] [PATCH] hw/arm: Add SBSA machine type

2018-06-28 Thread Laszlo Ersek
On 06/27/18 12:13, Hongbo Zhang wrote:
> This patch introduces a new Arm machine type 'SBSA' with features:
>  - Based on legacy 'virt' machine type.

My understanding is that this new machine type serves a different use
case than the "virt" machine type; i.e. it is not primarily meant to run
on KVM and execute virtualization workloads. Instead, it is meant to
provide an environment as faithful as possible to physical hardware, for
supporting firmware and OS development for physical ARM64 machines.

In other words, this machine type is not a "goal" for running production
workloads (on KVM); instead it is a development and testbed "means",
where the end-goal is writing OS and firmware code for physical machines
that conform to the SBSA spec. Thus, the machine type is similar to e.g.
the "vexpress" machine types, except the emulated hardware platform is
"SBSA".

Can you please confirm that?

If that's the case, then please remove the word "legacy" from the commit
message (multiple instances). This machine type does not obsolete or
replace "virt", for "virt"'s stated use case; this machine type is for
an independent use case.

One consequence of this would be that performance isn't as important.

Another consequence is that paravirt devices are less desirable.

I have another question:

>  - Newly designed memory map.
>  - EL2 and EL3 are enabled by default.
>  - AHCI controller attached to system bus, and then CDROM and hard disc
>can be added to it.
>  - EHCI controller attached to system bus, with USB mouse and key board
>installed by default.
>  - E1000 ethernet card on PCIE bus.
>  - VGA display adaptor on PCIE bus.
>  - Default CPU type cortex-a57, 4 cores, and 1G bytes memory.
>  - No virtio functions enabled, since this is to emulate real hardware.
> 
> This is a prototype; more features can be added in the future.
> 
> The purpose of this is to have a standard QEMU platform for Arm firmware
> development etc., where the legacy machines cannot meet the requirements.
> 
> Arm Trusted Firmware and UEFI porting to this are done separately.
> 
> Signed-off-by: Hongbo Zhang 
> ---
>  hw/arm/virt-acpi-build.c |  59 +-
>  hw/arm/virt.c| 196 
> ++-
>  include/hw/arm/virt.h|   3 +
>  3 files changed, 254 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 6ea47e2..60af414 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -84,6 +84,52 @@ static void acpi_dsdt_add_uart(Aml *scope, const 
> MemMapEntry *uart_memmap,
>  aml_append(scope, dev);
>  }
>  
> +static void acpi_dsdt_add_ahci(Aml *scope, const MemMapEntry *ahci_memmap,
> +   uint32_t ahci_irq)
> +{
> +Aml *dev = aml_device("AHC0");
> +aml_append(dev, aml_name_decl("_HID", aml_string("LNRO001E")));
> +aml_append(dev, aml_name_decl("_UID", aml_int(0)));
> +aml_append(dev, aml_name_decl("_CCA", aml_int(1)));
> +
> +Aml *crs = aml_resource_template();
> +aml_append(crs, aml_memory32_fixed(ahci_memmap->base,
> +   ahci_memmap->size, AML_READ_WRITE));
> +aml_append(crs,
> +   aml_interrupt(AML_CONSUMER, AML_LEVEL, AML_ACTIVE_HIGH,
> + AML_EXCLUSIVE, &ahci_irq, 1));
> +aml_append(dev, aml_name_decl("_CRS", crs));
> +
> +Aml *pkg = aml_package(3);
> +aml_append(pkg, aml_int(0x1));
> +aml_append(pkg, aml_int(0x6));
> +aml_append(pkg, aml_int(0x1));
> +
> +/* append the SATA class id */
> +aml_append(dev, aml_name_decl("_CLS", pkg));
> +
> +aml_append(scope, dev);
> +}
> +
> +static void acpi_dsdt_add_ehci(Aml *scope, const MemMapEntry *ehci_memmap,
> +   uint32_t ehci_irq)
> +{
> +Aml *dev = aml_device("EHC0");
> +aml_append(dev, aml_name_decl("_HID", aml_string("PNP0D20")));
> +aml_append(dev, aml_name_decl("_UID", aml_int(0)));
> +aml_append(dev, aml_name_decl("_CCA", aml_int(1)));
> +
> +Aml *crs = aml_resource_template();
> +aml_append(crs, aml_memory32_fixed(ehci_memmap->base,
> +   ehci_memmap->size, AML_READ_WRITE));
> +aml_append(crs,
> +   aml_interrupt(AML_CONSUMER, AML_LEVEL, AML_ACTIVE_HIGH,
> + AML_EXCLUSIVE, &ehci_irq, 1));
> +aml_append(dev, aml_name_decl("_CRS", crs));
> +
> +aml_append(scope, dev);
> +}
> +
>  static void acpi_dsdt_add_fw_cfg(Aml *scope, const MemMapEntry 
> *fw_cfg_memmap)
>  {
>  Aml *dev = aml_device("FWCF");
> @@ -768,14 +814,23 @@ build_dsdt(GArray *table_data, BIOSLinker *linker, 
> VirtMachineState *vms)
> (irqmap[VIRT_UART] + ARM_SPI_BASE));
>  acpi_dsdt_add_flash(scope, &memmap[VIRT_FLASH]);
>  acpi_dsdt_add_fw_cfg(scope, &memmap[VIRT_FW_CFG]);
> -acpi_dsdt_add_virtio(scope, &memmap[VIRT_MMIO],
> -  

Re: [Qemu-devel] [PATCH 1/2] sysbus: always allow explicit_ofw_unit_address() to override address generation

2018-06-28 Thread Laszlo Ersek
On 06/27/18 21:59, Mark Cave-Ayland wrote:
> On 25/06/18 08:32, Laszlo Ersek wrote:
> 
> Hi Laszlo,
> 
>>> As any class defining explicit_ofw_unit_address() has explicitly
>>> requested a
>>> specialised behaviour then it should be used in preference to the
>>> default
>>> implementation rather than being used as a fallback.
>>
>> I disagree about the last paragraph, when put like this. I don't
>> disagree with the *goal* of the patch, however the original
>> justification for explicit_ofw_unit_address() was different.
>>
>> It was meant as a fallback for distinguishing sysbus devices when those
>> sysbus devices had neither MMIO nor PIO resources. The issue wasn't that
>> MMIO/PIO-based identification was not "right", the issue was that unique
>> identification was impossible in the absence of such resources. Please
>> see commit 0b336b3b98d8 ("hw/core: explicit OFW unit address callback
>> for SysBusDeviceClass", 2015-06-23).
>>
>> I don't have anything against repurposing explicit_ofw_unit_address()
>> like this -- as long as you check that it doesn't change behavior for
>> existing devices -- it's just that we shouldn't justify the new purpose
>> with the original intent. The original intent was different.
>>
>> I suggest stating, "we can have explicit_ofw_unit_address() take
>> priority in a backwards-compatible manner, because no sysbus device
>> currently has both explicit_ofw_unit_address() and MMIO/PIO resources".
> 
> Thanks for the feedback, I'm more than happy to update the commit
> message to better describe the original intent of the patch. How does
> the following sound to you?
> 
> 
> Some SysBusDevices either use sysbus_init_mmio() without
> sysbus_mmio_map() or the first MMIO memory region doesn't represent the
> bus address, causing a firmware device path with an invalid address to
> be generated.
> 
> SysBusDeviceClass does provide a virtual explicit_ofw_unit_address()
> method that can be used to override this process, but it is only
> considered as a fallback option meaning that any existing MMIO memory
> regions still take priority whilst determining the firmware device address.

s/is only considered as/was originally intended only as/

and then it looks great to me.

Thank you!
Laszlo

> There is currently only one user of explicit_ofw_unit_address() and that
> is the PCI expander bridge (PXB) device which has no MMIO/PIO resources
> defined. This enables us to allow explicit_ofw_unit_address() to take
> priority without affecting backwards compatibility, allowing the address
> to be customised as required.
> 
>> (Obviously checking the validity of this statement is up to you; I'm
>> just suggesting what I'd see as one more precise explanation.)
>  Yes, it seems correct to me - grep tells me the PXB device is the only
> user of explicit_ofw_unit_address() in the whole code base, and there
> are no sysbus_init_*() functions anywhere within pci_expander_bridge.c.
> 
> 
> ATB,
> 
> Mark.




Re: [Qemu-devel] [PATCH v2 1/2] sysbus: always allow explicit_ofw_unit_address() to override address generation

2018-06-29 Thread Laszlo Ersek
On 06/29/18 15:56, Mark Cave-Ayland wrote:
> Some SysBusDevices either use sysbus_init_mmio() without
> sysbus_mmio_map() or the first MMIO memory region doesn't represent the
> bus address, causing a firmware device path with an invalid address to
> be generated.
> 
> SysBusDeviceClass does provide a virtual explicit_ofw_unit_address()
> method that can be used to override this process, but it was originally
> intended only as a fallback option, meaning that any existing MMIO memory
> regions still take priority whilst determining the firmware device address.
> 
> There is currently only one user of explicit_ofw_unit_address() and that
> is the PCI expander bridge (PXB) device which has no MMIO/PIO resources
> defined. This enables us to allow explicit_ofw_unit_address() to take
> priority without affecting backwards compatibility, allowing the address
> to be customised as required.
> 
> Signed-off-by: Mark Cave-Ayland 
> ---
>  hw/core/sysbus.c | 15 +++
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
> index ecfb0cfc0e..1ee0c162f4 100644
> --- a/hw/core/sysbus.c
> +++ b/hw/core/sysbus.c
> @@ -293,16 +293,8 @@ static char *sysbus_get_fw_dev_path(DeviceState *dev)
>  {
>  SysBusDevice *s = SYS_BUS_DEVICE(dev);
>  SysBusDeviceClass *sbc = SYS_BUS_DEVICE_GET_CLASS(s);
> -/* for the explicit unit address fallback case: */
>  char *addr, *fw_dev_path;
>  
> -if (s->num_mmio) {
> -return g_strdup_printf("%s@" TARGET_FMT_plx, qdev_fw_name(dev),
> -   s->mmio[0].addr);
> -}
> -if (s->num_pio) {
> -return g_strdup_printf("%s@i%04x", qdev_fw_name(dev), s->pio[0]);
> -}
>  if (sbc->explicit_ofw_unit_address) {
>  addr = sbc->explicit_ofw_unit_address(s);
>  if (addr) {
> @@ -311,6 +303,13 @@ static char *sysbus_get_fw_dev_path(DeviceState *dev)
>  return fw_dev_path;
>  }
>  }
> +if (s->num_mmio) {
> +return g_strdup_printf("%s@" TARGET_FMT_plx, qdev_fw_name(dev),
> +   s->mmio[0].addr);
> +}
> +if (s->num_pio) {
> +return g_strdup_printf("%s@i%04x", qdev_fw_name(dev), s->pio[0]);
> +}
>  return g_strdup(qdev_fw_name(dev));
>  }
>  
> 


Reviewed-by: Laszlo Ersek 


