Re: [Qemu-devel] [Bug 1661386] Re: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed

2017-07-23 Thread Matwey V. Kornilov
2017-02-08 11:49 GMT+03:00 Paolo Bonzini :
>> Does qemu follow recommendations from section 4.3?
>
> All that QEMU does is initialize MSR values and QEMU is talking to KVM,
> not to the processor; KVM in turn talks to the host kernel's perf
> subsystem.
>
> It's the host kernel's perf subsystem that needs to follow Intel's
> recommendation.  In particular, QEMU is setting CPUID to the values
> retrieved by
>
> perf_get_x86_pmu_capability(&cap);

I can not find this function mentioned in qemu master sources.

The only thing I see is that has_msr_architectural_pmu is set to be
true in kvm_arch_init_vcpu() if 0xA EAX has non-zero version. This is
not enough according to the Intel specs.

>
> so perhaps it's perf_get_x86_pmu_capability that misreads the
> performance monitoring capabilities provided by ESX.  Please attach
> dmesg logs from starting the host with loglevel=9, as well as "x86info
> -a" output from the host, to see if perf misses some problematic
> CPUID/MSR combination.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1661386
>
> Title:
>   Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed
>
> Status in QEMU:
>   New
>
> Bug description:
>   Hello,
>
>
>   I see the following when try to run qemu from master as the following:
>
>   # ./x86_64-softmmu/qemu-system-x86_64 --version
>   QEMU emulator version 2.8.50 (v2.8.0-1006-g4e9f524)
>   Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers
>   # ./x86_64-softmmu/qemu-system-x86_64 -machine accel=kvm -nodefaults
>   -no-reboot -nographic -cpu host -vga none  -kernel .build.kernel.kvm
>   -initrd .build.initrd.kvm -append 'panic=1 no-kvmclock console=ttyS0
>   loglevel=7' -m 1024 -serial stdio
>   qemu-system-x86_64: /home/matwey/lab/qemu/target/i386/kvm.c:1849:
>   kvm_put_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
>
>   First broken commit has been bisected:
>
>   commit 48e1a45c3166d659f781171a47dabf4a187ed7a5
>   Author: Paolo Bonzini 
>   Date:   Wed Mar 30 22:55:29 2016 +0200
>
>   target-i386: assert that KVM_GET/SET_MSRS can set all requested MSRs
>
>   This would have caught the bug in the previous patch.
>
>   Signed-off-by: Paolo Bonzini 
>
>   My cpuinfo is the following:
>
>   processor   : 0
>   vendor_id   : GenuineIntel
>   cpu family  : 6
>   model   : 44
>   model name  : Intel(R) Xeon(R) CPU   X5675  @ 3.07GHz
>   stepping: 2
>   microcode   : 0x14
>   cpu MHz : 3066.775
>   cache size  : 12288 KB
>   physical id : 0
>   siblings: 2
>   core id : 0
>   cpu cores   : 2
>   apicid  : 0
>   initial apicid  : 0
>   fpu : yes
>   fpu_exception   : yes
>   cpuid level : 11
>   wp  : yes
>   flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx rdtscp lm 
> constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc 
> aperfmperf pni pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 popcnt aes hypervisor 
> lahf_lm ida arat epb dtherm tpr_shadow vnmi ept vpid
>   bugs:
>   bogomips: 6133.55
>   clflush size: 64
>   cache_alignment : 64
>   address sizes   : 40 bits physical, 48 bits virtual
>   power management:
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/qemu/+bug/1661386/+subscriptions


-- 
With best regards,
Matwey V. Kornilov

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1661386

Title:
  Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed

Status in QEMU:
  New

Bug description:
  Hello,

  
  I see the following when try to run qemu from master as the following:

  # ./x86_64-softmmu/qemu-system-x86_64 --version
  QEMU emulator version 2.8.50 (v2.8.0-1006-g4e9f524)
  Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers
  # ./x86_64-softmmu/qemu-system-x86_64 -machine accel=kvm -nodefaults
  -no-reboot -nographic -cpu host -vga none  -kernel .build.kernel.kvm
  -initrd .build.initrd.kvm -append 'panic=1 no-kvmclock console=ttyS0
  loglevel=7' -m 1024 -serial stdio
  qemu-system-x86_64: /home/matwey/lab/qemu/target/i386/kvm.c:1849:
  kvm_put_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.

  First broken commit has been bisected:

  commit 48e1a45c3166d659f781171a47dabf4a187ed7a5
  Author: Paolo Bonzini 
  Date:   Wed Mar 30 22:55:29 2016 +0200

  target-i386: assert that KVM_GET/SET_MSRS can set all requested MSRs
  
  This would have caught the bug in the previous patch.
  
  Signed-off-by: Paolo Bonzini 

  My cpuinfo is the following:

  processor   : 0
  vendor_id   : GenuineIntel
  cpu family  : 6
  model   : 44
  model name  : Intel(R) Xeon(R) CPU   X5675  @ 3.07GHz
 

Re: [Qemu-devel] [PATCH] vhost-user: fix watcher need be removed when vhost-user hotplug

2017-07-23 Thread Marc-André Lureau
Hi

On Sun, Jul 23, 2017 at 4:12 AM, Michael S. Tsirkin  wrote:
> On Sat, Jul 22, 2017 at 09:24:27AM +, Marc-André Lureau wrote:
>>
>>
>> On Sat, Jul 22, 2017 at 2:35 AM Michael S. Tsirkin  wrote:
>>
>> On Fri, Jul 21, 2017 at 11:19:04AM +, Marc-André Lureau wrote:
>> > Hi
>> >
>> > On Fri, Jul 21, 2017 at 7:18 AM w00273186  
>> wrote:
>> >
>> > From: Yunjian Wang 
>> >
>> > "nc" is freed after hotplug vhost-user, but the watcher don't be
>> removed.
>> > The QEMU crash when the watcher access the "nc" on socket 
>> disconnect.
>> >
>> >
>> >
>> > This is actually your 3rd iteration on the patch
>> >
>> > Could your describe your changes since:
>> > "[PATCH v2] vhost-user: fix watcher need be removed when vhost-user
>> hotplug"
>> >
>> > Thanks
>>
>> Yes but it's a 3-liner. That's way below the limit where you need
>> detailed change history. Does the patch make sense to you?
>>
>>
>>
>> That's not all, the fact that he didn't come up with the same solution in the
>> first place, and I didn't notice a problem either with the previous approach 
>> is
>> enough to ask from some clarification on which approach is best, and I bet
>> there is something to say.
>
> I'm rather confused.  Looks like you were the one who asked for the change.
> Really we want to attract new contributors and a small bugfix like this
> seems like a very good way to start contributing. Changelog is already
> 3 times the size of the patch here. So I think we should just get the patch
> reviewed and applied if correct. Do you plan to review it?

Indeed, but I totally forgot.

This situation wouldn't happen if:
- the patch was version v3
- the patch/mail would have been annotated after  --- to quickly
describe the change
- I had better memory...


>
>> Furthermore, we would really benefit from having repeatable cases for this 
>> kind
>> of fixes.
>
> I agree disconnect path is but tested adequately but I don't think we
> are at a point where we should be asking for testcases for every use
> after free bug that gets fixed.

Not to write a test case, but at least to document what triggered this
path. Since Yunjian gave it in the previous reply, and I forgot that
too, it would be best to have it in the commit message, agree?




-- 
Marc-André Lureau



Re: [Qemu-devel] [Bug 1661386] Re: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed

2017-07-23 Thread Matwey V. Kornilov
2017-07-23 12:54 GMT+03:00 Matwey V. Kornilov :
> 2017-02-08 11:49 GMT+03:00 Paolo Bonzini :
>>> Does qemu follow recommendations from section 4.3?
>>
>> All that QEMU does is initialize MSR values and QEMU is talking to KVM,
>> not to the processor; KVM in turn talks to the host kernel's perf
>> subsystem.
>>
>> It's the host kernel's perf subsystem that needs to follow Intel's
>> recommendation.  In particular, QEMU is setting CPUID to the values
>> retrieved by
>>
>> perf_get_x86_pmu_capability(&cap);
>
> I can not find this function mentioned in qemu master sources.
>

Ok, I found this place in kvm kernel module. But it doesn't do what
you expect it to do. It just reassembles 0xA EAX from previously
parsed data.
IA32_MISC_ENABLE is not accessed anywhere here.

> The only thing I see is that has_msr_architectural_pmu is set to be
> true in kvm_arch_init_vcpu() if 0xA EAX has non-zero version. This is
> not enough according to the Intel specs.
>
>>
>> so perhaps it's perf_get_x86_pmu_capability that misreads the
>> performance monitoring capabilities provided by ESX.  Please attach
>> dmesg logs from starting the host with loglevel=9, as well as "x86info
>> -a" output from the host, to see if perf misses some problematic
>> CPUID/MSR combination.
>>
>> --
>> You received this bug notification because you are subscribed to the bug
>> report.
>> https://bugs.launchpad.net/bugs/1661386
>>
>> Title:
>>   Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed
>>
>> Status in QEMU:
>>   New
>>
>> Bug description:
>>   Hello,
>>
>>
>>   I see the following when try to run qemu from master as the following:
>>
>>   # ./x86_64-softmmu/qemu-system-x86_64 --version
>>   QEMU emulator version 2.8.50 (v2.8.0-1006-g4e9f524)
>>   Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers
>>   # ./x86_64-softmmu/qemu-system-x86_64 -machine accel=kvm -nodefaults
>>   -no-reboot -nographic -cpu host -vga none  -kernel .build.kernel.kvm
>>   -initrd .build.initrd.kvm -append 'panic=1 no-kvmclock console=ttyS0
>>   loglevel=7' -m 1024 -serial stdio
>>   qemu-system-x86_64: /home/matwey/lab/qemu/target/i386/kvm.c:1849:
>>   kvm_put_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
>>
>>   First broken commit has been bisected:
>>
>>   commit 48e1a45c3166d659f781171a47dabf4a187ed7a5
>>   Author: Paolo Bonzini 
>>   Date:   Wed Mar 30 22:55:29 2016 +0200
>>
>>   target-i386: assert that KVM_GET/SET_MSRS can set all requested MSRs
>>
>>   This would have caught the bug in the previous patch.
>>
>>   Signed-off-by: Paolo Bonzini 
>>
>>   My cpuinfo is the following:
>>
>>   processor   : 0
>>   vendor_id   : GenuineIntel
>>   cpu family  : 6
>>   model   : 44
>>   model name  : Intel(R) Xeon(R) CPU   X5675  @ 3.07GHz
>>   stepping: 2
>>   microcode   : 0x14
>>   cpu MHz : 3066.775
>>   cache size  : 12288 KB
>>   physical id : 0
>>   siblings: 2
>>   core id : 0
>>   cpu cores   : 2
>>   apicid  : 0
>>   initial apicid  : 0
>>   fpu : yes
>>   fpu_exception   : yes
>>   cpuid level : 11
>>   wp  : yes
>>   flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
>> cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx rdtscp lm 
>> constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc 
>> aperfmperf pni pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 popcnt aes hypervisor 
>> lahf_lm ida arat epb dtherm tpr_shadow vnmi ept vpid
>>   bugs:
>>   bogomips: 6133.55
>>   clflush size: 64
>>   cache_alignment : 64
>>   address sizes   : 40 bits physical, 48 bits virtual
>>   power management:
>>
>> To manage notifications about this bug go to:
>> https://bugs.launchpad.net/qemu/+bug/1661386/+subscriptions
>
>
>
> --
> With best regards,
> Matwey V. Kornilov


-- 
With best regards,
Matwey V. Kornilov

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1661386

Title:
  Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed

Status in QEMU:
  New

Bug description:
  Hello,

  
  I see the following when try to run qemu from master as the following:

  # ./x86_64-softmmu/qemu-system-x86_64 --version
  QEMU emulator version 2.8.50 (v2.8.0-1006-g4e9f524)
  Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers
  # ./x86_64-softmmu/qemu-system-x86_64 -machine accel=kvm -nodefaults
  -no-reboot -nographic -cpu host -vga none  -kernel .build.kernel.kvm
  -initrd .build.initrd.kvm -append 'panic=1 no-kvmclock console=ttyS0
  loglevel=7' -m 1024 -serial stdio
  qemu-system-x86_64: /home/matwey/lab/qemu/target/i386/kvm.c:1849:
  kvm_put_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.

  First broken commit has been bisected:

  commit 48e1a45c3166d659f781171a47dabf4a187ed7a5
  Author: Paolo Bonzini 
  Date:   Wed Mar 30 22:55:

Re: [Qemu-devel] [RFC PATCH v2 5/6] hw/pci: add bus_reserve property to pcie-root-port

2017-07-23 Thread Michael S. Tsirkin
On Sun, Jul 23, 2017 at 01:15:42AM +0300, Aleksandr Bezzubikov wrote:
> To enable hotplugging of a newly created pcie-pci-bridge,
> we need to tell firmware (SeaBIOS in this case)

Presumably, EFI would need to support this too?

> to reserve
> additional buses for pcie-root-port, that allows us to 
> hotplug pcie-pci-bridge into this root port.
> The number of buses to reserve is provided to the device via a corresponding
> property, and to the firmware via new PCI capability (next patch).
> The property's default value is 1 as we want to hotplug at least 1 bridge.

If so you should just teach firmware to allocate one bus #
unconditionally.

But why would that be so? What's wrong with a device
directly in the root port?


> 
> Signed-off-by: Aleksandr Bezzubikov 
> ---
>  hw/pci-bridge/pcie_root_port.c | 1 +
>  include/hw/pci/pcie_port.h | 3 +++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/hw/pci-bridge/pcie_root_port.c b/hw/pci-bridge/pcie_root_port.c
> index 4d588cb..b0e49e1 100644
> --- a/hw/pci-bridge/pcie_root_port.c
> +++ b/hw/pci-bridge/pcie_root_port.c
> @@ -137,6 +137,7 @@ static void rp_exit(PCIDevice *d)
>  static Property rp_props[] = {
>  DEFINE_PROP_BIT(COMPAT_PROP_PCP, PCIDevice, cap_present,
>  QEMU_PCIE_SLTCAP_PCP_BITNR, true),
> +DEFINE_PROP_UINT8("bus_reserve", PCIEPort, bus_reserve, 1),
>  DEFINE_PROP_END_OF_LIST()
>  };
>  
> diff --git a/include/hw/pci/pcie_port.h b/include/hw/pci/pcie_port.h
> index 1333266..1b2dd1f 100644
> --- a/include/hw/pci/pcie_port.h
> +++ b/include/hw/pci/pcie_port.h
> @@ -34,6 +34,9 @@ struct PCIEPort {
>  
>  /* pci express switch port */
>  uint8_t port;
> +
> +/* additional buses to reserve on firmware init */
> +uint8_t bus_reserve;
>  };
>  
>  void pcie_port_init_reg(PCIDevice *d);

So here is a property and it does not do anything.
It makes it easier to work on series maybe, but review
is harder since we do not see what it does at all.
Please do not split up patches like this - you can maintain
it split up in your branch if you like and merge before sending.

> -- 
> 2.7.4



Re: [Qemu-devel] [PULL 5/6] console: remove do_safe_dpy_refresh

2017-07-23 Thread Paolo Bonzini
On 18/07/2017 15:56, Laurent Vivier wrote:
> On 18/07/2017 15:07, Laurent Vivier wrote:
>> On 21/06/2017 15:23, Gerd Hoffmann wrote:
>>> Drop the temporary workaround for the broken display updates.
>>> All display adapters are updated, so this should be safe without
>>> causing regressions.
>>
>> It seems it breaks QMP command 'migrate "exec:cat>mig"'.
>>
>> The command hangs and doesn't create the file.
>>
>> It happens with qemu-system-ppc64 on x86 (so TCG mode).
>>
>> my command:
>>
>>./ppc64-softmmu/qemu-system-ppc64 -serial mon:stdio
>>
>> I wait SLOF fails to find an OS, and:
>>
>> Ctrl-a c
>> (qemu) migrate -d "exec:cat>mig"
>>
>> The file is not created and the command hangs:
>>
>> #0  in __lll_lock_wait
>> #1  in pthread_mutex_lock
>> #2  in qemu_mutex_lock
>> #3  in rcu_init_lock
>> #4  in fork
>> #5  in qemu_fork
>> #6  in qio_channel_command_new_spawn
>> #7  in exec_start_outgoing_migration
>> #8  in qmp_migrate
>> ...
>>
>> It looks like a deadlock.
> 
> I think this patch is not the cause of the problem, the one it removes
> just unlocks the deadlock by playing with locks.
> 
> We have a rcu_init_lock() on fork() because of:
> 
> utils/rcu.c:
> 
> static void __attribute__((__constructor__)) rcu_init(void)
> {
> #ifdef CONFIG_POSIX
> pthread_atfork(rcu_init_lock, rcu_init_unlock, rcu_init_unlock);
> #endif
> rcu_init_complete();
> }
> 
> The QMP thread hangs on:
> 
> (gdb) p rcu_sync_lock
> $1 = {lock = {__data = {__lock = 2, __count = 0, __owner = 23865,
>   __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {
> __prev = 0x0, __next = 0x0}},
> __size = "\002\000\000\000\000\000\000\000\071]\000\000\001", '\000'
> , __align = 2}, initialized = true}
> 
> 
> The lock is already taken by thread 2:
> 
> (gdb) info thread
>   Id   Target Id Frame
>   1Thread 0x7f1cf02fdf00 (LWP 23864) "qemu-system-ppc"
> 0x7f1cd914b37d in __lll_lock_wait () from /lib64/libpthread.so.0
> * 2Thread 0x7f1cc9762700 (LWP 23865) "qemu-system-ppc"
> 0x7f1cd410daa9 in syscall () from /lib64/libc.so.6
>   3Thread 0x7f1cbf8d5700 (LWP 23866) "qemu-system-ppc"
> 0x7f1cd914b37d in __lll_lock_wait () from /lib64/libpthread.so.0
> 
> (gdb) bt
> #0  0x7f1cd410daa9 in syscall () at /lib64/libc.so.6
> #1  0x55ab028ddda2 in qemu_futex_wait
> #2  0x55ab028ddda2 in qemu_event_wait
> #3  0x55ab028eda2b in wait_for_readers
> #4  0x55ab028eda2b in synchronize_rcu
> #5  0x55ab028edc5b in call_rcu_thread
> #6  0x7f1cd914273a in start_thread ()
> #7  0x7f1cd4113e0f in clone ()
> 
> So it seems we cannot fork() from QMP?
> [cc: Paolo]

There have been other similar bugs, as David reported.  The plan was to
disable pthread_atfork soon after daemonize (basically assuming that
after daemonize fork is immediately followed by exec), but I've been
lazy and never finished those patches.  Looks like it's time.

Paolo



[Qemu-devel] [Bug 1089005] Re: Qemu does not shutdown with vnc enabled on OS X

2017-07-23 Thread Rida Shamasneh
the issue is still there
i am using 2.9.50
i want to stop a running process when i press ctrl-c and not stop the whole qemu

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1089005

Title:
  Qemu does not shutdown with vnc enabled on OS X

Status in QEMU:
  Expired

Bug description:
  I am running OS X 10.8.2 and Qemu 1.3.50 from your git repository.

  Running

  qemu-system-i386 

  works fine. I can quit the process using ctrl-c.

  When I try to use

   qemu-system-i386 -vnc : 

  ctrl-c does nothing and I have to kill the process trough the activity 
monitor.
  Furthermore terminating the process from my java program does not work 
either. 
  I have also posted a question on Stackoverflow: 
http://stackoverflow.com/questions/13798367/qemu-does-not-shutdown-with-vnc-enabled-on-os-x

  Thanks
  Leander

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1089005/+subscriptions



Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion

2017-07-23 Thread Rik van Riel
On Sat, 2017-07-22 at 12:34 -0700, Dan Williams wrote:
> On Fri, Jul 21, 2017 at 8:58 AM, Stefan Hajnoczi  > wrote:
> >
> > Maybe the NVDIMM folks can comment on this idea.
> 
> I think it's unworkable to use the flush hints as a guest-to-host
> fsync mechanism. That mechanism was designed to flush small memory
> controller buffers, not large swaths of dirty memory. What about
> running the guests in a writethrough cache mode to avoid needing
> dirty
> cache management altogether? Either way I think you need to use
> device-dax on the host, or one of the two work-in-progress filesystem
> mechanisms (synchronous-faults or S_IOMAP_FROZEN) to avoid need any
> metadata coordination between guests and the host.

The thing Pankaj is looking at is to use the DAX mechanisms
inside the guest (disk image as memory mapped nvdimm area),
with that disk image backed by a regular storage device on
the host.

The goal is to increase density of guests, by moving page
cache into the host (where it can be easily reclaimed).

If we assume the guests will be backed by relatively fast
SSDs, a "whole device flush" from filesystem journaling
code (issued where the filesystem issues a barrier or
disk cache flush today) may be just what we need to make
that work.



Re: [Qemu-devel] [PATCH 1/5] Makefile: clean: Clean exe files

2017-07-23 Thread Philippe Mathieu-Daudé

Hi Sameeh,

On 07/05/2017 04:54 AM, Sameeh Jubran wrote:

From: Sameeh Jubran 

Clean exe files such as qemu-ga.exe

Signed-off-by: Sameeh Jubran 
---
  Makefile | 1 +
  1 file changed, 1 insertion(+)

diff --git a/Makefile b/Makefile
index 16a0430..22d29d6 100644
--- a/Makefile
+++ b/Makefile
@@ -487,6 +487,7 @@ clean:
rm -f config.mak op-i386.h opc-i386.h gen-op-i386.h op-arm.h opc-arm.h 
gen-op-arm.h
rm -f qemu-options.def
rm -f *.msi
+   rm -f *${EXESUF}
find . \( -name '*.so' -o -name '*.dll' -o -name '*.mo' -o -name 
'*.[oda]' \) -type f -exec rm {} +
rm -f $(filter-out %.tlb,$(TOOLS)) $(HELPERS-y) qemu-ga TAGS cscope.* 
*.pod *~ */*~


It seems to me your problem is here, this line should be:

- rm -f $(filter-out %.tlb,$(TOOLS)) $(HELPERS-y) qemu-ga TAGS cscope.* 
*.pod *~ */*~
+ rm -f $(filter-out %.tlb,$(TOOLS)) $(HELPERS-y) qemu-ga${EXESUF} TAGS 
cscope.* *.pod *~ */*~



rm -f fsdev/*.pod



Regards,

Phil.



Re: [Qemu-devel] [RFC PATCH v2 5/6] hw/pci: add bus_reserve property to pcie-root-port

2017-07-23 Thread Marcel Apfelbaum

On 23/07/2017 15:22, Michael S. Tsirkin wrote:

On Sun, Jul 23, 2017 at 01:15:42AM +0300, Aleksandr Bezzubikov wrote:

To enable hotplugging of a newly created pcie-pci-bridge,
we need to tell firmware (SeaBIOS in this case)




Hi Michael,


Presumably, EFI would need to support this too?



Sure, Eduardo added to CC, but he is in PTO now.


to reserve
additional buses for pcie-root-port, that allows us to
hotplug pcie-pci-bridge into this root port.
The number of buses to reserve is provided to the device via a corresponding
property, and to the firmware via new PCI capability (next patch).
The property's default value is 1 as we want to hotplug at least 1 bridge.


If so you should just teach firmware to allocate one bus #
unconditionally.



That would be a problem for the PCIe machines, since each PCIe
devices is plugged in a different bus and we are already
limited to 256 PCIe devices. Allocating an extra-bus always
would really limit the PCIe devices we can use.


But why would that be so? What's wrong with a device
directly in the root port?



First, plugging a legacy PCI device into a PCIe Root Port
looks strange at least, and it can;t be done on real HW anyway.
(incompatible slots)

Second (and more important), if we want 2 or more PCI
devices we would loose both IO ports space and bus numbers.





Signed-off-by: Aleksandr Bezzubikov 
---
  hw/pci-bridge/pcie_root_port.c | 1 +
  include/hw/pci/pcie_port.h | 3 +++
  2 files changed, 4 insertions(+)

diff --git a/hw/pci-bridge/pcie_root_port.c b/hw/pci-bridge/pcie_root_port.c
index 4d588cb..b0e49e1 100644
--- a/hw/pci-bridge/pcie_root_port.c
+++ b/hw/pci-bridge/pcie_root_port.c
@@ -137,6 +137,7 @@ static void rp_exit(PCIDevice *d)
  static Property rp_props[] = {
  DEFINE_PROP_BIT(COMPAT_PROP_PCP, PCIDevice, cap_present,
  QEMU_PCIE_SLTCAP_PCP_BITNR, true),
+DEFINE_PROP_UINT8("bus_reserve", PCIEPort, bus_reserve, 1),
  DEFINE_PROP_END_OF_LIST()
  };
  
diff --git a/include/hw/pci/pcie_port.h b/include/hw/pci/pcie_port.h

index 1333266..1b2dd1f 100644
--- a/include/hw/pci/pcie_port.h
+++ b/include/hw/pci/pcie_port.h
@@ -34,6 +34,9 @@ struct PCIEPort {
  
  /* pci express switch port */

  uint8_t port;
+
+/* additional buses to reserve on firmware init */
+uint8_t bus_reserve;
  };
  
  void pcie_port_init_reg(PCIDevice *d);


So here is a property and it does not do anything.
It makes it easier to work on series maybe, but review
is harder since we do not see what it does at all.
Please do not split up patches like this - you can maintain
it split up in your branch if you like and merge before sending.



Agreed, Alexandr please merge patches 4-5-6 for your next submission.

Thanks,
Marcel



--
2.7.4





Re: [Qemu-devel] [PATCH] qemu-iotests: add a "how to" to ./README

2017-07-23 Thread Philippe Mathieu-Daudé

On 07/21/2017 12:51 PM, Stefan Hajnoczi wrote:

On Fri, Jul 21, 2017 at 07:16:34AM -0500, Eric Blake wrote:

On 07/21/2017 04:34 AM, Stefan Hajnoczi wrote:

There is not much getting started documentation for qemu-iotests.  This
patch explains how to create a new test and covers the overall testing
approach.

Cc: Ishani Chugh 
Signed-off-by: Stefan Hajnoczi 
---



+3. Assign groups to the test
+
+Add your test to the ./group file.  This file is the index of tests and assigns
+them to functional groups like "rw" for read-write tests.  Most tests belong to
+the "rw" and "auto" groups.  "auto" means the test runs when ./check is invoked
+without a -g argument.
+
+Consider adding your test to the "quick" group if it executes quickly (<1s).


We have several tests going up to 5s (and I have a patch pending to
remove two tests that took longer) - I think 1s is a bit on the short
end for still classifying a test as quick.


I'm happy to accept any number blessed by Kevin.  I do think that 1
second is a safe maximum and no one should get in trouble for adding a
test that takes 1 second to the "quick" group.


+This group is run by "make check-block" and is often included as part of build
+tests in continuous integration systems.


It would still be nice to have 'make check' run 'make check-block'...
but that's independent of this patch.


Yes, there is a separate discussion about that on the list right now.
Hopefully it will be added back.


+Once you are happy with the test output it can be used as the golden master
+with "mv .out.bad .out".  Rerun the test to verify
+that it passes.
+
+Congratulations, you've created a new test!


Maybe a reminder to 'git add' the new files, then submit the patch?


I thought about this too but decided not to get into the git and patch
submission business.  I carefully worded it to be about "creating" tests
rather than "adding" them to qemu.git because I wanted to limit the
scope of this README :).


Maybe something tiny like:

  Congratulations, you've created a new test!

  To share your test to upstream QEMU (highly recommended!) just follow
  these recommendations: http://wiki.qemu.org/Contribute/SubmitAPatch



Stefan



Reviewed-by: Philippe Mathieu-Daudé 



Re: [Qemu-devel] [Qemu-arm] [RFC PATCH for 2.11 01/23] softfloat: move existing softfloat2a into versioned directory

2017-07-23 Thread Philippe Mathieu-Daudé
Even if some people ever object to use SoftFloat3c and your series, this 
patch fits well for mainline.


On 07/20/2017 12:04 PM, Alex Bennée wrote:

This is a precursor to importing SoftFloat3c into the QEMU source
tree. As the two versions of SoftFloat present different APIs the hope
is to selectively include the version 3 header when required. So for
this change:

  - #include "fpu/softloat.h" -> #include "fpu/softfloat2a/softfloat.h"
  - fpu/softfloat* -> fpu/softfloat2a/

This is all pretty mechanical and doesn't yet change any capabilities.

Signed-off-by: Alex Bennée 


Reviewed-by: Philippe Mathieu-Daudé 


---
  Makefile.target  | 2 +-
  fpu/Makefile.objs| 1 +
  fpu/{ => softfloat2a}/softfloat-macros.h | 0
  fpu/{ => softfloat2a}/softfloat-specialize.h | 0
  fpu/{ => softfloat2a}/softfloat.c| 2 +-
  include/fpu/{ => softfloat2a}/softfloat.h| 0
  include/qemu/bswap.h | 2 +-
  linux-user/arm/nwfpe/double_cpdo.c   | 2 +-
  linux-user/arm/nwfpe/extended_cpdo.c | 2 +-
  linux-user/arm/nwfpe/fpa11.h | 2 +-
  linux-user/arm/nwfpe/fpa11_cpdt.c| 2 +-
  linux-user/arm/nwfpe/fpa11_cprt.c| 2 +-
  linux-user/arm/nwfpe/fpopcode.c  | 2 +-
  linux-user/arm/nwfpe/single_cpdo.c   | 2 +-
  scripts/analyze-inclusions   | 2 +-
  target/alpha/cpu.h   | 2 +-
  target/alpha/fpu_helper.c| 2 +-
  target/alpha/helper.c| 2 +-
  target/alpha/vax_helper.c| 2 +-
  target/arm/cpu.h | 2 +-
  target/hppa/cpu.h| 2 +-
  target/hppa/helper.c | 2 +-
  target/i386/cpu.h| 2 +-
  target/m68k/cpu.h| 2 +-
  target/microblaze/cpu.h  | 2 +-
  target/mips/cpu.h| 2 +-
  target/moxie/cpu.h   | 2 +-
  target/nios2/cpu.h   | 2 +-
  target/openrisc/cpu.h| 2 +-
  target/ppc/cpu.h | 2 +-
  target/s390x/cpu.h   | 2 +-
  target/sh4/cpu.h | 2 +-
  target/sparc/cpu.h   | 2 +-
  target/tricore/cpu.h | 2 +-
  target/unicore32/cpu.h   | 2 +-
  target/xtensa/cpu.h  | 2 +-
  36 files changed, 33 insertions(+), 32 deletions(-)
  create mode 100644 fpu/Makefile.objs
  rename fpu/{ => softfloat2a}/softfloat-macros.h (100%)
  rename fpu/{ => softfloat2a}/softfloat-specialize.h (100%)
  rename fpu/{ => softfloat2a}/softfloat.c (99%)
  rename include/fpu/{ => softfloat2a}/softfloat.h (100%)

diff --git a/Makefile.target b/Makefile.target
index 2baec9252f..0991847dbd 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -94,7 +94,7 @@ obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-op.o tcg/optimize.o
  obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/tcg-runtime.o
  obj-$(CONFIG_TCG_INTERPRETER) += tcg/tci.o
  obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
-obj-y += fpu/softfloat.o
+obj-y += fpu/
  obj-y += target/$(TARGET_BASE_ARCH)/
  obj-y += disas.o
  obj-$(call notempty,$(TARGET_XML_FILES)) += gdbstub-xml.o
diff --git a/fpu/Makefile.objs b/fpu/Makefile.objs
new file mode 100644
index 00..938b4acbd0
--- /dev/null
+++ b/fpu/Makefile.objs
@@ -0,0 +1 @@
+obj-y = softfloat2a/softfloat.o
diff --git a/fpu/softfloat-macros.h b/fpu/softfloat2a/softfloat-macros.h
similarity index 100%
rename from fpu/softfloat-macros.h
rename to fpu/softfloat2a/softfloat-macros.h
diff --git a/fpu/softfloat-specialize.h b/fpu/softfloat2a/softfloat-specialize.h
similarity index 100%
rename from fpu/softfloat-specialize.h
rename to fpu/softfloat2a/softfloat-specialize.h
diff --git a/fpu/softfloat.c b/fpu/softfloat2a/softfloat.c
similarity index 99%
rename from fpu/softfloat.c
rename to fpu/softfloat2a/softfloat.c
index 433c5dad2d..59b9bc9e24 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat2a/softfloat.c
@@ -84,7 +84,7 @@ this code that are retained.
   */
  #include "qemu/osdep.h"
  
-#include "fpu/softfloat.h"

+#include "fpu/softfloat2a/softfloat.h"
  
  /* We only need stdlib for abort() */
  
diff --git a/include/fpu/softfloat.h b/include/fpu/softfloat2a/softfloat.h

similarity index 100%
rename from include/fpu/softfloat.h
rename to include/fpu/softfloat2a/softfloat.h
diff --git a/include/qemu/bswap.h b/include/qemu/bswap.h
index 09c78fd28a..0f01edd356 100644
--- a/include/qemu/bswap.h
+++ b/include/qemu/bswap.h
@@ -1,7 +1,7 @@
  #ifndef BSWAP_H
  #define BSWAP_H
  
-#include "fpu/softfloat.h"

+#include "fpu/softfloat2a/softfloat.h"
  
  #ifdef CONFIG_MACHINE_BSWAP_H

  # include 
diff --git a/linux-user/arm/nwfpe/double_cpdo.c 
b/linux-user/arm/nwfpe/double_cpdo.c
index 1cef380852..c4824bf4d8 1006

Re: [Qemu-devel] [Qemu-block] [PATCH] block: check BlockDriverState object before dereference

2017-07-23 Thread Paolo Bonzini
On 21/07/2017 17:47, Stefan Hajnoczi wrote:
> Hmm...BlockDriverState still has bdrv_is_inserted() even though
> BlockBackend->root can be NULL?  CCing Markus in case he has thoughts on
> the BB/BDS split.
> 
> I find it weird that block-backend.c calls bdrv_inc_in_flight() and then
> bdrv_co_flush() will call it again.  Perhaps we need to do this so that
> blk_drain() works correctly but it's odd that they share the same
> counter variable.

See here:
https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg02305.html

> I need to count requests at the BB level because the blk_aio_*
> operations have a separate bottom half that is invoked if either 1) they
> never reach BDS (because of an error); or 2) the bdrv_co_* call
> completes without yielding.  The count must be >0 when blk_aio_*
> returns, or bdrv_drain (and thus blk_drain) won't loop.  Because
> bdrv_drain and blk_drain are conflated, the counter must be the BDS one.
> 
> In turn, the BDS counter is needed because of the lack of isolated
> sections.  The right design would be for blk_isolate_begin to call
> blk_drain on *other* BlockBackends reachable in a child-to-parent visit.
>  Instead, until that is implemented, we have bdrv_drained_begin that
> emulates that through the same-named callback, followed by a
> parent-to-child bdrv_drain that is almost always unnecessary.

The full explanation of the long-term plans and what "isolated section"
means is at
https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg02016.html.
(Unfortunately, since then we've reintroduced the "double aio_poll" hack
in BDRV_POLL_WHILE...).

Paolo



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PULL 40/41] vhost-user-scsi: Introduce vhost-user-scsi host device

2017-07-23 Thread Paolo Bonzini
On 18/07/2017 18:32, Marc-André Lureau wrote:
> 
> 
> On Thu, Jun 15, 2017 at 1:20 PM Paolo Bonzini  > wrote:
> 
> From: Felipe Franciosi mailto:fel...@nutanix.com>>
> 
> This commit introduces a vhost-user device for SCSI. This is based
> on the existing vhost-scsi implementation, but done over vhost-user
> instead. It also uses a chardev to connect to the backend. Unlike
> vhost-scsi (today), VMs using vhost-user-scsi can be live migrated.
> 
> To use it, start Qemu with a command line equivalent to:
> 
> qemu-system-x86_64 \
>-chardev socket,id=vus0,path=/tmp/vus.sock \
>-device vhost-user-scsi-pci,chardev=vus0,bus=pci.0,addr=...
> 
> A separate commit presents a sample application linked with libiscsi to
> provide a backend for vhost-user-scsi.
> 
> Signed-off-by: Felipe Franciosi  >
> Message-Id: <1488479153-21203-4-git-send-email-fel...@nutanix.com
> >
> Signed-off-by: Paolo Bonzini  >
> ---
> 
> 
> I just realized the patch was missing qemu-options.hx update. Oh wait,
> virtio-scsi-pci isn't documented either?
> 
> Would it have been possible to add a vhost-user-chr=chr argument on
> virtio-scsi-pci instead of introducing a new device? Would that make
> sense (to share code, argument etc)?

No, it wouldn't make sense.

The reason is that vhost-scsi devices are configured entirely outside
QEMU, and for this reason vhost-scsi/vhost-user-scsi do not expose a
SCSI bus at the QEMU level.  They do share code (through the
virtio-scsi-common superclass), but they are fundamentally different.

The vhost-blk device probably would also be a separate class, for the
same reason.  There is no way to define a -drive option with a chardev
backend.

Even vhost-user-net is somewhat hackish in its use of -netdev in my
opinion, because it cannot implement ->receive (except for the RARP
hack) and it bypasses things such as filters and hubs.

Paolo



Re: [Qemu-devel] [PATCH 19/29] disas: use QEMU_IS_ALIGNED macro

2017-07-23 Thread Paolo Bonzini
On 18/07/2017 16:43, Thomas Huth wrote:
> On 18.07.2017 13:07, Marc-André Lureau wrote:
>> Hi
>>
>> On Mon, Jul 17, 2017 at 11:09 PM, Philippe Mathieu-Daudé
>>  wrote:
>>> Applied using the Coccinelle semantic patch 
>>> scripts/coccinelle/use_osdep.cocci
> [...]
>>> diff --git a/disas.c b/disas.c
>>> index d335c55bbf..8b59448286 100644
>>> --- a/disas.c
>>> +++ b/disas.c
>>> @@ -151,7 +151,7 @@ static int print_insn_objdump(bfd_vma pc, 
>>> disassemble_info *info,
>>>  info->read_memory_func(pc, buf, n, info);
>>>
>>>  for (i = 0; i < n; ++i) {
>>> -if (i % 32 == 0) {
>>> +if (QEMU_IS_ALIGNED(i, 32)) {
>>>  info->fprintf_func(info->stream, "\n%s: ", prefix);
>>>  }
>>>  info->fprintf_func(info->stream, "%02x", buf[i]);
> 
> This looks wrong to me. QEMU_IS_ALIGNED should be used for addresses and
> similar things. This part here is about pretty printing a hex dump.
> 
> The code should likely be converted to use qemu_hexdump() instead, I guess.

qemu_hexdump only works with FILE*.

But I agree that QEMU_IS_ALIGNED looks weird here.  I think it should
mostly be used when the argument is a pointer, to hide the cast.  Uses
for non-pointer arguments should be decided on a one-by-one basis; "is
the first argument an address" is definitely a good thing to consider.
Another might be "is the second argument a variable", because "i % n ==
0" is less efficient than "(i & (n - 1)) == 0".

Paolo


Paolo



[Qemu-devel] [Bug 1295587] Re: Temporal freeze and slowdown while using emulated sb16

2017-07-23 Thread Blake Lee
Is there any fix or workaround for this bug? Some old games won't use
the AC97 and require SB16.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1295587

Title:
  Temporal freeze and slowdown while using emulated sb16

Status in QEMU:
  New

Bug description:
  I have been carrying around this bug since previous versions and on
  different machines: When I use the -soundhw sb16 option, while playing
  any sound on the virtual machine it temporally freezes the emulated
  machine and loops the last bit of such sound effect for 1-2 minutes,
  then goes back to normal speed (until a new sound is played).

  Console shows:

   sb16: warning: command 0xf9,1 is not truly understood yet
   sb16: warning: command 0xf9,1 is not truly understood yet
  (...)
  main-loop: WARNING: I/O thread spun for 1000 iterations

  -One of my emulated machines is Windows 3.11: I managed to overrun
  this bug by switching from the local 1.5 version of the sound blaster
  driver to the 1.0, although since I updated qemu it freezes that
  machine, so I can't test if it still works.

  I am using the 1.7.90 version, but I suffered this bug for over one
  year (confirmed in version 2.0.0-rc0 too)

  this bug happens anytime I use the -soundhw sb16 switch, but the full
  command I am using in this specific case is:

  qemu-system-i386 -localtime -cpu pentium -m 32 -display sdl -vga
  cirrus -hda c.img -cdrom win95stuff.iso -net nic,model=ne2k_pci -net
  user -soundhw sb16

  This bug appears on all my machines: Pentium III running Slackware
  13.0 and freeBSD 10; Dual core T2400, both in Arch, Gentoo and
  Slackware 14.1 (all 32 bits), and a Dual core T4400 64 bits with
  Gentoo and Slackware. Same problem in all of those systems after
  compiling instead of using the distro packages

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1295587/+subscriptions



Re: [Qemu-devel] [RFC PATCH v2 4/6] hw/pci: introduce bridge-only vendor-specific capability to provide some hints to firmware

2017-07-23 Thread Marcel Apfelbaum

On 23/07/2017 1:15, Aleksandr Bezzubikov wrote:

On PCI init PCI bridges may need some
extra info about bus number to reserve, IO, memory and
prefetchable memory limits. QEMU can provide this
with special vendor-specific PCI capability.

Sizes of limits match ones from
PCI Type 1 Configuration Space Header,
number of buses to reserve occupies only 1 byte
since it is the size of Subordinate Bus Number register.



Hi Alexandr,


Signed-off-by: Aleksandr Bezzubikov 
---
  hw/pci/pci_bridge.c | 27 +++
  include/hw/pci/pci_bridge.h | 18 ++
  2 files changed, 45 insertions(+)

diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
index 720119b..8ec6c2c 100644
--- a/hw/pci/pci_bridge.c
+++ b/hw/pci/pci_bridge.c
@@ -408,6 +408,33 @@ void pci_bridge_map_irq(PCIBridge *br, const char* 
bus_name,
  br->bus_name = bus_name;
  }
  
+

+int pci_bridge_help_cap_init(PCIDevice *dev, int cap_offset,


Can you please rename to something like
'pci_bridge_qemu_cap_init' to be more specific?


+  uint8_t bus_reserve, uint32_t io_limit,
+  uint16_t mem_limit, uint64_t pref_limit,



I am not sure regarding "limit" suffix, this
is a reservation, not a limitation.


+  Error **errp)
+{
+size_t cap_len = sizeof(PCIBridgeQemuCap);
+PCIBridgeQemuCap cap;
+
+cap.len = cap_len;
+cap.bus_res = bus_reserve;
+cap.io_lim = io_limit & 0xFF;
+cap.io_lim_upper = io_limit >> 8 & 0x;
+cap.mem_lim = mem_limit;
+cap.pref_lim = pref_limit & 0x;
+cap.pref_lim_upper = pref_limit >> 16 & 0x;
+
+int offset = pci_add_capability(dev, PCI_CAP_ID_VNDR,
+cap_offset, cap_len, errp);
+if (offset < 0) {
+return offset;
+}
+
+memcpy(dev->config + offset + 2, (char *)&cap + 2, cap_len - 2);
+return 0;
+}
+
  static const TypeInfo pci_bridge_type_info = {
  .name = TYPE_PCI_BRIDGE,
  .parent = TYPE_PCI_DEVICE,
diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
index ff7cbaa..c9f642c 100644
--- a/include/hw/pci/pci_bridge.h
+++ b/include/hw/pci/pci_bridge.h
@@ -67,4 +67,22 @@ void pci_bridge_map_irq(PCIBridge *br, const char* bus_name,
  #define  PCI_BRIDGE_CTL_DISCARD_STATUS0x400   /* Discard timer status 
*/
  #define  PCI_BRIDGE_CTL_DISCARD_SERR  0x800   /* Discard timer SERR# enable */
  
+typedef struct PCIBridgeQemuCap {

+uint8_t id; /* Standard PCI capability header field */
+uint8_t next;   /* Standard PCI capability header field */
+uint8_t len;/* Standard PCI vendor-specific capability header field */
+uint8_t bus_res;
+uint32_t pref_lim_upper;
+uint16_t pref_lim;
+uint16_t mem_lim;


This 32bit IOMEM, right?


+uint16_t io_lim_upper;
+uint8_t io_lim;


Why do we need io_lim and io_lim_upper?

Thanks,
Marcel


+uint8_t padding;
+} PCIBridgeQemuCap;
+
+int pci_bridge_help_cap_init(PCIDevice *dev, int cap_offset,
+  uint8_t bus_reserve, uint32_t io_limit,
+  uint16_t mem_limit, uint64_t pref_limit,
+  Error **errp);
+
  #endif /* QEMU_PCI_BRIDGE_H */






Re: [Qemu-devel] [RFC PATCH v2 2/6] hw/i386: allow SHPC for Q35 machine

2017-07-23 Thread Marcel Apfelbaum

On 23/07/2017 1:15, Aleksandr Bezzubikov wrote:

Unmask previously masked SHPC feature in _OSC method.

Signed-off-by: Aleksandr Bezzubikov 
---
  hw/i386/acpi-build.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 6b7bade..0d99585 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -1850,7 +1850,7 @@ static Aml *build_q35_osc_method(void)
   * Always allow native PME, AER (no dependencies)
   * Never allow SHPC (no SHPC controller in this system)


Seems the above comment is not longer correct :)

Thanks,
Marcel


   */
-aml_append(if_ctx, aml_and(a_ctrl, aml_int(0x1D), a_ctrl));
+aml_append(if_ctx, aml_and(a_ctrl, aml_int(0x1F), a_ctrl));
  
  if_ctx2 = aml_if(aml_lnot(aml_equal(aml_arg(1), aml_int(1;

  /* Unknown revision */






Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion

2017-07-23 Thread Dan Williams
[ adding Ross and Jan ]

On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel  wrote:
> On Sat, 2017-07-22 at 12:34 -0700, Dan Williams wrote:
>> On Fri, Jul 21, 2017 at 8:58 AM, Stefan Hajnoczi > > wrote:
>> >
>> > Maybe the NVDIMM folks can comment on this idea.
>>
>> I think it's unworkable to use the flush hints as a guest-to-host
>> fsync mechanism. That mechanism was designed to flush small memory
>> controller buffers, not large swaths of dirty memory. What about
>> running the guests in a writethrough cache mode to avoid needing
>> dirty
>> cache management altogether? Either way I think you need to use
>> device-dax on the host, or one of the two work-in-progress filesystem
>> mechanisms (synchronous-faults or S_IOMAP_FROZEN) to avoid need any
>> metadata coordination between guests and the host.
>
> The thing Pankaj is looking at is to use the DAX mechanisms
> inside the guest (disk image as memory mapped nvdimm area),
> with that disk image backed by a regular storage device on
> the host.
>
> The goal is to increase density of guests, by moving page
> cache into the host (where it can be easily reclaimed).
>
> If we assume the guests will be backed by relatively fast
> SSDs, a "whole device flush" from filesystem journaling
> code (issued where the filesystem issues a barrier or
> disk cache flush today) may be just what we need to make
> that work.

Ok, apologies, I indeed had some pieces of the proposal confused.

However, it still seems like the storage interface is not capable of
expressing what is needed, because the operation that is needed is a
range flush. In the guest you want the DAX page dirty tracking to
communicate range flush information to the host, but there's no
readily available block i/o semantic that software running on top of
the fake pmem device can use to communicate with the host. Instead you
want to intercept the dax_flush() operation and turn it into a queued
request on the host.

In 4.13 we have turned this dax_flush() operation into an explicit
driver call. That seems a better interface to modify than trying to
map block-storage flush-cache / force-unit-access commands to this
host request.

The additional piece you would need to consider is whether to track
all writes in addition to mmap writes in the guest as DAX-page-cache
dirtying events, or arrange for every dax_copy_from_iter() operation()
to also queue a sync on the host, but that essentially turns the host
page cache into a pseudo write-through mode.



Re: [Qemu-devel] [RFC PATCH v2 1/4] pci: refactor pci_find_capapibilty to get bdf as the first argument instead of the whole pci_device

2017-07-23 Thread Marcel Apfelbaum

Hi Alexandr,

On 23/07/2017 1:11, Aleksandr Bezzubikov wrote:

Refactor pci_find_capability function to get bdf instead of
a whole pci_device* as the only necessary field for this function
is still bdf.
It greatly helps when we have bdf but not pci_device.


You can drop the last sentence. Other than that:


Reviewed-by: Marcel Apfelbaum 


Thanks,
Marcel



Signed-off-by: Aleksandr Bezzubikov 
---
  src/fw/pciinit.c|  4 ++--
  src/hw/pcidevice.c  | 12 ++--
  src/hw/pcidevice.h  |  2 +-
  src/hw/virtio-pci.c |  4 ++--
  4 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
index 08221e6..864954f 100644
--- a/src/fw/pciinit.c
+++ b/src/fw/pciinit.c
@@ -762,7 +762,7 @@ static int pci_bus_hotplug_support(struct pci_bus *bus, u8 
pcie_cap)
  return downstream_port && slot_implemented;
  }
  
-shpc_cap = pci_find_capability(bus->bus_dev, PCI_CAP_ID_SHPC, 0);

+shpc_cap = pci_find_capability(bus->bus_dev->bdf, PCI_CAP_ID_SHPC, 0);
  return !!shpc_cap;
  }
  
@@ -844,7 +844,7 @@ static int pci_bios_check_devices(struct pci_bus *busses)

   */
  parent = &busses[0];
  int type;
-u8 pcie_cap = pci_find_capability(s->bus_dev, PCI_CAP_ID_EXP, 0);
+u8 pcie_cap = pci_find_capability(s->bus_dev->bdf, PCI_CAP_ID_EXP, 0);
  int hotplug_support = pci_bus_hotplug_support(s, pcie_cap);
  for (type = 0; type < PCI_REGION_TYPE_COUNT; type++) {
  u64 align = (type == PCI_REGION_TYPE_IO) ?
diff --git a/src/hw/pcidevice.c b/src/hw/pcidevice.c
index cfebf66..d01e27b 100644
--- a/src/hw/pcidevice.c
+++ b/src/hw/pcidevice.c
@@ -134,25 +134,25 @@ pci_find_init_device(const struct pci_device_id *ids, 
void *arg)
  return NULL;
  }
  
-u8 pci_find_capability(struct pci_device *pci, u8 cap_id, u8 cap)

+u8 pci_find_capability(u16 bdf, u8 cap_id, u8 cap)
  {
  int i;
-u16 status = pci_config_readw(pci->bdf, PCI_STATUS);
+u16 status = pci_config_readw(bdf, PCI_STATUS);
  
  if (!(status & PCI_STATUS_CAP_LIST))

  return 0;
  
  if (cap == 0) {

  /* find first */
-cap = pci_config_readb(pci->bdf, PCI_CAPABILITY_LIST);
+cap = pci_config_readb(bdf, PCI_CAPABILITY_LIST);
  } else {
  /* find next */
-cap = pci_config_readb(pci->bdf, cap + PCI_CAP_LIST_NEXT);
+cap = pci_config_readb(bdf, cap + PCI_CAP_LIST_NEXT);
  }
  for (i = 0; cap && i <= 0xff; i++) {
-if (pci_config_readb(pci->bdf, cap + PCI_CAP_LIST_ID) == cap_id)
+if (pci_config_readb(bdf, cap + PCI_CAP_LIST_ID) == cap_id)
  return cap;
-cap = pci_config_readb(pci->bdf, cap + PCI_CAP_LIST_NEXT);
+cap = pci_config_readb(bdf, cap + PCI_CAP_LIST_NEXT);
  }
  
  return 0;

diff --git a/src/hw/pcidevice.h b/src/hw/pcidevice.h
index 354b549..adcc75a 100644
--- a/src/hw/pcidevice.h
+++ b/src/hw/pcidevice.h
@@ -69,7 +69,7 @@ int pci_init_device(const struct pci_device_id *ids
  , struct pci_device *pci, void *arg);
  struct pci_device *pci_find_init_device(const struct pci_device_id *ids
  , void *arg);
-u8 pci_find_capability(struct pci_device *pci, u8 cap_id, u8 cap);
+u8 pci_find_capability(u16 bdf, u8 cap_id, u8 cap);
  void pci_enable_busmaster(struct pci_device *pci);
  u16 pci_enable_iobar(struct pci_device *pci, u32 addr);
  void *pci_enable_membar(struct pci_device *pci, u32 addr);
diff --git a/src/hw/virtio-pci.c b/src/hw/virtio-pci.c
index e5c2c33..4e33033 100644
--- a/src/hw/virtio-pci.c
+++ b/src/hw/virtio-pci.c
@@ -381,7 +381,7 @@ fail:
  
  void vp_init_simple(struct vp_device *vp, struct pci_device *pci)

  {
-u8 cap = pci_find_capability(pci, PCI_CAP_ID_VNDR, 0);
+u8 cap = pci_find_capability(pci->bdf, PCI_CAP_ID_VNDR, 0);
  struct vp_cap *vp_cap;
  const char *mode;
  u32 offset, base, mul;
@@ -479,7 +479,7 @@ void vp_init_simple(struct vp_device *vp, struct pci_device 
*pci)
  vp_cap->cap, type, vp_cap->bar, addr, offset, mode);
  }
  
-cap = pci_find_capability(pci, PCI_CAP_ID_VNDR, cap);

+cap = pci_find_capability(pci->bdf, PCI_CAP_ID_VNDR, cap);
  }
  
  if (vp->common.cap && vp->notify.cap && vp->isr.cap && vp->device.cap) {







Re: [Qemu-devel] [RFC PATCH v2 2/4] pci: add RedHat vendor ID

2017-07-23 Thread Marcel Apfelbaum

On 23/07/2017 1:11, Aleksandr Bezzubikov wrote:

Signed-off-by: Aleksandr Bezzubikov 
---
  src/hw/pci_ids.h | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/src/hw/pci_ids.h b/src/hw/pci_ids.h
index 4ac73b4..db2e694 100644
--- a/src/hw/pci_ids.h
+++ b/src/hw/pci_ids.h
@@ -2263,6 +2263,8 @@
  #define PCI_DEVICE_ID_KORENIX_JETCARDF0   0x1600
  #define PCI_DEVICE_ID_KORENIX_JETCARDF1   0x16ff
  
+#define PCI_VENDOR_ID_REDHAT 		0x1b36

+
  #define PCI_VENDOR_ID_TEKRAM  0x1de1
  #define PCI_DEVICE_ID_TEKRAM_DC2900xdc29
  



I suggest to merge this patch with patch 4
that uses it.


Thanks,
Marcel



Re: [Qemu-devel] [RFC PATCH v2 3/4] pci: add QEMU-specific PCI capability structure

2017-07-23 Thread Alexander Bezzubikov
2017-07-23 5:44 GMT+03:00 Michael S. Tsirkin :

> On Sun, Jul 23, 2017 at 01:11:49AM +0300, Aleksandr Bezzubikov wrote:
> > On PCI init PCI bridge devices may need some
> > extra info about bus number to reserve, IO, memory and
> > prefetchable memory limits. QEMU can provide this
> > with special vendor-specific PCI capability.
> >
> > This capability is intended to be used only
> > for Red Hat PCI bridges, i.e. QEMU cooperation.
> >
> > Sizes of limits match ones from
> > PCI Type 1 Configuration Space Header,
> > number of buses to reserve occupies only 1 byte
> > since it is the size of Subordinate Bus Number register.
> >
> > Signed-off-by: Aleksandr Bezzubikov 
> > ---
> >  src/hw/pci_cap.h | 23 +++
> >  1 file changed, 23 insertions(+)
> >  create mode 100644 src/hw/pci_cap.h
> >
> > diff --git a/src/hw/pci_cap.h b/src/hw/pci_cap.h
> > new file mode 100644
> > index 000..1382b0b
> > --- /dev/null
> > +++ b/src/hw/pci_cap.h
> > @@ -0,0 +1,23 @@
> > +#ifndef _PCI_CAP_H
> > +#define _PCI_CAP_H
> > +
> > +#include "types.h"
> > +
> > +struct vendor_pci_cap {
> > +u8 id;
> > +u8 next;
> > +u8 len;
> > +};
> > +
> > +struct redhat_pci_bridge_cap {
> > +struct vendor_pci_cap hdr;
>
> Hi Michael,
Thanks for the quick reply.


> You want to add some kind of identifier here after
> the header, such that more capabilities can be added
> in future without breaking this one.
>

You mean to distinguish different vendor-specific capabilities?
Agreed if so, will add it in the next version.


>
> > +u8 bus_res;
> > +u32 pref_lim_upper;
> > +u16 pref_lim;
> > +u16 mem_lim;
> > +u16 io_lim_upper;
> > +u8 io_lim;
> > +u8 padd;
>
> Please add documentation.
>
>
> > +};
> > +
> > +#endif /* _PCI_CAP_H */
> > --
> > 2.7.4
>



-- 
Alexander Bezzubikov


Re: [Qemu-devel] [RFC PATCH v2 4/6] hw/pci: introduce bridge-only vendor-specific capability to provide some hints to firmware

2017-07-23 Thread Alexander Bezzubikov
2017-07-23 18:57 GMT+03:00 Marcel Apfelbaum :

> On 23/07/2017 1:15, Aleksandr Bezzubikov wrote:
>
>> On PCI init PCI bridges may need some
>> extra info about bus number to reserve, IO, memory and
>> prefetchable memory limits. QEMU can provide this
>> with special vendor-specific PCI capability.
>>
>> Sizes of limits match ones from
>> PCI Type 1 Configuration Space Header,
>> number of buses to reserve occupies only 1 byte
>> since it is the size of Subordinate Bus Number register.
>>
>>
> Hi Alexandr,
>
> Signed-off-by: Aleksandr Bezzubikov 
>> ---
>>   hw/pci/pci_bridge.c | 27 +++
>>   include/hw/pci/pci_bridge.h | 18 ++
>>   2 files changed, 45 insertions(+)
>>
>> diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
>> index 720119b..8ec6c2c 100644
>> --- a/hw/pci/pci_bridge.c
>> +++ b/hw/pci/pci_bridge.c
>> @@ -408,6 +408,33 @@ void pci_bridge_map_irq(PCIBridge *br, const char*
>> bus_name,
>>   br->bus_name = bus_name;
>>   }
>>   +
>> +int pci_bridge_help_cap_init(PCIDevice *dev, int cap_offset,
>>
>
> Can you please rename to something like
> 'pci_bridge_qemu_cap_init' to be more specific?
>
> +  uint8_t bus_reserve, uint32_t io_limit,
>> +  uint16_t mem_limit, uint64_t pref_limit,
>>
>
>
> I am not sure regarding "limit" suffix, this
> is a reservation, not a limitation.


I'd like this fields names to match actual registers which
are going to get the values.


>
>
> +  Error **errp)
>> +{
>> +size_t cap_len = sizeof(PCIBridgeQemuCap);
>> +PCIBridgeQemuCap cap;
>> +
>> +cap.len = cap_len;
>> +cap.bus_res = bus_reserve;
>> +cap.io_lim = io_limit & 0xFF;
>> +cap.io_lim_upper = io_limit >> 8 & 0x;
>> +cap.mem_lim = mem_limit;
>> +cap.pref_lim = pref_limit & 0x;
>> +cap.pref_lim_upper = pref_limit >> 16 & 0x;
>> +
>> +int offset = pci_add_capability(dev, PCI_CAP_ID_VNDR,
>> +cap_offset, cap_len, errp);
>> +if (offset < 0) {
>> +return offset;
>> +}
>> +
>> +memcpy(dev->config + offset + 2, (char *)&cap + 2, cap_len - 2);
>> +return 0;
>> +}
>> +
>>   static const TypeInfo pci_bridge_type_info = {
>>   .name = TYPE_PCI_BRIDGE,
>>   .parent = TYPE_PCI_DEVICE,
>> diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
>> index ff7cbaa..c9f642c 100644
>> --- a/include/hw/pci/pci_bridge.h
>> +++ b/include/hw/pci/pci_bridge.h
>> @@ -67,4 +67,22 @@ void pci_bridge_map_irq(PCIBridge *br, const char*
>> bus_name,
>>   #define  PCI_BRIDGE_CTL_DISCARD_STATUS0x400   /* Discard timer
>> status */
>>   #define  PCI_BRIDGE_CTL_DISCARD_SERR  0x800   /* Discard timer SERR#
>> enable */
>>   +typedef struct PCIBridgeQemuCap {
>> +uint8_t id; /* Standard PCI capability header field */
>> +uint8_t next;   /* Standard PCI capability header field */
>> +uint8_t len;/* Standard PCI vendor-specific capability header
>> field */
>> +uint8_t bus_res;
>> +uint32_t pref_lim_upper;
>> +uint16_t pref_lim;
>> +uint16_t mem_lim;
>>
>
> This 32bit IOMEM, right?
>
> +uint16_t io_lim_upper;
>> +uint8_t io_lim;
>>
>
> Why do we need io_lim and io_lim_upper?
>

The idea was to be able to directly move the capability fields values
to the registers when actually using it (in firmware) code.
Secondly, it saves a little space by avoding usage 32-bit types
when 24-bit ones are desired. And the same thing with prefetchable (48->64).
But if it's more convenient no to split this value, I can do that.



>
> Thanks,
> Marcel
>
>
> +uint8_t padding;
>> +} PCIBridgeQemuCap;
>> +
>> +int pci_bridge_help_cap_init(PCIDevice *dev, int cap_offset,
>> +  uint8_t bus_reserve, uint32_t io_limit,
>> +  uint16_t mem_limit, uint64_t pref_limit,
>> +  Error **errp);
>> +
>>   #endif /* QEMU_PCI_BRIDGE_H */
>>
>>
>


-- 
Alexander Bezzubikov


Re: [Qemu-devel] [RFC PATCH v2 4/6] hw/pci: introduce bridge-only vendor-specific capability to provide some hints to firmware

2017-07-23 Thread Marcel Apfelbaum

On 23/07/2017 19:19, Alexander Bezzubikov wrote:
2017-07-23 18:57 GMT+03:00 Marcel Apfelbaum >:


On 23/07/2017 1:15, Aleksandr Bezzubikov wrote:

On PCI init PCI bridges may need some
extra info about bus number to reserve, IO, memory and
prefetchable memory limits. QEMU can provide this
with special vendor-specific PCI capability.

Sizes of limits match ones from
PCI Type 1 Configuration Space Header,
number of buses to reserve occupies only 1 byte
since it is the size of Subordinate Bus Number register.


Hi Alexandr,

Signed-off-by: Aleksandr Bezzubikov mailto:zuban...@gmail.com>>
---
   hw/pci/pci_bridge.c | 27 +++
   include/hw/pci/pci_bridge.h | 18 ++
   2 files changed, 45 insertions(+)

diff --git a/hw/pci/pci_bridge.c b/hw/pci/pci_bridge.c
index 720119b..8ec6c2c 100644
--- a/hw/pci/pci_bridge.c
+++ b/hw/pci/pci_bridge.c
@@ -408,6 +408,33 @@ void pci_bridge_map_irq(PCIBridge *br,
const char* bus_name,
   br->bus_name = bus_name;
   }
   +
+int pci_bridge_help_cap_init(PCIDevice *dev, int cap_offset,


Can you please rename to something like
'pci_bridge_qemu_cap_init' to be more specific?

+  uint8_t bus_reserve, uint32_t
io_limit,
+  uint16_t mem_limit, uint64_t
pref_limit,



I am not sure regarding "limit" suffix, this
is a reservation, not a limitation.


I'd like this fields names to match actual registers which
are going to get the values.



For this case I think is better to have io_res..., to describe
the parameter rather than match the registers.




+  Error **errp)
+{
+size_t cap_len = sizeof(PCIBridgeQemuCap);
+PCIBridgeQemuCap cap;
+
+cap.len = cap_len;
+cap.bus_res = bus_reserve;
+cap.io_lim = io_limit & 0xFF;
+cap.io_lim_upper = io_limit >> 8 & 0x;
+cap.mem_lim = mem_limit;
+cap.pref_lim = pref_limit & 0x;
+cap.pref_lim_upper = pref_limit >> 16 & 0x;
+
+int offset = pci_add_capability(dev, PCI_CAP_ID_VNDR,
+cap_offset, cap_len, errp);
+if (offset < 0) {
+return offset;
+}
+
+memcpy(dev->config + offset + 2, (char *)&cap + 2, cap_len
- 2);
+return 0;
+}
+
   static const TypeInfo pci_bridge_type_info = {
   .name = TYPE_PCI_BRIDGE,
   .parent = TYPE_PCI_DEVICE,
diff --git a/include/hw/pci/pci_bridge.h
b/include/hw/pci/pci_bridge.h
index ff7cbaa..c9f642c 100644
--- a/include/hw/pci/pci_bridge.h
+++ b/include/hw/pci/pci_bridge.h
@@ -67,4 +67,22 @@ void pci_bridge_map_irq(PCIBridge *br, const
char* bus_name,
   #define  PCI_BRIDGE_CTL_DISCARD_STATUS0x400   /*
Discard timer status */
   #define  PCI_BRIDGE_CTL_DISCARD_SERR  0x800   /* Discard
timer SERR# enable */
   +typedef struct PCIBridgeQemuCap {
+uint8_t id; /* Standard PCI capability header field */
+uint8_t next;   /* Standard PCI capability header field */
+uint8_t len;/* Standard PCI vendor-specific capability
header field */
+uint8_t bus_res;
+uint32_t pref_lim_upper;
+uint16_t pref_lim;
+uint16_t mem_lim;


This 32bit IOMEM, right?

+uint16_t io_lim_upper;
+uint8_t io_lim;


Why do we need io_lim and io_lim_upper?


The idea was to be able to directly move the capability fields values
to the registers when actually using it (in firmware) code.
Secondly, it saves a little space by avoding usage 32-bit types
when 24-bit ones are desired. And the same thing with prefetchable (48->64).
But if it's more convenient no to split this value, I can do that.


With a clear explanation (Mimic of the )
I personally don't mind keeping it like that.

Thanks,
Marcel




Thanks,
Marcel


+uint8_t padding;
+} PCIBridgeQemuCap;
+
+int pci_bridge_help_cap_init(PCIDevice *dev, int cap_offset,
+  uint8_t bus_reserve, uint32_t
io_limit,
+  uint16_t mem_limit, uint64_t
pref_limit,
+  Error **errp);
+
   #endif /* QEMU_PCI_BRIDGE_H */





--
Alexander Bezzubikov





Re: [Qemu-devel] [RFC PATCH v2 1/4] pci: refactor pci_find_capapibilty to get bdf as the first argument instead of the whole pci_device

2017-07-23 Thread Kevin O'Connor
On Sun, Jul 23, 2017 at 01:11:47AM +0300, Aleksandr Bezzubikov wrote:
> Refactor pci_find_capability function to get bdf instead of
> a whole pci_device* as the only necessary field for this function 
> is still bdf.
> It greatly helps when we have bdf but not pci_device.
> 
> Signed-off-by: Aleksandr Bezzubikov 
> ---
>  src/fw/pciinit.c|  4 ++--
>  src/hw/pcidevice.c  | 12 ++--
>  src/hw/pcidevice.h  |  2 +-
>  src/hw/virtio-pci.c |  4 ++--
>  4 files changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
> index 08221e6..864954f 100644
> --- a/src/fw/pciinit.c
> +++ b/src/fw/pciinit.c
> @@ -762,7 +762,7 @@ static int pci_bus_hotplug_support(struct pci_bus *bus, 
> u8 pcie_cap)
>  return downstream_port && slot_implemented;
>  }
>  
> -shpc_cap = pci_find_capability(bus->bus_dev, PCI_CAP_ID_SHPC, 0);
> +shpc_cap = pci_find_capability(bus->bus_dev->bdf, PCI_CAP_ID_SHPC, 0);
>  return !!shpc_cap;
>  }
>  
> @@ -844,7 +844,7 @@ static int pci_bios_check_devices(struct pci_bus *busses)
>   */
>  parent = &busses[0];
>  int type;
> -u8 pcie_cap = pci_find_capability(s->bus_dev, PCI_CAP_ID_EXP, 0);
> +u8 pcie_cap = pci_find_capability(s->bus_dev->bdf, PCI_CAP_ID_EXP, 
> 0);
>  int hotplug_support = pci_bus_hotplug_support(s, pcie_cap);
>  for (type = 0; type < PCI_REGION_TYPE_COUNT; type++) {
>  u64 align = (type == PCI_REGION_TYPE_IO) ?
> diff --git a/src/hw/pcidevice.c b/src/hw/pcidevice.c
> index cfebf66..d01e27b 100644
> --- a/src/hw/pcidevice.c
> +++ b/src/hw/pcidevice.c
> @@ -134,25 +134,25 @@ pci_find_init_device(const struct pci_device_id *ids, 
> void *arg)
>  return NULL;
>  }
>  
> -u8 pci_find_capability(struct pci_device *pci, u8 cap_id, u8 cap)
> +u8 pci_find_capability(u16 bdf, u8 cap_id, u8 cap)

Thanks.  If you respin this series, please also move
pci_find_capability() function from pcidevice.c to pci.c.  (If the
function no longer takes a pci_device, it should be moved out of
pcidevice.c.)

-Kevin



Re: [Qemu-devel] >256 Virtio-net-pci hotplug Devices

2017-07-23 Thread Marcel Apfelbaum

On 22/07/2017 2:57, Kinsella, Ray wrote:


Hi Marcel




Hi Ray,


On 21/07/2017 01:33, Marcel Apfelbaum wrote:

On 20/07/2017 3:44, Kinsella, Ray wrote:
That's strange. Please ensure the virtio devices are working in
virtio 1.0 mode (disable-modern=0,disable-legacy=1).
Let us know any problems you see.


Not sure what yet, I will try scaling it with hotplugging tomorrow.



Updates?


I have managed to scale it to 128 devices.
The kernel does complain about IO address space exhaustion.

[   83.697956] pci :80:00.0: BAR 13: no space for [io  size 0x1000]
[   83.700958] pci :80:00.0: BAR 13: failed to assign [io  size 0x1000]
[   83.701689] pci :80:00.1: BAR 13: no space for [io  size 0x1000]
[   83.702378] pci :80:00.1: BAR 13: failed to assign [io  size 0x1000]
[   83.703093] pci :80:00.2: BAR 13: no space for [io  size 0x1000]

I was surprised that I am running out of IO address space, as I am 
disabling legacy virtio. I assumed that this would remove the need for 
SeaBIOS to allocate the PCI Express Root Port IO address space.


Indeed, SeeBIOS does not reserve IO ports in this case, but Linux kernel
still decides ""it knows better" and tries to allocate IO resources
anyway. It does not affect the "modern" virtio-net devices because
they don't need IO ports anyway.

One way to work around the error message is to have the PCIe Root Port
have the corresponding IO headers read-only since IO support is
optional. I tried this some time ago, I'll get back to it.

In any 
case - it doesn't stop the virtio-net device coming up and working as 
expected.




Right.


[  668.692081] virtio_net virtio103 enp141s0f4: renamed from eth101
[  668.707114] virtio_net virtio130 enp144s0f7: renamed from eth128
[  668.719795] virtio_net virtio129 enp144s0f6: renamed from eth127

I encountered some issues in vhost, due to open file exhaustion but 
resolved these with 'ulimit' in the usual way - burned alot of time on 
that today.


When scaling up to 512 Virtio-net devices SeaBIOS appears to really slow 
down when configuring PCI Config space - haven't manage to get this to 
work yet.




Adding SeaBIOS mailing list and maintainers, maybe there is a known
issue about 500+ PCI devices configuration.


Not really. All you have to do is to add a property to the pxb-pci/pxb
devices: pci_domain=x; then update the ACPI table to include the pxb
domain. You also have to tweak a little the pxb-pcie/pxb devices
to not share the bus numbers if pci_domain > 0.


Thanks for information, will add to the list.



Is also on my todo list :)

Thanks,
Marcel


Ray K
\





Re: [Qemu-devel] [RFC PATCH v2 3/4] pci: add QEMU-specific PCI capability structure

2017-07-23 Thread Kevin O'Connor
On Sun, Jul 23, 2017 at 01:11:49AM +0300, Aleksandr Bezzubikov wrote:
> On PCI init PCI bridge devices may need some
> extra info about bus number to reserve, IO, memory and
> prefetchable memory limits. QEMU can provide this
> with special vendor-specific PCI capability.
> 
> This capability is intended to be used only
> for Red Hat PCI bridges, i.e. QEMU cooperation.
> 
> Sizes of limits match ones from 
> PCI Type 1 Configuration Space Header,
> number of buses to reserve occupies only 1 byte 
> since it is the size of Subordinate Bus Number register.
> 
> Signed-off-by: Aleksandr Bezzubikov 
> ---
>  src/hw/pci_cap.h | 23 +++
>  1 file changed, 23 insertions(+)
>  create mode 100644 src/hw/pci_cap.h
> 
> diff --git a/src/hw/pci_cap.h b/src/hw/pci_cap.h
> new file mode 100644
> index 000..1382b0b
> --- /dev/null
> +++ b/src/hw/pci_cap.h
> @@ -0,0 +1,23 @@
> +#ifndef _PCI_CAP_H
> +#define _PCI_CAP_H
> +
> +#include "types.h"
> +
> +struct vendor_pci_cap {
> +u8 id;
> +u8 next;
> +u8 len;
> +};

Thanks.  If you respin this series, please add this header to the
src/fw/ directory instead of src/hw/.  Also, I'd prefer to avoid a
"pci_" prefix on the header as it makes it seem similar to the
existing pci_regs.h and pci_ids.h headers which are a bit different -
how about src/fw/dev-pci.h instead?

-Kevin



Re: [Qemu-devel] [RFC PATCH v2 4/4] pci: enable RedHat PCI bridges to reserve additional buses on PCI init

2017-07-23 Thread Alexander Bezzubikov
2017-07-23 5:49 GMT+03:00 Michael S. Tsirkin :

> On Sun, Jul 23, 2017 at 01:11:50AM +0300, Aleksandr Bezzubikov wrote:
> > In case of Red Hat PCI bridges reserve additional buses, which number is
> provided
> > in a vendor-specific capability.
> >
> > Signed-off-by: Aleksandr Bezzubikov 
> > ---
> >  src/fw/pciinit.c | 14 --
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
> > index 864954f..f05a8b9 100644
> > --- a/src/fw/pciinit.c
> > +++ b/src/fw/pciinit.c
> > @@ -15,6 +15,7 @@
> >  #include "hw/pcidevice.h" // pci_probe_devices
> >  #include "hw/pci_ids.h" // PCI_VENDOR_ID_INTEL
> >  #include "hw/pci_regs.h" // PCI_COMMAND
> > +#include "hw/pci_cap.h" // qemu_pci_cap
> >  #include "list.h" // struct hlist_node
> >  #include "malloc.h" // free
> >  #include "output.h" // dprintf
> > @@ -578,9 +579,18 @@ pci_bios_init_bus_rec(int bus, u8 *pci_bus)
> >  pci_bios_init_bus_rec(secbus, pci_bus);
> >
> >  if (subbus != *pci_bus) {
> > +u8 res_bus = 0;
> > +if (pci_config_readw(bdf, PCI_VENDOR_ID) ==
> PCI_VENDOR_ID_REDHAT) {
>
> Check device ID as well.

Not sure what ID/IDs we want to see here exactly. Now it can be
only pcie-root-port (0xC then), but


>
> > +u8 cap = pci_find_capability(bdf, PCI_CAP_ID_VNDR, 0);
>
> There could be multiple vendor capabilities. You want to scan them all.
>
>
> > +if (cap) {
> > +res_bus = pci_config_readb(bdf,
> > +cap + offsetof(struct redhat_pci_bridge_cap,
> > +   bus_res));
>
>
> You might want to add sanity checks e.g. overflow, and capability
> length.
>
> Also, if all you use is offsetof, don't bother with a struct, just
> add some defines.
>
OK, will move this to defines.


>
> > +}
> > +}
> >  dprintf(1, "PCI: subordinate bus = 0x%x -> 0x%x\n",
> > -subbus, *pci_bus);
> > -subbus = *pci_bus;
> > +subbus, *pci_bus + res_bus);
> > +subbus = *pci_bus + res_bus;
>
> So you take all present devices and add reserved ones - is that it?
> If so it looks like this will steal extra buses each time you
> add a child bus and reboot.
>

You're right, such problem persists. Will think on what can be done here.


>
>
> >  } else {
> >  dprintf(1, "PCI: subordinate bus = 0x%x\n", subbus);
> >  }
> > --
> > 2.7.4
>



-- 
Alexander Bezzubikov


Re: [Qemu-devel] [RFC PATCH v2 3/4] pci: add QEMU-specific PCI capability structure

2017-07-23 Thread Alexander Bezzubikov
2017-07-23 19:30 GMT+03:00 Kevin O'Connor :

> On Sun, Jul 23, 2017 at 01:11:49AM +0300, Aleksandr Bezzubikov wrote:
> > On PCI init PCI bridge devices may need some
> > extra info about bus number to reserve, IO, memory and
> > prefetchable memory limits. QEMU can provide this
> > with special vendor-specific PCI capability.
> >
> > This capability is intended to be used only
> > for Red Hat PCI bridges, i.e. QEMU cooperation.
> >
> > Sizes of limits match ones from
> > PCI Type 1 Configuration Space Header,
> > number of buses to reserve occupies only 1 byte
> > since it is the size of Subordinate Bus Number register.
> >
> > Signed-off-by: Aleksandr Bezzubikov 
> > ---
> >  src/hw/pci_cap.h | 23 +++
> >  1 file changed, 23 insertions(+)
> >  create mode 100644 src/hw/pci_cap.h
> >
> > diff --git a/src/hw/pci_cap.h b/src/hw/pci_cap.h
> > new file mode 100644
> > index 000..1382b0b
> > --- /dev/null
> > +++ b/src/hw/pci_cap.h
> > @@ -0,0 +1,23 @@
> > +#ifndef _PCI_CAP_H
> > +#define _PCI_CAP_H
> > +
> > +#include "types.h"
> > +
> > +struct vendor_pci_cap {
> > +u8 id;
> > +u8 next;
> > +u8 len;
> > +};
>
> Thanks.

Thanks for the review.


> If you respin this series, please add this header to the
> src/fw/ directory instead of src/hw/.  Also, I'd prefer to avoid a
> "pci_" prefix on the header as it makes it seem similar to the
> existing pci_regs.h and pci_ids.h headers which are a bit different -
> how about src/fw/dev-pci.h instead?
>
> -Kevin
>

No objections, Will do that in v3.
Except 'pci_' prefix - it's still a PCI capability, isn't it?


-- 
Alexander Bezzubikov


Re: [Qemu-devel] [RFC PATCH v2 2/6] hw/i386: allow SHPC for Q35 machine

2017-07-23 Thread Alexander Bezzubikov
2017-07-23 18:59 GMT+03:00 Marcel Apfelbaum :

> On 23/07/2017 1:15, Aleksandr Bezzubikov wrote:
>
>> Unmask previously masked SHPC feature in _OSC method.
>>
>> Signed-off-by: Aleksandr Bezzubikov 
>> ---
>>   hw/i386/acpi-build.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
>> index 6b7bade..0d99585 100644
>> --- a/hw/i386/acpi-build.c
>> +++ b/hw/i386/acpi-build.c
>> @@ -1850,7 +1850,7 @@ static Aml *build_q35_osc_method(void)
>>* Always allow native PME, AER (no dependencies)
>>* Never allow SHPC (no SHPC controller in this system)
>>
>
> Seems the above comment is not longer correct :)
>
Just missed it, will fix in v3.


>
> Thanks,
> Marcel
>
>
>*/
>> -aml_append(if_ctx, aml_and(a_ctrl, aml_int(0x1D), a_ctrl));
>> +aml_append(if_ctx, aml_and(a_ctrl, aml_int(0x1F), a_ctrl));
>> if_ctx2 = aml_if(aml_lnot(aml_equal(aml_arg(1), aml_int(1;
>>   /* Unknown revision */
>>
>>
>


-- 
Alexander Bezzubikov


Re: [Qemu-devel] [PATCH v2 3/5] target/s390x: Rework program_interrupt() and related functions

2017-07-23 Thread Richard Henderson

On 07/21/2017 04:45 AM, Thomas Huth wrote:

+static void tcg_s390_program_interrupt(CPUS390XState *env, uint32_t code,
+   int ilen)
+{
+trigger_pgm_exception(env, code, ilen);
+#ifdef CONFIG_TCG
+cpu_loop_exit(CPU(s390_env_get_cpu(env)));
+#endif
+}
+


How about the whole body as g_assert_not_reached ifndef.

Otherwise,

Reviewed-by: Richard Henderson 


r~



Re: [Qemu-devel] [PATCH v2 4/5] target/s390x: Move exception-related functions to a new excp_helper.c file

2017-07-23 Thread Richard Henderson

On 07/21/2017 04:45 AM, Thomas Huth wrote:

These functions can not be compiled with --disable-tcg. But since we
need the other functions from helper.c in the non-tcg build, we can also
not simply remove helper.c from the non-tcg builds. Thus the problematic
functions have to be moved into a separate new file instead that we
can later omit in the non-tcg builds.

Signed-off-by: Thomas Huth
---
  target/s390x/Makefile.objs |   2 +-
  target/s390x/cpu.h |   8 +
  target/s390x/excp_helper.c | 515 +
  target/s390x/helper.c  | 472 +
  4 files changed, 528 insertions(+), 469 deletions(-)
  create mode 100644 target/s390x/excp_helper.c


Reviewed-by: Richard Henderson 


r~



Re: [Qemu-devel] [PATCH v2 5/5] target/s390x: Add remaining switches to compile with --disable-tcg

2017-07-23 Thread Richard Henderson

On 07/21/2017 07:45 AM, Thomas Huth wrote:

Adding some CONFIG_TCG tests to be finally able to compile QEMU
on s390x also without TCG.

Signed-off-by: Thomas Huth
---
  target/s390x/Makefile.objs | 6 +++---
  target/s390x/cpu.c | 4 
  2 files changed, 7 insertions(+), 3 deletions(-)


Reviewed-by: Richard Henderson 


r~



Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion

2017-07-23 Thread Rik van Riel
On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
> [ adding Ross and Jan ]
> 
> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel 
> wrote:
> > 
> > The goal is to increase density of guests, by moving page
> > cache into the host (where it can be easily reclaimed).
> > 
> > If we assume the guests will be backed by relatively fast
> > SSDs, a "whole device flush" from filesystem journaling
> > code (issued where the filesystem issues a barrier or
> > disk cache flush today) may be just what we need to make
> > that work.
> 
> Ok, apologies, I indeed had some pieces of the proposal confused.
> 
> However, it still seems like the storage interface is not capable of
> expressing what is needed, because the operation that is needed is a
> range flush. In the guest you want the DAX page dirty tracking to
> communicate range flush information to the host, but there's no
> readily available block i/o semantic that software running on top of
> the fake pmem device can use to communicate with the host. Instead
> you
> want to intercept the dax_flush() operation and turn it into a queued
> request on the host.
> 
> In 4.13 we have turned this dax_flush() operation into an explicit
> driver call. That seems a better interface to modify than trying to
> map block-storage flush-cache / force-unit-access commands to this
> host request.
> 
> The additional piece you would need to consider is whether to track
> all writes in addition to mmap writes in the guest as DAX-page-cache
> dirtying events, or arrange for every dax_copy_from_iter()
> operation()
> to also queue a sync on the host, but that essentially turns the host
> page cache into a pseudo write-through mode.

I suspect initially it will be fine to not offer DAX
semantics to applications using these "fake DAX" devices
from a virtual machine, because the DAX APIs are designed
for a much higher performance device than these fake DAX
setups could ever give.

Having userspace call fsync/msync like done normally, and
having those coarser calls be turned into somewhat efficient
backend flushes would be perfectly acceptable.

The big question is, what should that kind of interface look
like?



Re: [Qemu-devel] [RFC PATCH v2 4/4] pci: enable RedHat PCI bridges to reserve additional buses on PCI init

2017-07-23 Thread Alexander Bezzubikov
By the way, any ideas on how to avoid 'bus overstealing' would
be greatly appreciated.
Static BIOS variable isn't applicable since its value isn't saved across
reboots.

2017-07-23 19:43 GMT+03:00 Alexander Bezzubikov :

> 2017-07-23 5:49 GMT+03:00 Michael S. Tsirkin :
>
>> On Sun, Jul 23, 2017 at 01:11:50AM +0300, Aleksandr Bezzubikov wrote:
>> > In case of Red Hat PCI bridges reserve additional buses, which number
>> is provided
>> > in a vendor-specific capability.
>> >
>> > Signed-off-by: Aleksandr Bezzubikov 
>> > ---
>> >  src/fw/pciinit.c | 14 --
>> >  1 file changed, 12 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/src/fw/pciinit.c b/src/fw/pciinit.c
>> > index 864954f..f05a8b9 100644
>> > --- a/src/fw/pciinit.c
>> > +++ b/src/fw/pciinit.c
>> > @@ -15,6 +15,7 @@
>> >  #include "hw/pcidevice.h" // pci_probe_devices
>> >  #include "hw/pci_ids.h" // PCI_VENDOR_ID_INTEL
>> >  #include "hw/pci_regs.h" // PCI_COMMAND
>> > +#include "hw/pci_cap.h" // qemu_pci_cap
>> >  #include "list.h" // struct hlist_node
>> >  #include "malloc.h" // free
>> >  #include "output.h" // dprintf
>> > @@ -578,9 +579,18 @@ pci_bios_init_bus_rec(int bus, u8 *pci_bus)
>> >  pci_bios_init_bus_rec(secbus, pci_bus);
>> >
>> >  if (subbus != *pci_bus) {
>> > +u8 res_bus = 0;
>> > +if (pci_config_readw(bdf, PCI_VENDOR_ID) ==
>> PCI_VENDOR_ID_REDHAT) {
>>
>> Check device ID as well.
>
> Not sure what ID/IDs we want to see here exactly. Now it can be
> only pcie-root-port (0xC then), but
>
>
>>
>> > +u8 cap = pci_find_capability(bdf, PCI_CAP_ID_VNDR, 0);
>>
>> There could be multiple vendor capabilities. You want to scan them all.
>>
>>
>> > +if (cap) {
>> > +res_bus = pci_config_readb(bdf,
>> > +cap + offsetof(struct
>> redhat_pci_bridge_cap,
>> > +   bus_res));
>>
>>
>> You might want to add sanity checks e.g. overflow, and capability
>> length.
>>
>> Also, if all you use is offsetof, don't bother with a struct, just
>> add some defines.
>>
> OK, will move this to defines.
>
>
>>
>> > +}
>> > +}
>> >  dprintf(1, "PCI: subordinate bus = 0x%x -> 0x%x\n",
>> > -subbus, *pci_bus);
>> > -subbus = *pci_bus;
>> > +subbus, *pci_bus + res_bus);
>> > +subbus = *pci_bus + res_bus;
>>
>> So you take all present devices and add reserved ones - is that it?
>> If so it looks like this will steal extra buses each time you
>> add a child bus and reboot.
>>
>
> You're right, such problem persists. Will think on what can be done here.
>
>
>>
>>
>> >  } else {
>> >  dprintf(1, "PCI: subordinate bus = 0x%x\n", subbus);
>> >  }
>> > --
>> > 2.7.4
>>
>
>
>
> --
> Alexander Bezzubikov
>



-- 
Alexander Bezzubikov


Re: [Qemu-devel] [PATCH 00/11] Make memory_region_init_ram() and friends handle migration

2017-07-23 Thread Peter Maydell
On 22 July 2017 at 05:47, Philippe Mathieu-Daudé  wrote:
> I'm seeing memleaks using the malta machine, they come from the
> smbus_eeprom_init() in hw/i2c/smbus_eeprom.c which does:
>
> uint8_t *eeprom_buf = g_malloc0(8 * 256); /* XXX: make this persistent
> */

(This isn't actually a leak because the eeprom device
can't be hotplugged/unplugged, so once created it will
live for the life of the simulation.)

> Question for 2.11: is this device a candidate to use one of the function you
> enumerated? My guess is memory_region_init_rom_device()

A quick look at that device suggests that it's doing some
pretty weird stuff which probably should be being
cleaned up to use a memory region to handle the backing
store for the EEPROM (although there are other ways to do it;
since there's only 2K of data there it could also be
handled by a byte-array in the device migrated via
VMState).

Do you know what the intention is of the code which
passes the device an offset into the eeprom_buf buffer
rather than just passing the start of the buffer ?

> Question for 2.10: does that mean than the malta machine is not migratable?

Depends whether the guest is actually writing to the EEPROM :-)
The smbus_eeprom device doesn't have any support for migration
at all -- it is neither migrating the data, nor the internal
state (the offset). But if the guest doesn't ever write to
the EEPROM and it doesn't happen to be relying on the offset
having a particular value at the point of migration, you
won't notice.

In particular, the pc_piix and pc_q35 machines both create
one of these devices, so this is an issue for the pc
machine migration too... (and any fix we make to
smbus_eeprom needs to handle migration compat as a result)

thanks
-- PMM



Re: [Qemu-devel] [PATCH 1/5] Makefile: clean: Clean exe files

2017-07-23 Thread Peter Maydell
On 5 July 2017 at 08:54, Sameeh Jubran  wrote:
> From: Sameeh Jubran 
>
> Clean exe files such as qemu-ga.exe
>
> Signed-off-by: Sameeh Jubran 
> ---
>  Makefile | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/Makefile b/Makefile
> index 16a0430..22d29d6 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -487,6 +487,7 @@ clean:
> rm -f config.mak op-i386.h opc-i386.h gen-op-i386.h op-arm.h 
> opc-arm.h gen-op-arm.h
> rm -f qemu-options.def
> rm -f *.msi
> +   rm -f *${EXESUF}

On every host OS except Windows, ${EXESUF} is defined
to be the empty string, which means that this will be
"rm -f *", which is probably not what we want...

thanks
-- PMM



Re: [Qemu-devel] >256 Virtio-net-pci hotplug Devices

2017-07-23 Thread Kevin O'Connor
On Sun, Jul 23, 2017 at 07:28:01PM +0300, Marcel Apfelbaum wrote:
> On 22/07/2017 2:57, Kinsella, Ray wrote:
> > When scaling up to 512 Virtio-net devices SeaBIOS appears to really slow
> > down when configuring PCI Config space - haven't manage to get this to
> > work yet.

If there is a slowdown in SeaBIOS, it would help to produce a log with
timing information - see:
https://www.seabios.org/Debugging#Timing_debug_messages

It may also help to increase the debug level in SeaBIOS to get more
fine grained timing reports.

-Kevin



[Qemu-devel] [PATCH] hw/core/loader: do not check for regions overlap

2017-07-23 Thread Hua Yanghao
>From 84f25a8e4269f44255a8037837fdaa6e5404b76e Mon Sep 17 00:00:00 2001
From: Hua Yanghao 
Date: Sun, 23 Jul 2017 21:48:21 +0200
Subject: [PATCH] hw/core/loader: do not check for regions overlap

There is a use case where regions are overlapped on purpose.
It should be up to the linker to check for regions overlap and qemu should
better not to do so.

For example:
SECTIONS
{
.. /* the normal text/data */
_on_chip_ram_load_start = .;
. = . + SIZEOF(.on_chip_ram_section);
_on_chip_ram_load_end = .;

.bss (NOLOAD) {
*(.bss*)
*(COM*)
}
.. /* other things */
}

with current hw/core/loader implementation, .bss section consumes no LMU address
space as .bss is not a loadable section, it is only useful for VMA at runtime.
Where the .on_chip_ram_section is told to be loaded (LMA) from the VMA of .bss
section. And qemu complains and exits. This patch fixes the issue and tested
running fine for the above scenario.
---
 hw/core/loader.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/hw/core/loader.c b/hw/core/loader.c
index c17ace0a2e..9542cb3555 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -1099,26 +1099,17 @@ int rom_check_and_register_reset(void)
 hwaddr addr = 0;
 MemoryRegionSection section;
 Rom *rom;
-AddressSpace *as = NULL;

 QTAILQ_FOREACH(rom, &roms, next) {
 if (rom->fw_file) {
 continue;
 }
-if ((addr > rom->addr) && (as == rom->as)) {
-fprintf(stderr, "rom: requested regions overlap "
-"(rom %s. free=0x" TARGET_FMT_plx
-", addr=0x" TARGET_FMT_plx ")\n",
-rom->name, addr, rom->addr);
-return -1;
-}
 addr  = rom->addr;
 addr += rom->romsize;
 section = memory_region_find(rom->mr ? rom->mr : get_system_memory(),
  rom->addr, 1);
 rom->isrom = int128_nz(section.size) &&
memory_region_is_rom(section.mr);
 memory_region_unref(section.mr);
-as = rom->as;
 }
 qemu_register_reset(rom_reset, NULL);
 roms_loaded = 1;
-- 
2.11.0



Re: [Qemu-devel] KVM "fake DAX" flushing interface - discussion

2017-07-23 Thread Dan Williams
On Sun, Jul 23, 2017 at 11:10 AM, Rik van Riel  wrote:
> On Sun, 2017-07-23 at 09:01 -0700, Dan Williams wrote:
>> [ adding Ross and Jan ]
>>
>> On Sun, Jul 23, 2017 at 7:04 AM, Rik van Riel 
>> wrote:
>> >
>> > The goal is to increase density of guests, by moving page
>> > cache into the host (where it can be easily reclaimed).
>> >
>> > If we assume the guests will be backed by relatively fast
>> > SSDs, a "whole device flush" from filesystem journaling
>> > code (issued where the filesystem issues a barrier or
>> > disk cache flush today) may be just what we need to make
>> > that work.
>>
>> Ok, apologies, I indeed had some pieces of the proposal confused.
>>
>> However, it still seems like the storage interface is not capable of
>> expressing what is needed, because the operation that is needed is a
>> range flush. In the guest you want the DAX page dirty tracking to
>> communicate range flush information to the host, but there's no
>> readily available block i/o semantic that software running on top of
>> the fake pmem device can use to communicate with the host. Instead
>> you
>> want to intercept the dax_flush() operation and turn it into a queued
>> request on the host.
>>
>> In 4.13 we have turned this dax_flush() operation into an explicit
>> driver call. That seems a better interface to modify than trying to
>> map block-storage flush-cache / force-unit-access commands to this
>> host request.
>>
>> The additional piece you would need to consider is whether to track
>> all writes in addition to mmap writes in the guest as DAX-page-cache
>> dirtying events, or arrange for every dax_copy_from_iter()
>> operation()
>> to also queue a sync on the host, but that essentially turns the host
>> page cache into a pseudo write-through mode.
>
> I suspect initially it will be fine to not offer DAX
> semantics to applications using these "fake DAX" devices
> from a virtual machine, because the DAX APIs are designed
> for a much higher performance device than these fake DAX
> setups could ever give.

Right, we don't need DAX, per se, in the guest.

>
> Having userspace call fsync/msync like done normally, and
> having those coarser calls be turned into somewhat efficient
> backend flushes would be perfectly acceptable.
>
> The big question is, what should that kind of interface look
> like?

To me, this looks much like the dirty cache tracking that is done in
the address_space radix for the DAX case, but modified to coordinate
queued / page-based flushing when the guest  wants to persist data.
The similarity to DAX is not storing guest allocated pages in the
radix but entries that track dirty guest physical addresses.



Re: [Qemu-devel] [PATCH] hw/core/loader: do not check for regions overlap

2017-07-23 Thread Peter Maydell
On 23 July 2017 at 21:04, Hua Yanghao  wrote:
> From 84f25a8e4269f44255a8037837fdaa6e5404b76e Mon Sep 17 00:00:00 2001
> From: Hua Yanghao 
> Date: Sun, 23 Jul 2017 21:48:21 +0200
> Subject: [PATCH] hw/core/loader: do not check for regions overlap
>
> There is a use case where regions are overlapped on purpose.
> It should be up to the linker to check for regions overlap and qemu should
> better not to do so.
>
> For example:
> SECTIONS
> {
> .. /* the normal text/data */
> _on_chip_ram_load_start = .;
> . = . + SIZEOF(.on_chip_ram_section);
> _on_chip_ram_load_end = .;
>
> .bss (NOLOAD) {
> *(.bss*)
> *(COM*)
> }
> .. /* other things */
> }

(I'm not sure why you end up with overlapping
program segments here since the bss segment ought
to be marked NOLOAD and ignored by QEMU. But I've
definitely seen overlapping segments myself where
the segments overlapping both are marked as LOAD.)

> with current hw/core/loader implementation, .bss section consumes no LMU 
> address
> space as .bss is not a loadable section, it is only useful for VMA at runtime.
> Where the .on_chip_ram_section is told to be loaded (LMA) from the VMA of .bss
> section. And qemu complains and exits. This patch fixes the issue and tested
> running fine for the above scenario.
> ---

Oddly enough, I just ran into this scenario with an
ELF file myself the other day, and worked around it
with pretty much the same kind of local hack as this.
I agree that since this kind of ELF file with overlapping
segments seems to be quite common we should load it,
rather than complaining.

But...

(1) does this change give the right behaviour for
which of the two overlapping segment is honoured?
(I *think* the correct answer is that the second
segment in the program header table should be
loaded second, ie its definition of the memory
contents is used, not that of the first segment)

(2) should we allow the overlap only for ELF files but
retain the complain for overlapping ROMs of other types?
(eg by having the elf loader create "rom"s which
don't overlap by trimming the overlap itself)

Does anybody know (a) what the ELF spec mandates
for overlapping segments and (b) what the history
and rationale for QEMU's overlapping-roms check is?

thanks
-- PMM



Re: [Qemu-devel] [PATCH] hw/core/loader: do not check for regions overlap

2017-07-23 Thread Hua Yanghao
> (I'm not sure why you end up with overlapping
> program segments here since the bss segment ought
> to be marked NOLOAD and ignored by QEMU. But I've
> definitely seen overlapping segments myself where
> the segments overlapping both are marked as LOAD.)
Looks like qemu is not ignoring NOLOAD seciton check and used its VMA
as if it is LMA ...
If I remove the NOLOAD then .bss consumesthe LMA addresss and
everything is working fine.

> Oddly enough, I just ran into this scenario with an
> ELF file myself the other day, and worked around it
> with pretty much the same kind of local hack as this.
> I agree that since this kind of ELF file with overlapping
> segments seems to be quite common we should load it,
> rather than complaining.
Good to know I am not the only one who hit this issue ;-)

> But...
>
> (1) does this change give the right behaviour for
> which of the two overlapping segment is honoured?
> (I *think* the correct answer is that the second
> segment in the program header table should be
> loaded second, ie its definition of the memory
> contents is used, not that of the first segment)
I am not sure if I understood this point. linker will check for
section overlaps and if linker
did not complain why should qemu complain. qemu should simply follow
the LMA for each
section and ignore NOLOAD section for me this is the sane behavior.

> (2) should we allow the overlap only for ELF files but
> retain the complain for overlapping ROMs of other types?
> (eg by having the elf loader create "rom"s which
> don't overlap by trimming the overlap itself)
I think this should only apply for ELF files. (this is the only thing
I care or understood so far,
so not touching other types of input binaries ... however I am not
sure if I am capable to provide
a right patch to handle the ROM files properly, I only looked into
qemu code starting today ...)

> Does anybody know (a) what the ELF spec mandates
> for overlapping segments and (b) what the history
> and rationale for QEMU's overlapping-roms check is?
>
> thanks
> -- PMM



Re: [Qemu-devel] [PATCH] hw/core/loader: do not check for regions overlap

2017-07-23 Thread Peter Maydell
On 23 July 2017 at 21:58, Hua Yanghao  wrote:
>> (1) does this change give the right behaviour for
>> which of the two overlapping segment is honoured?
>> (I *think* the correct answer is that the second
>> segment in the program header table should be
>> loaded second, ie its definition of the memory
>> contents is used, not that of the first segment)
> I am not sure if I understood this point. linker will check for
> section overlaps and if linker
> did not complain why should qemu complain. qemu should simply follow
> the LMA for each
> section and ignore NOLOAD section for me this is the sane behavior.

Suppose we have these two segments:
 SEGMENT 1: start 0x1000, end 0x2fff, data all 0xff
 SEGMENT 2: start 0x2000, end 0x3fff, data all 0x00

Clearly for the memory 0x1000..0x1fff we want the 0xff
data, and for 0x3000..0x3fff we want 0x00.
But for the memory 0x2000..0x2fff which is in
both segment 1 and segment 2, should QEMU load
0xff or 0x00 bytes ?

We shouldn't pick randomly or just do whatever our
implementation "happens to do" -- we need to look
at what the ELF spec says must happen and do that.

>> (2) should we allow the overlap only for ELF files but
>> retain the complain for overlapping ROMs of other types?
>> (eg by having the elf loader create "rom"s which
>> don't overlap by trimming the overlap itself)
> I think this should only apply for ELF files. (this is the only thing
> I care or understood so far,
> so not touching other types of input binaries ... however I am not
> sure if I am capable to provide
> a right patch to handle the ROM files properly, I only looked into
> qemu code starting today ...)

The patch you have here will affect QEMU's handling
of ROMs of all types, because the loader.c code
handles all the registered ROM images, not just those
that the ELF loader creates from ELF files.

(PS: these questions are partly aimed at the other
QEMU developers who I cc'd, not just you.)

thanks
-- PMM



Re: [Qemu-devel] [PATCH] hw/core/loader: do not check for regions overlap

2017-07-23 Thread Hua Yanghao
> Suppose we have these two segments:
>  SEGMENT 1: start 0x1000, end 0x2fff, data all 0xff
>  SEGMENT 2: start 0x2000, end 0x3fff, data all 0x00
>
> Clearly for the memory 0x1000..0x1fff we want the 0xff
> data, and for 0x3000..0x3fff we want 0x00.
> But for the memory 0x2000..0x2fff which is in
> both segment 1 and segment 2, should QEMU load
> 0xff or 0x00 bytes ?
>
> We shouldn't pick randomly or just do whatever our
> implementation "happens to do" -- we need to look
> at what the ELF spec says must happen and do that.
I don't see how linker could allow that to happen.
If two section overlaps one of them should be of NOLOAD type.
Otherwise linker complains the overlapping of LMA.

So for me just ignore NOLOAD section would do the trick.

> The patch you have here will affect QEMU's handling
> of ROMs of all types, because the loader.c code
> handles all the registered ROM images, not just those
> that the ELF loader creates from ELF files.
>
> (PS: these questions are partly aimed at the other
> QEMU developers who I cc'd, not just you.)
>
Thanks PPM. This is really beyond my knowledge and hope someone could
help making a clean patch!

BR, Yanghao



Re: [Qemu-devel] [PATCH] hw/core/loader: do not check for regions overlap

2017-07-23 Thread Peter Maydell
On 23 July 2017 at 23:11, Hua Yanghao  wrote:
>> Suppose we have these two segments:
>>  SEGMENT 1: start 0x1000, end 0x2fff, data all 0xff
>>  SEGMENT 2: start 0x2000, end 0x3fff, data all 0x00
>>
>> Clearly for the memory 0x1000..0x1fff we want the 0xff
>> data, and for 0x3000..0x3fff we want 0x00.
>> But for the memory 0x2000..0x2fff which is in
>> both segment 1 and segment 2, should QEMU load
>> 0xff or 0x00 bytes ?
>>
>> We shouldn't pick randomly or just do whatever our
>> implementation "happens to do" -- we need to look
>> at what the ELF spec says must happen and do that.
> I don't see how linker could allow that to happen.
> If two section overlaps one of them should be of NOLOAD type.
> Otherwise linker complains the overlapping of LMA.

I have seen ELF files which have this overlap and
where both segments are PT_LOAD. (I think in the cases
I've seen the contents in both segments agree rather than
being different data, but a loader is not going to be
expected to do a comparison of the file data.)

> So for me just ignore NOLOAD section would do the trick.

We already ignore all the ELF segments which are not
of type PT_LOAD:
http://git.qemu.org/?p=qemu.git;a=blob;f=include/hw/elf_ops.h;h=a172a6068a48e233dd802043b3304a9e0a5d3be6;hb=HEAD#l353

If you're hitting this error case then I think the
affected segments must both be of type PT_LOAD.
(If you're not sure you can post here the output
of running 'objdump -p' on the binary or otherwise
show us the program header.)

PS: you're consistently saying "section", but in the
ELF format "section" and "segment" are two different
things. QEMU doesn't actually look at the section table.
In an ELF file, sections are used by the linker, but a
program loader like QEMU (or the Linux kernel)
looks only at the segment table in the program header.

thanks
-- PMM



[Qemu-devel] About virtio device hotplug in Q35!

2017-07-23 Thread Zhong Yang
Hello all,

When we did virtio device hotplug in Q35 platform, which always failed in 
hotplug.

Would you please tell me how to configure VM to make virtio device hotplug work 
in Q35 platform?  Many thanks!


Regards,

Yang zhong



Re: [Qemu-devel] Can I mount encrypt qcow2?

2017-07-23 Thread 陳培泓
I check to the newest version of qemu.

and do the cmds to install followed by the documents in github:

> mkdir build
> cd build
> ../configure
> make


and it show nothing errors​
It's the version when I enter qemu-img --help:

and I execute encrypt format(luks) to the qcow2 file:

​
always shows the error, how to solve it?​

2017-07-21 22:18 GMT+08:00 Eric Blake :

> On 07/21/2017 09:06 AM, Daniel P. Berrange wrote:
> >> Oops, looks like 'git send-email' doesn't know how to auto-cc
> >> 'Reported-by:' tags.
> >
> > That's something that's bugged me too - someone should write a patch for
> > git :-)
>
> Attempted:
> http://marc.info/?l=git&m=150064653516706&w=2
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.   +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
>
>


Re: [Qemu-devel] [Qemu-ppc] [FIX PATCH v2] spapr: Fix QEMU abort during memory unplug

2017-07-23 Thread David Gibson
On Fri, Jul 21, 2017 at 08:23:02AM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 07/21/2017 03:57 AM, David Gibson wrote:
> > On Fri, Jul 21, 2017 at 10:21:06AM +0530, Bharata B Rao wrote:
> > > Commit 0cffce56 (hw/ppc/spapr.c: adding pending_dimm_unplugs to
> > > sPAPRMachineState) introduced a new way to track pending LMBs of DIMM
> > > device that is marked for removal. Since this commit we can hit the
> > > assert in spapr_pending_dimm_unplugs_add() in the following situation:
> > > 
> > > - DIMM device removal fails as the guest doesn't allow the removal.
> > > - Subsequent attempt to remove the same DIMM would hit the assert
> > >as the corresponding sPAPRDIMMState is still part of the
> > >pending_dimm_unplugs list.
> > > 
> > > Fix this by removing the assert and conditionally adding the
> > > sPAPRDIMMState to pending_dimm_unplugs list only when it is not
> > > already present.
> > > 
> > > Fixes: 0cffce56ae3501c5783d779f97993ce478acf856
> > > Signed-off-by: Bharata B Rao 
> > Applied to ppc-for-2.10.
> > 
> > > ---
> > > Changes in v2:
> > > - sPAPRDIMMState is now allocated within spapr_pending_dimm_unplugs_add()
> > >itself (David Gibson)
> > > - spapr_recover_pending_dimm_state() should never return a NULL 
> > > sPAPRDIMMState,
> > >added an assert for the same.
> > > 
> > >   hw/ppc/spapr.c | 37 +
> > >   1 file changed, 21 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > index 1cb09e7..2465b27 100644
> > > --- a/hw/ppc/spapr.c
> > > +++ b/hw/ppc/spapr.c
> > > @@ -2850,11 +2850,25 @@ static sPAPRDIMMState 
> > > *spapr_pending_dimm_unplugs_find(sPAPRMachineState *s,
> > >   return dimm_state;
> > >   }
> > > -static void spapr_pending_dimm_unplugs_add(sPAPRMachineState *spapr,
> > > -   sPAPRDIMMState *dimm_state)
> > > +static sPAPRDIMMState *spapr_pending_dimm_unplugs_add(sPAPRMachineState 
> > > *spapr,
> > > +  uint32_t nr_lmbs,
> > > +  PCDIMMDevice *dimm)
> > >   {
> > > -g_assert(!spapr_pending_dimm_unplugs_find(spapr, dimm_state->dimm));
> > > -QTAILQ_INSERT_HEAD(&spapr->pending_dimm_unplugs, dimm_state, next);
> > > +sPAPRDIMMState *ds = NULL;
> > > +
> > > +/*
> > > + * If this request is for a DIMM whose removal had failed earlier
> > > + * (due to guest's refusal to remove the LMBs), we would have this
> > > + * dimm already in the pending_dimm_unplugs list. In that
> > > + * case don't add again.
> > > + */
> > > +if (!spapr_pending_dimm_unplugs_find(spapr, dimm)) {
> > > +ds = g_malloc0(sizeof(sPAPRDIMMState));
> > > +ds->nr_lmbs = nr_lmbs;
> > > +ds->dimm = dimm;
> > > +QTAILQ_INSERT_HEAD(&spapr->pending_dimm_unplugs, ds, next);
> > > +}
> > > +return ds;
> In case unplugs_find() founds a DIMM state (the condition this patch is
> trying to fix), ds is
> returned as NULL. This is working because unplug_request() does not use the
> return value
> of unplugs_add() and because recover() is only called if find() returns
> NULL. Still not
> so pretty.
> 
> What makes sense here is something like:
> 
> ds = spapr_pending_dimm_unplugs_find(spapr, dimm);
> if (!ds) {
> (...)
> }
> return ds;
> 
> I think this is not worth sending a v3, David can amend it in the tree.
> After amending feel
> free to add my

Good point, amended.

> 
> Reviewed-by: Daniel Barboza 
> 
> 
> 
> 
> > >   }
> > >   static void spapr_pending_dimm_unplugs_remove(sPAPRMachineState *spapr,
> > > @@ -2875,7 +2889,6 @@ static sPAPRDIMMState 
> > > *spapr_recover_pending_dimm_state(sPAPRMachineState *ms,
> > >   uint32_t avail_lmbs = 0;
> > >   uint64_t addr_start, addr;
> > >   int i;
> > > -sPAPRDIMMState *ds;
> > >   addr_start = object_property_get_int(OBJECT(dimm), 
> > > PC_DIMM_ADDR_PROP,
> > >&error_abort);
> > > @@ -2891,11 +2904,7 @@ static sPAPRDIMMState 
> > > *spapr_recover_pending_dimm_state(sPAPRMachineState *ms,
> > >   addr += SPAPR_MEMORY_BLOCK_SIZE;
> > >   }
> > > -ds = g_malloc0(sizeof(sPAPRDIMMState));
> > > -ds->nr_lmbs = avail_lmbs;
> > > -ds->dimm = dimm;
> > > -spapr_pending_dimm_unplugs_add(ms, ds);
> > > -return ds;
> > > +return spapr_pending_dimm_unplugs_add(ms, avail_lmbs, dimm);
> > >   }
> > >   /* Callback to be called during DRC release. */
> > > @@ -2911,6 +2920,7 @@ void spapr_lmb_release(DeviceState *dev)
> > >* during the unplug process. In this case recover it. */
> > >   if (ds == NULL) {
> > >   ds = spapr_recover_pending_dimm_state(spapr, PC_DIMM(dev));
> > > +g_assert(ds);
> > >   /* The DRC being examined by the caller at least must be 
> > > counted */
> > >   g_assert(ds->nr_lmbs);
>

Re: [Qemu-devel] Unified Datagram Socket Transport

2017-07-23 Thread Jason Wang



On 2017年07月21日 12:25, Anton Ivanov wrote:



On 21/07/17 04:55, Jason Wang wrote:



On 2017年07月21日 03:12, anton.iva...@cambridgegreys.com wrote:

Hi all,

This addresses comments so far except Eric's suggestion to use
InetSocketAddressBase. If I understand correctly its intended use,
it will not be of help for protocols which have no port (raw
sockets - GRE, L2TPv3, etc).

It also includes a port of the original socket.c transport to
the new UDST backend. The relevant code is ifdef-ed so there
should be no effect on other systems.


This looks sub-optimal. If you want to do this, I would rather 
suggest you just extend the socket dgram backend like what udst did now.


Apologies, do you mean extend it further to handle the tcp form? 


Not that far, since you try to convert net_dgram_socket_info, I'm 
thinking just do the all udst in net_dgram_socket. For recvmmsg you can 
have transport specific callback for this.


Thanks



Re: [Qemu-devel] [PATCH v1 0/6] target/s390x: tcg improvments + MSA functions

2017-07-23 Thread Richard Henderson

On 07/21/2017 05:56 AM, David Hildenbrand wrote:

1. smaller pgm irq instruction length fixes


I think maybe we should add insn length data to the unwind data, so that we can 
have env->int_pgm_ilen set to the right value whenever cpu_restore_state is called.



r~



Re: [Qemu-devel] [RFC PATCH 04/26] ppc/xive: introduce a skeleton for the XIVE interrupt controller model

2017-07-23 Thread David Gibson
On Fri, Jul 21, 2017 at 06:21:31PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2017-07-21 at 17:50 +1000, David Gibson wrote:
> > On Wed, Jul 19, 2017 at 02:02:18PM +1000, Benjamin Herrenschmidt wrote:
> > > On Wed, 2017-07-19 at 13:08 +1000, David Gibson wrote:
> > > > 
> > > > I'm somewhat uncomfortable with an irq allocater here in the intc
> > > > code.  As a rule, irq allocation is the responsibility of the machine,
> > > > not any sub-component.  Furthermore, it should allocate in a way which
> > > > is repeatable, since they need to stay stable across reboots and
> > > > migrations.
> > > > 
> > > > And, yes, we have an allocator of sorts in XICS - it has caused a
> > > > number of problems in the past.
> > > 
> > > So
> > > 
> > > For a bare metal model (which we don't have yet) of XIVE, the IRQ
> > > numbering is entirely an artifact of how the HW is configured. There
> > > should thus be no interrupt numbers visible to qemu.
> > 
> > Uh.. I don't entirely follow.  Do you mean that during boot the guest
> > programs the irq numbers into the various components?
> 
> I said a "bare metal model" but yes. Pretty much. 

Right, by "guest" I meant the kernel running under qemu, even if its
running on a bare-metal equivalent platform.

> > In that case this allocator stuff definitely doesn't belong on the
> > xive code.
> > 
> > > For a PAPR model things are a bit different, but if we want to
> > > maximize code re-use between the two, we probably need to make sure
> > > the interrupts "allocated" by the machine for XIVE can be represented
> > > by the HW model.
> > > 
> > > That means:
> > > 
> > >  - Each chip has a range (high bits are the block ID, which maps to a
> > > chip, low bits, around 512K to 1M interrupts is the per-chip space).
> > > 
> > >  - Interrupts 0...N of that range (N depends on how much backing
> > > memory and MMIO space is provisioned for each chip) are "generic IPIs"
> > > which are somewhat generic interrupt source that can be triggered with
> > > an MMIO store and routed to any target. Those are used in PAPR for
> > > things like IPIs and some type of accelerator interrupts.
> > > 
> > >  - Portions of that range (which may or may not overlap the 0...N
> > > above, if they do they "shadow" the generic interrupts) can be
> > > configured to be the HW sources from the various PCIe bridges and
> > > the PSI controller.
> > 
> > Err.. I'm confused how this not sure this relates to spapr.  There are
> > no chips or PSI there, and the PCI bridges aren't really the same
> > thing.
> 
> The above is the HW model, sorry for the confusion. With a few comments
> about how they are used in PAPR.
> 
> So yes, in PAPR there's an "allocator" because the hypervisor will
> create a guest "virtual" (or logical to use PAPR terminology) interrupt
> number space, in order to represents the various interrupts into the
> guest.

Ok, but are each of those logical irqs bound to a specific device/PHB
line/whatever, or can they be configured by the guest?

> Those numbers however are just tokens, they don't have to represent any
> real HW concept. So they can be "allocated" in a rather fixed way, for
> example, you could have something like a fixed map where you put all
> the PCI interrupts at a certain number (a factor of the PHB# with room
> or a fix number per PHB, maybe 16K or so, the HW does 4K max). Another
> based would have a chunk of "general purpose" IPIs (for use for actual
> IPIs and for other things to come). And a range for the virtual device
> interrupts for example. Or you can just use an allocator.

Hm.  So what I'm meaning by an "allocator" is something at least
partially dynamic.  Something you say "give me an irq" and it gives
you the next available or similar.  As opposed to any mapping from
devices to (logical) irqs, which the machine will need to supply one
way or another.

> But it's fundamentally an allocator that sits in the hypervisor, so in
> our case, I would say in the spapr "component" of XIVE, rather than the
> XIVE HW model itself.

Maybe..

> Now what Cedric did, because XIVE is very complex and we need something
> for PAPR quickly, is not a complete HW model, but a somewhat simplified
> one that only handles what PAPR exposes. So in that case where the
> allocator sits is a bit of a TBD...

Hm, ok.  My concern here is that "dynamic" allocation of irqs at the
machine type level needs extreme caution, or the irqs may not be
stable which will generally break migration.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [Qemu-devel] [PATCH 1/3] Unified Datagram Socket Transport

2017-07-23 Thread Jason Wang



On 2017年07月22日 01:50, Anton Ivanov wrote:

[snip]


+NetUnifiedState *s = (NetUnifiedState *) us;
+L2TPV3TunnelParams *p = (L2TPV3TunnelParams *) s->params;


How about embedding NetUnifiedState into this structure and keep 
using NetL2TPV3State? Then:


-  's' could be kept and lots of lines of changes could be saved here 
and l2tpv3_verify_header()
-  each transport could have their own type instead of using 
NET_CLIENT_DRIVER_L2TPV3


That means each of them having their own read/write functions in each 
transport, destroy functions, etc.


Looks not? Just something like

typedef struct L2TPV3State {
NetUDSTState udst;
/* L2TPV3 specific data */

};

static NetClientInfo l2tpv3_info = {
/* we share this one for all types for now, wrong I know :) */
.type = NET_CLIENT_DRIVER_L2TPV3,
.size = sizeof(L2TPV3State),
.receive = net_udst_receive_dgram,
.receive_iov = net_udst_receive_dgram_iov,
.poll = udst_poll,
.cleanup = net_udst_cleanup,
};

Thanks



I am trying to achieve exactly the opposite which across all 
transports should save more code. There should be nothing in a 
transport which leverages the common datagram processing backend except:


1. Init and parse arguments
2. Form Header
3. Verify Header

All the rest can be common for a large family of datagram based 
transports - L2TPv3, GRE, RAW (both full interface and just pulling a 
specific vlan out of it), etc.


It is trivial to do that for fixed size headers (as in the current 
patchset family). It is a bit more difficult to that for variable 
headers, but still datagram (GUE, Geneve, etc).


These may also add 4 - I/O to control plane, but it remains to be seen 
if that is needed.


This also makes any improvements to the backend - f.e. switching from 
send() to sendmmsg() automatically available for all transports.


What cannot be done is to shoehorn into this stream based. I believe 
we have only one of those - the original socket.c in tcp mode and we 
can leave it to stay that way and switch only the datagram mode to a 
better backend.


I am going through the other comments in the meantime to see if I 
missed something else and fixing the omissions.


A.

[snip]






Re: [Qemu-devel] [RFC PATCH 04/26] ppc/xive: introduce a skeleton for the XIVE interrupt controller model

2017-07-23 Thread Alexey Kardashevskiy
On 24/07/17 13:28, David Gibson wrote:
> On Fri, Jul 21, 2017 at 06:21:31PM +1000, Benjamin Herrenschmidt wrote:
>> On Fri, 2017-07-21 at 17:50 +1000, David Gibson wrote:
>>> On Wed, Jul 19, 2017 at 02:02:18PM +1000, Benjamin Herrenschmidt wrote:
 On Wed, 2017-07-19 at 13:08 +1000, David Gibson wrote:
>
> I'm somewhat uncomfortable with an irq allocater here in the intc
> code.  As a rule, irq allocation is the responsibility of the machine,
> not any sub-component.  Furthermore, it should allocate in a way which
> is repeatable, since they need to stay stable across reboots and
> migrations.
>
> And, yes, we have an allocator of sorts in XICS - it has caused a
> number of problems in the past.

 So

 For a bare metal model (which we don't have yet) of XIVE, the IRQ
 numbering is entirely an artifact of how the HW is configured. There
 should thus be no interrupt numbers visible to qemu.
>>>
>>> Uh.. I don't entirely follow.  Do you mean that during boot the guest
>>> programs the irq numbers into the various components?
>>
>> I said a "bare metal model" but yes. Pretty much. 
> 
> Right, by "guest" I meant the kernel running under qemu, even if its
> running on a bare-metal equivalent platform.
> 
>>> In that case this allocator stuff definitely doesn't belong on the
>>> xive code.
>>>
 For a PAPR model things are a bit different, but if we want to
 maximize code re-use between the two, we probably need to make sure
 the interrupts "allocated" by the machine for XIVE can be represented
 by the HW model.

 That means:

  - Each chip has a range (high bits are the block ID, which maps to a
 chip, low bits, around 512K to 1M interrupts is the per-chip space).

  - Interrupts 0...N of that range (N depends on how much backing
 memory and MMIO space is provisioned for each chip) are "generic IPIs"
 which are somewhat generic interrupt source that can be triggered with
 an MMIO store and routed to any target. Those are used in PAPR for
 things like IPIs and some type of accelerator interrupts.

  - Portions of that range (which may or may not overlap the 0...N
 above, if they do they "shadow" the generic interrupts) can be
 configured to be the HW sources from the various PCIe bridges and
 the PSI controller.
>>>
>>> Err.. I'm confused how this not sure this relates to spapr.  There are
>>> no chips or PSI there, and the PCI bridges aren't really the same
>>> thing.
>>
>> The above is the HW model, sorry for the confusion. With a few comments
>> about how they are used in PAPR.
>>
>> So yes, in PAPR there's an "allocator" because the hypervisor will
>> create a guest "virtual" (or logical to use PAPR terminology) interrupt
>> number space, in order to represents the various interrupts into the
>> guest.
> 
> Ok, but are each of those logical irqs bound to a specific device/PHB
> line/whatever, or can they be configured by the guest?
> 
>> Those numbers however are just tokens, they don't have to represent any
>> real HW concept. So they can be "allocated" in a rather fixed way, for
>> example, you could have something like a fixed map where you put all
>> the PCI interrupts at a certain number (a factor of the PHB# with room
>> or a fix number per PHB, maybe 16K or so, the HW does 4K max). Another
>> based would have a chunk of "general purpose" IPIs (for use for actual
>> IPIs and for other things to come). And a range for the virtual device
>> interrupts for example. Or you can just use an allocator.
> 
> Hm.  So what I'm meaning by an "allocator" is something at least
> partially dynamic.  Something you say "give me an irq" and it gives
> you the next available or similar.  As opposed to any mapping from
> devices to (logical) irqs, which the machine will need to supply one
> way or another.

I am probably reading it wrong but the XIVE's allocator allocates IRQ
ranges for interrupt source controls (which are CPU cores, PHBs, PSI - in
bare metal - so they allocate just once per machine creation). Individual
interrupts are still allocated via spapr_ics_alloc_block().



> 
>> But it's fundamentally an allocator that sits in the hypervisor, so in
>> our case, I would say in the spapr "component" of XIVE, rather than the
>> XIVE HW model itself.
> 
> Maybe..
> 
>> Now what Cedric did, because XIVE is very complex and we need something
>> for PAPR quickly, is not a complete HW model, but a somewhat simplified
>> one that only handles what PAPR exposes. So in that case where the
>> allocator sits is a bit of a TBD...
> 
> Hm, ok.  My concern here is that "dynamic" allocation of irqs at the
> machine type level needs extreme caution, or the irqs may not be
> stable which will generally break migration.
> 


-- 
Alexey



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH 3/3] Unified Datagram Socket Transport - raw support

2017-07-23 Thread Jason Wang



On 2017年07月22日 02:50, Anton Ivanov wrote:


[snip]


+"-netdev raw,id=str,ifname=ifname\n"
+"configure a network backend with ID 'str' 
connected to\n"
+"an Ethernet interface named ifname via raw 
socket.\n"
+"This backend does not change the interface 
settings.\n"
+"Most interfaces will require being set into 
promisc mode,\n"
+"as well having most offloads (TSO, etc) turned 
off.\n"
+"Some virtual interfaces like tap support only 
RX.\n"


Pay attention that qemu supports vnet header. So any reason to turn 
off e.g TSO here?


I am not aware of any means to get extra info like checksums, etc show 
up on raw socket read.


If you know a way to make them show up, this is worth investigating.


See packet_rcv_vnet(). But a known 'issue' for raw socket is that it 
forbids change vnet header length after creation, we may need some 
workaround in qemu.







  #endif
  "-netdev 
socket,id=str[,fd=h][,listen=[host]:port][,connect=host:port]\n"
  "configure a network backend to connect to 
another network\n"
@@ -2463,6 +2470,32 @@ qemu-system-i386 linux.img -net nic -net 
gre,src=4.2.3.1,dst=1.2.3.4

@end example
  +@item -netdev raw,id=@var{id},ifname=@var{ifname}
+@itemx -net raw[,vlan=@var{n}][,name=@var{name}],ifname=@var{ifname}
+Connect VLAN @var{n} directly to an Ethernet interface using raw 
socket.

+
+This transport allows a VM to bypass most of the network stack 
which is

+extremely useful for tapping.
+
+@item ifname=@var{ifname}
+interface name (mandatory)
+
+@example
+# set up the interface - put it in promiscuous mode and turn off 
offloads

+ifconfig eth0 up
+ifconfig eth0 promisc
+
+/sbin/ethtool -K eth0 gro off
+/sbin/ethtool -K eth0 tso off
+/sbin/ethtool -K eth0 gso off
+/sbin/ethtool -K eth0 tx off


Any reason to turn off tx here?


Yes - we already have it computed and we have written it as is as a 
whole packet. You do not want it
re-computed as at least some adapters do silly things if you start 
writing raw and the checksum already exists.


This looks like a bug of the driver?

For GRO it's easier to understand since guest may not handle big packets 
with partial checksum. But tso,gso,tx, this still looks questionable for 
the nic which may want to offload them to card (e.g virtio-net).




Once again, this one of the pros/cons of using tpacket vs recv/send 
(with or without mmsg) on a raw socket.


recvm(m)sg/sendm(m)sg are brute force as far as offloads, but things 
like scatter/gather work correctly so there are little copies.


Compared to that, tpacket will allow you some access to checksumming 
which you can map onto checksum offload in a vNIC. As a payback for 
this you end up copying in more cases than for send/recvmmsg and you 
pay penalty for timestamping if you do not have a hardware timestamp 
source in the NIC.


The other issue I always had with tpacket is that you "see" your own 
packets so you have to manage a  RX side BPF filter which removes 
those so you do not see your own packets.


Don't get here, looks like I don't get this 'issue'. Anyway we can 
discuss this when I post the tpacket backend.


Thanks.

That can get quite interesting if you have a lot of MACs on a NIC 
(f.e. when there are multicast apps). Not sure if this is still the 
case - it definitely was in mid 3.x Linux kernels. If you use raw 
sendm(m)sg there is no issue - the packets are not looped when writing 
to physical interfaces.





+
+# launch QEMU instance - if your network has reorder or is very 
lossy add ,pincounter

+
+qemu-system-i386 linux.img -net nic -net raw,ifname=eth0


Can we switch to use -netdev here?


This is done in the new revisions.



Thanks


+
+@end example
+
  @item -netdev 
vde,id=@var{id}[,sock=@var{socketpath}][,port=@var{n}][,group=@var{groupname}][,mode=@var{octalmode}]
  @itemx -net 
vde[,vlan=@var{n}][,name=@var{name}][,sock=@var{socketpath}] 
[,port=@var{n}][,group=@var{groupname}][,mode=@var{octalmode}]
  Connect VLAN @var{n} to PORT @var{n} of a vde switch running on 
host and








Re: [Qemu-devel] [RFC PATCH 06/26] ppc/xive: introduce a XIVE interrupt source model

2017-07-23 Thread David Gibson
On Wed, Jul 05, 2017 at 07:13:19PM +0200, Cédric Le Goater wrote:
> This is very similar to the current ICS_SIMPLE model in XICS. We try
> to reuse the ICS model because the sPAPR machine is tied to the
> XICSFabric interface and should be using a common framework to switch
> from one controller model to another: XICS <-> XIVE.

Hm.  I'm not entirely concvinced re-using the xics ICSState class in
this way is a good idea, though maybe it's a reasonable first step.
With this patch alone some code is shared, but there are some real
uglies around the edges.

Seems to me at least long term you need to either 1) make the XIVE ics
separate, even if it has similarities to the XICS one or 2) truly
unify them, with a common base type and methods to handle the
differences.


> The next patch will introduce the MMIO handlers to interact with XIVE
> interrupt sources.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  hw/intc/xive.c| 110 
> ++
>  include/hw/ppc/xive.h |  12 ++
>  2 files changed, 122 insertions(+)
> 
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 5b14d8155317..9ff14c0da595 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -26,6 +26,115 @@
>  
>  #include "xive-internal.h"
>  
> +static void xive_icp_irq(XiveICSState *xs, int lisn)
> +{
> +
> +}
> +
> +/*
> + * XIVE Interrupt Source
> + */
> +static void xive_ics_set_irq_msi(XiveICSState *xs, int srcno, int val)
> +{
> +if (val) {
> +xive_icp_irq(xs, srcno + ICS_BASE(xs)->offset);
> +}
> +}
> +
> +static void xive_ics_set_irq_lsi(XiveICSState *xs, int srcno, int val)
> +{
> +ICSIRQState *irq = &ICS_BASE(xs)->irqs[srcno];
> +
> +if (val) {
> +irq->status |= XICS_STATUS_ASSERTED;
> +} else {
> +irq->status &= ~XICS_STATUS_ASSERTED;
> +}
> +
> +if (irq->status & XICS_STATUS_ASSERTED
> +&& !(irq->status & XICS_STATUS_SENT)) {
> +irq->status |= XICS_STATUS_SENT;
> +xive_icp_irq(xs, srcno + ICS_BASE(xs)->offset);
> +}
> +}
> +
> +static void xive_ics_set_irq(void *opaque, int srcno, int val)
> +{
> +XiveICSState *xs = ICS_XIVE(opaque);
> +ICSIRQState *irq = &ICS_BASE(xs)->irqs[srcno];
> +
> +if (irq->flags & XICS_FLAGS_IRQ_LSI) {
> +xive_ics_set_irq_lsi(xs, srcno, val);
> +} else {
> +xive_ics_set_irq_msi(xs, srcno, val);
> +}
> +}

e.g. you have some code re-use, but still need to more-or-less
duplicate the set_irq code as above.

> +static void xive_ics_reset(void *dev)
> +{
> +ICSState *ics = ICS_BASE(dev);
> +int i;
> +uint8_t flags[ics->nr_irqs];
> +
> +for (i = 0; i < ics->nr_irqs; i++) {
> +flags[i] = ics->irqs[i].flags;
> +}
> +
> +memset(ics->irqs, 0, sizeof(ICSIRQState) * ics->nr_irqs);
> +
> +for (i = 0; i < ics->nr_irqs; i++) {
> +ics->irqs[i].flags = flags[i];
> +}

This save, clear, restore is also kind ugly.  I'm also not sure why
this needs a reset method when I can't find one for the xics ICS.

Does the xics irqstate structure really cover what you need for xive?
I had the impression elsewhere that xive had a different priority
model to xics.  And there's the xics pointer in the icsstate structure
which is definitely redundant.

> +}
> +
> +static void xive_ics_realize(ICSState *ics, Error **errp)
> +{
> +XiveICSState *xs = ICS_XIVE(ics);
> +Object *obj;
> +Error *err = NULL;
> +
> +obj = object_property_get_link(OBJECT(xs), "xive", &err);
> +if (!obj) {
> +error_setg(errp, "%s: required link 'xive' not found: %s",
> +   __func__, error_get_pretty(err));
> +return;
> +}
> +xs->xive = XIVE(obj);
> +
> +if (!ics->nr_irqs) {
> +error_setg(errp, "Number of interrupts needs to be greater 0");
> +return;
> +}
> +
> +ics->irqs = g_malloc0(ics->nr_irqs * sizeof(ICSIRQState));
> +ics->qirqs = qemu_allocate_irqs(xive_ics_set_irq, xs, ics->nr_irqs);
> +
> +qemu_register_reset(xive_ics_reset, xs);
> +}
> +
> +static Property xive_ics_properties[] = {
> +DEFINE_PROP_UINT32("nr-irqs", ICSState, nr_irqs, 0),
> +DEFINE_PROP_UINT32("irq-base", ICSState, offset, 0),
> +DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void xive_ics_class_init(ObjectClass *klass, void *data)
> +{
> +DeviceClass *dc = DEVICE_CLASS(klass);
> +ICSStateClass *isc = ICS_BASE_CLASS(klass);
> +
> +isc->realize = xive_ics_realize;
> +
> +dc->props = xive_ics_properties;
> +}
> +
> +static const TypeInfo xive_ics_info = {
> +.name = TYPE_ICS_XIVE,
> +.parent = TYPE_ICS_BASE,
> +.instance_size = sizeof(XiveICSState),
> +.class_init = xive_ics_class_init,
> +};
> +
>  /*
>   * Main XIVE object
>   */
> @@ -123,6 +232,7 @@ static const TypeInfo xive_info = {
>  static void xive_register_types(void)
>  {
>  type_register_static(&xive_info);
> +type_register_static(&xive_ics_info);
>  }
>  
>  type_init(xive_register_

Re: [Qemu-devel] [RFC PATCH 07/26] ppc/xive: add MMIO handlers to the XIVE interrupt source

2017-07-23 Thread David Gibson
On Wed, Jul 05, 2017 at 07:13:20PM +0200, Cédric Le Goater wrote:
> Each interrupt source is associated with a 2-bit state machine called
> an Event State Buffer (ESB). It is controlled by MMIO to trigger
> events.
> 
> See code for more details on the states.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  hw/intc/xive.c| 230 
> ++
>  include/hw/ppc/xive.h |   3 +
>  2 files changed, 233 insertions(+)
> 
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 9ff14c0da595..816031b8ac81 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -32,6 +32,226 @@ static void xive_icp_irq(XiveICSState *xs, int lisn)
>  }
>  
>  /*
> + * "magic" Event State Buffer (ESB) MMIO offsets.
> + *
> + * Each interrupt source has a 2-bit state machine called ESB
> + * which can be controlled by MMIO. It's made of 2 bits, P and
> + * Q. P indicates that an interrupt is pending (has been sent
> + * to a queue and is waiting for an EOI). Q indicates that the
> + * interrupt has been triggered while pending.
> + *
> + * This acts as a coalescing mechanism in order to guarantee
> + * that a given interrupt only occurs at most once in a queue.
> + *
> + * When doing an EOI, the Q bit will indicate if the interrupt
> + * needs to be re-triggered.
> + *
> + * The following offsets into the ESB MMIO allow to read or
> + * manipulate the PQ bits. They must be used with an 8-bytes
> + * load instruction. They all return the previous state of the
> + * interrupt (atomically).
> + *
> + * Additionally, some ESB pages support doing an EOI via a
> + * store at 0 and some ESBs support doing a trigger via a
> + * separate trigger page.
> + */
> +#define XIVE_ESB_GET0x800
> +#define XIVE_ESB_SET_PQ_00  0xc00
> +#define XIVE_ESB_SET_PQ_01  0xd00
> +#define XIVE_ESB_SET_PQ_10  0xe00
> +#define XIVE_ESB_SET_PQ_11  0xf00
> +
> +#define XIVE_ESB_VAL_P  0x2
> +#define XIVE_ESB_VAL_Q  0x1
> +
> +#define XIVE_ESB_RESET  0x0
> +#define XIVE_ESB_PENDING0x2
> +#define XIVE_ESB_QUEUED 0x3
> +#define XIVE_ESB_OFF0x1
> +
> +static uint8_t xive_pq_get(XIVE *x, uint32_t lisn)
> +{
> +uint32_t idx = lisn;
> +uint32_t byte = idx / 4;
> +uint32_t bit  = (idx % 4) * 2;
> +uint8_t* pqs = (uint8_t *) x->sbe;
> +
> +return (pqs[byte] >> bit) & 0x3;
> +}
> +
> +static void xive_pq_set(XIVE *x, uint32_t lisn, uint8_t pq)
> +{
> +uint32_t idx = lisn;
> +uint32_t byte = idx / 4;
> +uint32_t bit  = (idx % 4) * 2;
> +uint8_t* pqs = (uint8_t *) x->sbe;
> +
> +pqs[byte] &= ~(0x3 << bit);
> +pqs[byte] |= (pq & 0x3) << bit;

I know it probably amounts to the same thing given the context, but
I'd be more comfortable with a temporary and an obviously atomic
update than two writes to the real state variable.

> +}
> +
> +static bool xive_pq_eoi(XIVE *x, uint32_t lisn)
> +{
> +uint8_t old_pq = xive_pq_get(x, lisn);
> +
> +switch (old_pq) {
> +case XIVE_ESB_RESET:
> +xive_pq_set(x, lisn, XIVE_ESB_RESET);
> +return false;
> +case XIVE_ESB_PENDING:
> +xive_pq_set(x, lisn, XIVE_ESB_RESET);
> +return false;
> +case XIVE_ESB_QUEUED:
> +xive_pq_set(x, lisn, XIVE_ESB_PENDING);
> +return true;
> +case XIVE_ESB_OFF:
> +xive_pq_set(x, lisn, XIVE_ESB_OFF);
> +return false;
> +default:
> + g_assert_not_reached();
> +}
> +}
> +
> +static bool xive_pq_trigger(XIVE *x, uint32_t lisn)
> +{
> +uint8_t old_pq = xive_pq_get(x, lisn);
> +
> +switch (old_pq) {
> +case XIVE_ESB_RESET:
> +xive_pq_set(x, lisn, XIVE_ESB_PENDING);
> +return true;
> +case XIVE_ESB_PENDING:
> +xive_pq_set(x, lisn, XIVE_ESB_QUEUED);
> +return true;
> +case XIVE_ESB_QUEUED:
> +xive_pq_set(x, lisn, XIVE_ESB_QUEUED);
> +return true;
> +case XIVE_ESB_OFF:
> +xive_pq_set(x, lisn, XIVE_ESB_OFF);
> +return false;
> +default:
> + g_assert_not_reached();
> +}
> +}
> +
> +/*
> + * XIVE Interrupt Source MMIOs
> + */
> +static void xive_ics_eoi(XiveICSState *xs, uint32_t srcno)
> +{
> +ICSIRQState *irq = &ICS_BASE(xs)->irqs[srcno];
> +
> +if (irq->flags & XICS_FLAGS_IRQ_LSI) {
> +irq->status &= ~XICS_STATUS_SENT;
> +}
> +}
> +
> +/* TODO: handle second page */

Is this comment still relevent?

> +static uint64_t xive_esb_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +XiveICSState *xs = ICS_XIVE(opaque);
> +XIVE *x = xs->xive;
> +uint32_t offset = addr & 0xF00;
> +uint32_t srcno = addr >> xs->esb_shift;
> +uint32_t lisn = srcno + ICS_BASE(xs)->offset;
> +XiveIVE *ive;
> +uint64_t ret = -1;
> +
> +ive = xive_get_ive(x, lisn);
> +if (!ive || !(ive->w & IVE_VALID))  {
> +qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %d\n", lisn);
> +goto out;
> +}
> +
> +if (srcno >

Re: [Qemu-devel] [PATCH v1 3/6] target/s390x: implement spm (SET PROGRAM MASK)

2017-07-23 Thread Richard Henderson

On 07/21/2017 05:56 AM, David Hildenbrand wrote:

+tcg_gen_extrl_i64_i32(cc, o->in1);
+tcg_gen_shri_i32(cc, cc, 28);
+tcg_gen_andi_i32(cc, cc, 0x3ul);
+tcg_gen_mov_i32(cc_op, cc);
+tcg_temp_free_i32(cc);
+set_cc_static(s);


 tcg_gen_extrl_i64_i32(cc_op, o->in1);
 tcg_gen_extract_i32(cc_op, cc_op, 28, 2);


Otherwise,

Reviewed-by: Richard Henderson 


r~



Re: [Qemu-devel] [PATCH v1 4/6] target/s390x: move wrap_address to cpu.h

2017-07-23 Thread Richard Henderson

On 07/21/2017 05:56 AM, David Hildenbrand wrote:

Signed-off-by: David Hildenbrand
---
  target/s390x/cpu.h| 14 ++
  target/s390x/mem_helper.c | 14 --
  2 files changed, 14 insertions(+), 14 deletions(-)


Reviewed-by: Richard Henderson 

Although another header, private to the helpers, might be better...


r~




Re: [Qemu-devel] >256 Virtio-net-pci hotplug Devices

2017-07-23 Thread Kinsella, Ray

So as it turns out at 512 devices, it is nothing to do SeaBIOS, it was the 
Kernel again.
It is taking quite a while to startup, a little over two hours (7489 seconds).
The main culprits appear to be enumerating/initializing the PCI Express ports 
and enabling interrupts. 

The PCI Express Root Ports are taking a long time to enumerate/ initializing.
42 minutes in total=2579/60=64 ports in total, 40 seconds each.

[   50.612822] pci_bus :80: root bus resource [bus 80-c1]
[  172.345361] pci :80:00.0: PCI bridge to [bus 81]
...
[ 2724.734240] pci :80:08.0: PCI bridge to [bus c1]
[ 2751.154702] ACPI: Enabled 2 GPEs in block 00 to 3F

I assume the 1 hour (3827 seconds) below is being spent enabling interrupts.

[ 2899.394288] ACPI: PCI Interrupt Link [GSIG] enabled at IRQ 22
[ 2899.531324] ACPI: PCI Interrupt Link [GSIH] enabled at IRQ 23
[ 2899.534778] ACPI: PCI Interrupt Link [GSIE] enabled at IRQ 20
[ 6726.914388] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[ 6726.937932] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 
16550A
[ 6726.964699] Linux agpgart interface v0.103

There finally there is another 20 minutes to find in the boot.

[ 7489.202589] virtio_net virtio515 enp193s0f0: renamed from eth513

Poky (Yocto Project Reference Distro) 2.3 qemux86-64 ttyS0

qemux86-64 login: root

I will remove the virtio-net-pci devices and hotplug them instead.
In theory it should improve boot time, at expense of incurring some of these 
costs at runtime. 

Ray K

-Original Message-
From: Kevin O'Connor [mailto:ke...@koconnor.net] 
Sent: Sunday, July 23, 2017 1:05 PM
To: Marcel Apfelbaum ; Kinsella, Ray 
Cc: qemu-devel@nongnu.org; seab...@seabios.org; Gerd Hoffmann 
; Michael Tsirkin 
Subject: Re: >256 Virtio-net-pci hotplug Devices

On Sun, Jul 23, 2017 at 07:28:01PM +0300, Marcel Apfelbaum wrote:
> On 22/07/2017 2:57, Kinsella, Ray wrote:
> > When scaling up to 512 Virtio-net devices SeaBIOS appears to really 
> > slow down when configuring PCI Config space - haven't manage to get 
> > this to work yet.

If there is a slowdown in SeaBIOS, it would help to produce a log with timing 
information - see:
https://www.seabios.org/Debugging#Timing_debug_messages

It may also help to increase the debug level in SeaBIOS to get more fine 
grained timing reports.

-Kevin



Re: [Qemu-devel] [RFC PATCH 08/26] ppc/xive: add flags to the XIVE interrupt source

2017-07-23 Thread David Gibson
On Wed, Jul 05, 2017 at 07:13:21PM +0200, Cédric Le Goater wrote:
> These flags define some characteristics of the source :
> 
>  - XIVE_SRC_H_INT_ESB  the Event State Buffer are controlled with a
>specific hcall H_INT_ESB

What's the other option?

>  - XIVE_SRC_LSILSI or MSI source

Hrm.  This definitely duplicates info that is in the XICS per irq
state which you're re-using (and which you're using in the xive code
at this point).

>  - XIVE_SRC_TRIGGERthe full function page supports trigger
>  - XIVE_SRC_STORE_EOI  EOI can with a store.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  hw/intc/xive.c| 1 +
>  include/hw/ppc/xive.h | 9 +
>  2 files changed, 10 insertions(+)
> 
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 816031b8ac81..8f8bb8b787bd 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -345,6 +345,7 @@ static Property xive_ics_properties[] = {
>  DEFINE_PROP_UINT32("nr-irqs", ICSState, nr_irqs, 0),
>  DEFINE_PROP_UINT32("irq-base", ICSState, offset, 0),
>  DEFINE_PROP_UINT32("shift", XiveICSState, esb_shift, 0),
> +DEFINE_PROP_UINT64("flags", XiveICSState, flags, 0),
>  DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 5303d96f5f59..1178300c9df3 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -30,9 +30,18 @@ typedef struct XiveICSState XiveICSState;
>  #define TYPE_ICS_XIVE "xive-source"
>  #define ICS_XIVE(obj) OBJECT_CHECK(XiveICSState, (obj), TYPE_ICS_XIVE)
>  
> +/*
> + * XIVE Interrupt source flags
> + */
> +#define XIVE_SRC_H_INT_ESB (1ull << (63 - 60))
> +#define XIVE_SRC_LSI   (1ull << (63 - 61))
> +#define XIVE_SRC_TRIGGER   (1ull << (63 - 62))
> +#define XIVE_SRC_STORE_EOI (1ull << (63 - 63))
> +
>  struct XiveICSState {
>  ICSState parent_obj;
>  
> +uint64_t flags;
>  uint32_t esb_shift;
>  MemoryRegion esb_iomem;
>  

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [Qemu-devel] [RFC PATCH 09/26] ppc/xive: add an overall memory region for the ESBs

2017-07-23 Thread David Gibson
On Wed, Jul 05, 2017 at 07:13:22PM +0200, Cédric Le Goater wrote:
> Each source adds its own ESB mempry region to the overall ESB memory
> region of the controller. It will be mapped in the CPU address space
> when XIVE is activated.
> 
> The default mapping address for the ESB memory region is the same one
> used on baremetal.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  hw/intc/xive-internal.h |  5 +
>  hw/intc/xive.c  | 44 +++-
>  2 files changed, 48 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
> index 8e755aa88a14..c06be823aad0 100644
> --- a/hw/intc/xive-internal.h
> +++ b/hw/intc/xive-internal.h
> @@ -98,6 +98,7 @@ struct XIVE {
>  SysBusDevice parent;
>  
>  /* Properties */
> +uint32_t chip_id;

So there is a XIVE object per chip.  How does this work on PAPR?  One
logical chip/XIVE, or something more complex?

>  uint32_t nr_targets;
>  
>  /* IRQ number allocator */
> @@ -111,6 +112,10 @@ struct XIVE {
>  void *sbe;
>  XiveIVE  *ivt;
>  XiveEQ   *eqdt;
> +
> +/* ESB and TIMA memory location */
> +hwaddr   vc_base;
> +MemoryRegion esb_iomem;
>  };
>  
>  void xive_reset(void *dev);
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 8f8bb8b787bd..a1cb87a07b76 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -312,6 +312,7 @@ static void xive_ics_realize(ICSState *ics, Error **errp)
>  XiveICSState *xs = ICS_XIVE(ics);
>  Object *obj;
>  Error *err = NULL;
> +XIVE *x;

I don't really like just 'x' for a context variable like this (as
opposed to a temporary).

>  
>  obj = object_property_get_link(OBJECT(xs), "xive", &err);
>  if (!obj) {
> @@ -319,7 +320,7 @@ static void xive_ics_realize(ICSState *ics, Error **errp)
> __func__, error_get_pretty(err));
>  return;
>  }
> -xs->xive = XIVE(obj);
> +x = xs->xive = XIVE(obj);
>  
>  if (!ics->nr_irqs) {
>  error_setg(errp, "Number of interrupts needs to be greater 0");
> @@ -338,6 +339,11 @@ static void xive_ics_realize(ICSState *ics, Error **errp)
>"xive.esb",
>(1ull << xs->esb_shift) * ICS_BASE(xs)->nr_irqs);
>  
> +/* Install the ESB memory region in the overall one */
> +memory_region_add_subregion(&x->esb_iomem,
> +ICS_BASE(xs)->offset * (1 << xs->esb_shift),
> +&xs->esb_iomem);
> +
>  qemu_register_reset(xive_ics_reset, xs);
>  }
>  
> @@ -375,6 +381,32 @@ static const TypeInfo xive_ics_info = {
>   */
>  #define MAX_HW_IRQS_ENTRIES (8 * 1024)
>  
> +/* VC BAR contains set translations for the ESBs and the EQs. */
> +#define VC_BAR_DEFAULT   0x100ull
> +#define VC_BAR_SIZE  0x080ull
> +
> +#define P9_MMIO_BASE 0x006ull
> +#define P9_CHIP_BASE(id) (P9_MMIO_BASE | (0x400ull * (uint64_t) 
> (id)))

chip-based MMIO addresses leaking into the PAPR model seems like it
might not be what you want

> +static uint64_t xive_esb_default_read(void *p, hwaddr offset, unsigned size)
> +{
> +qemu_log_mask(LOG_UNIMP, "%s: 0x%" HWADDR_PRIx " [%u]\n",
> +  __func__, offset, size);
> +return 0;
> +}
> +
> +static void xive_esb_default_write(void *opaque, hwaddr offset, uint64_t 
> value,
> +unsigned size)
> +{
> +qemu_log_mask(LOG_UNIMP, "%s: 0x%" HWADDR_PRIx " <- 0x%" PRIx64 " 
> [%u]\n",
> +  __func__, offset, value, size);
> +}
> +
> +static const MemoryRegionOps xive_esb_default_ops = {
> +.read = xive_esb_default_read,
> +.write = xive_esb_default_write,
> +.endianness = DEVICE_BIG_ENDIAN,
> +};
>  
>  void xive_reset(void *dev)
>  {
> @@ -435,10 +467,20 @@ static void xive_realize(DeviceState *dev, Error **errp)
>  x->eqdt = g_malloc0(x->nr_targets * XIVE_EQ_PRIORITY_COUNT *
>  sizeof(XiveEQ));
>  
> +/* VC BAR. That's the full window but we will only map the
> + * subregions in use. */
> +x->vc_base = (hwaddr)(P9_CHIP_BASE(x->chip_id) | VC_BAR_DEFAULT);
> +
> +/* install default memory region handlers to log bogus access */
> +memory_region_init_io(&x->esb_iomem, NULL, &xive_esb_default_ops,
> +  NULL, "xive.esb", VC_BAR_SIZE);
> +sysbus_init_mmio(SYS_BUS_DEVICE(dev), &x->esb_iomem);
> +
>  qemu_register_reset(xive_reset, dev);
>  }
>  
>  static Property xive_properties[] = {
> +DEFINE_PROP_UINT32("chip-id", XIVE, chip_id, 0),
>  DEFINE_PROP_UINT32("nr-targets", XIVE, nr_targets, 0),
>  DEFINE_PROP_END_OF_LIST(),
>  };

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
D

Re: [Qemu-devel] [RFC PATCH 04/26] ppc/xive: introduce a skeleton for the XIVE interrupt controller model

2017-07-23 Thread Benjamin Herrenschmidt
On Mon, 2017-07-24 at 13:28 +1000, David Gibson wrote:
> > So yes, in PAPR there's an "allocator" because the hypervisor will
> > create a guest "virtual" (or logical to use PAPR terminology) interrupt
> > number space, in order to represents the various interrupts into the
> > guest.
> 
> Ok, but are each of those logical irqs bound to a specific device/PHB
> line/whatever, or can they be configured by the guest?

So for clarity, let's first establish the terminology :-)

 - HW number is a HW interrupt number on a "bare metal" system or
powernv guest. For now we will ignore those, they are effectively a
side effect of how skiboot configure the XIVE and qemu per-se doesn't
allocate them.

 - A logical number is a "guest physical" interrupt number for a PAPR
guest. These fall into roughly 2 categories at the moment:

* "interrupts" (or related) properties in the DT, typically
interrupts for a PCI device, ranges of MSIs etc... that correspond to
HW sources from a PHB.

* "generic IPIs". Those are ranges of "generic" interrupts that the
hypervisor gives the guest. On a real system, they correspond to chunks
allocated off a HW facility for generic interrupts. Generic interrupts
are the same as normal interrupts from the prespective of
managing/receiving them, but are "triggered" by an MMIO to a certain HW
page. There's a DT property telling the guest the interrupt number
ranges for these guys.

So that logical number above is what a PAPR guest obtains from the DT
and uses for the various H-call used to manage and configure interrupt
sources.

In addition, the XIVE supports renumbering the interrupt number that
you obtain in the queues. Both bare metal linux, KVM and guests make
use of this. This only changes the number you observe in a queue when
you receive an interrupt, it has no effect on the HW number or logical
number used for the various management calls.

This is used by Linux so that:

  - On bare metal systems or PAPR guest with "exploitation mode" (ie,
PAPR guest directly using the XIVE), we put the linux interrupt number
in there as to avoid the reverse-mapping done by linux otherwise when
receiving an interrupt.

  - On PARP guests using the legacy hcalls, KVM configures the logical
number there.

> > Those numbers however are just tokens, they don't have to represent any
> > real HW concept. So they can be "allocated" in a rather fixed way, for
> > example, you could have something like a fixed map where you put all
> > the PCI interrupts at a certain number (a factor of the PHB# with room
> > or a fix number per PHB, maybe 16K or so, the HW does 4K max). Another
> > based would have a chunk of "general purpose" IPIs (for use for actual
> > IPIs and for other things to come). And a range for the virtual device
> > interrupts for example. Or you can just use an allocator.
> 
> Hm.  So what I'm meaning by an "allocator" is something at least
> partially dynamic.  Something you say "give me an irq" and it gives
> you the next available or similar.  As opposed to any mapping from
> devices to (logical) irqs, which the machine will need to supply one
> way or another.

For the sake of repeatability/migration etc... I think a mapping is
better than an allocator.  IE, a fixed number scheme so that the range
of interrupts for PHB#x is always a fixed function of x.

We can fix the number of "generic" interrupts given to a guest. The
only requirements from a PAPR perspective is that there should be at
least as many as there are possible threads in the guest so they can be
used as IPIs.

But we may need more for other things. We can make this a machine
parameter with a default value of something like 4096. If we call N
that number of extra generic interrupts, then the number of generic
interrutps would be #possible-vcpu's + N, or something like that.

> > But it's fundamentally an allocator that sits in the hypervisor, so in
> > our case, I would say in the spapr "component" of XIVE, rather than the
> > XIVE HW model itself.
> 
> Maybe..

You are right in that a mapping is a better term than an allocator
here.

> > Now what Cedric did, because XIVE is very complex and we need something
> > for PAPR quickly, is not a complete HW model, but a somewhat simplified
> > one that only handles what PAPR exposes. So in that case where the
> > allocator sits is a bit of a TBD...
> 
> Hm, ok.  My concern here is that "dynamic" allocation of irqs at the
> machine type level needs extreme caution, or the irqs may not be
> stable which will generally break migration.

Yes you are right. We should probably create a more "static" scheme.

Cheers,
Ben.




Re: [Qemu-devel] About virtio device hotplug in Q35!

2017-07-23 Thread Marcel Apfelbaum

On 24/07/2017 4:47, Zhong Yang wrote:

Hello all,



Hi,


When we did virtio device hotplug in Q35 platform, which always failed in 
hotplug.



Can we please see the QEMU command line and the description
of the hotplug steps?


Would you please tell me how to configure VM to make virtio device hotplug work 
in Q35 platform?  Many thanks!



The virtio devices can be hot-plugged into PCIe Root Ports:
HMP example:
 Start the VM with several PCIe Root Ports:
-M q35 -monitor stdio \
 -device pcie-root-port,id=rp1 \
 -device pcie-root port,id=rp2 \
 ...
 Then from QEMU's hmp:
 device-add virtio-net-pci,id=n1,bus=rp1
 ...


I hope it helps,
Marcel




Regards,

Yang zhong






Re: [Qemu-devel] [RFC PATCH 10/26] ppc/xive: record interrupt source MMIO address for hcalls

2017-07-23 Thread David Gibson
On Wed, Jul 05, 2017 at 07:13:23PM +0200, Cédric Le Goater wrote:
> The address of the MMIO page through which the Event State Buffer is
> controlled is returned to the guest by the H_INT_GET_SOURCE_INFO hcall.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  hw/intc/xive.c| 3 +++
>  include/hw/ppc/xive.h | 1 +
>  2 files changed, 4 insertions(+)
> 
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index a1cb87a07b76..0db97fd33981 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -344,6 +344,9 @@ static void xive_ics_realize(ICSState *ics, Error **errp)
>  ICS_BASE(xs)->offset * (1 << xs->esb_shift),
>  &xs->esb_iomem);
>  
> +/* Record base address which is needed by the hcalls */
> +xs->esb_base = x->vc_base + ICS_BASE(xs)->offset * (1 << xs->esb_shift);

This doesn't seem like it needs to be stored in the persistent object
- it can be calculated when the hcall is made.  Plus if it's for the
hcll it only makes sense for spapr.

>  qemu_register_reset(xive_ics_reset, xs);
>  }
>  
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 1178300c9df3..b06bc861b845 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -43,6 +43,7 @@ struct XiveICSState {
>  
>  uint64_t flags;
>  uint32_t esb_shift;
> +hwaddr   esb_base;
>  MemoryRegion esb_iomem;
>  
>  XIVE *xive;

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [Qemu-devel] [RFC PATCH 11/26] ppc/xics: introduce a print_info() handler to the ICS and ICP objects

2017-07-23 Thread David Gibson
On Wed, Jul 05, 2017 at 07:13:24PM +0200, Cédric Le Goater wrote:
> This handler will be used to customize the ouput of the XIVE interrupt
> source and presenter objects.

I'm not really happy with this without having a clear idea of where
this is heading - are you trying to share ICP and or ICS object
classes between XICS and XIVE, or will they eventually be separated
again?

> 
> Signed-off-by: Cédric Le Goater 
> ---
>  hw/intc/xics.c| 36 
>  include/hw/ppc/xics.h |  2 ++
>  2 files changed, 26 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/intc/xics.c b/hw/intc/xics.c
> index faa5c631f655..7837c2022b4a 100644
> --- a/hw/intc/xics.c
> +++ b/hw/intc/xics.c
> @@ -40,18 +40,26 @@
>  
>  void icp_pic_print_info(ICPState *icp, Monitor *mon)
>  {
> +ICPStateClass *k = ICP_GET_CLASS(icp);
>  int cpu_index = icp->cs ? icp->cs->cpu_index : -1;
>  
>  if (!icp->output) {
>  return;
>  }
> -monitor_printf(mon, "CPU %d XIRR=%08x (%p) PP=%02x MFRR=%02x\n",
> -   cpu_index, icp->xirr, icp->xirr_owner,
> -   icp->pending_priority, icp->mfrr);
> +
> +monitor_printf(mon, "CPU %d ", cpu_index);
> +if (k->print_info) {
> +k->print_info(icp, mon);
> +} else {
> +monitor_printf(mon, "XIRR=%08x (%p) PP=%02x MFRR=%02x\n",
> +   icp->xirr, icp->xirr_owner,
> +   icp->pending_priority, icp->mfrr);
> +}
>  }
>  
>  void ics_pic_print_info(ICSState *ics, Monitor *mon)
>  {
> +ICSStateClass *k = ICS_BASE_GET_CLASS(ics);
>  uint32_t i;
>  
>  monitor_printf(mon, "ICS %4x..%4x %p\n",
> @@ -61,17 +69,21 @@ void ics_pic_print_info(ICSState *ics, Monitor *mon)
>  return;
>  }
>  
> -for (i = 0; i < ics->nr_irqs; i++) {
> -ICSIRQState *irq = ics->irqs + i;
> +if (k->print_info) {
> +k->print_info(ics, mon);
> +} else {
> +for (i = 0; i < ics->nr_irqs; i++) {
> +ICSIRQState *irq = ics->irqs + i;
>  
> -if (!(irq->flags & XICS_FLAGS_IRQ_MASK)) {
> -continue;
> +if (!(irq->flags & XICS_FLAGS_IRQ_MASK)) {
> +continue;
> +}
> +monitor_printf(mon, "  %4x %s %02x %02x\n",
> +   ics->offset + i,
> +   (irq->flags & XICS_FLAGS_IRQ_LSI) ?
> +   "LSI" : "MSI",
> +   irq->priority, irq->status);
>  }
> -monitor_printf(mon, "  %4x %s %02x %02x\n",
> -   ics->offset + i,
> -   (irq->flags & XICS_FLAGS_IRQ_LSI) ?
> -   "LSI" : "MSI",
> -   irq->priority, irq->status);
>  }
>  }
>  
> diff --git a/include/hw/ppc/xics.h b/include/hw/ppc/xics.h
> index 28d248abad61..902f3bfd0e33 100644
> --- a/include/hw/ppc/xics.h
> +++ b/include/hw/ppc/xics.h
> @@ -69,6 +69,7 @@ struct ICPStateClass {
>  void (*pre_save)(ICPState *icp);
>  int (*post_load)(ICPState *icp, int version_id);
>  void (*reset)(ICPState *icp);
> +void (*print_info)(ICPState *icp, Monitor *mon);
>  };
>  
>  struct ICPState {
> @@ -119,6 +120,7 @@ struct ICSStateClass {
>  void (*reject)(ICSState *s, uint32_t irq);
>  void (*resend)(ICSState *s);
>  void (*eoi)(ICSState *s, uint32_t irq);
> +void (*print_info)(ICSState *s, Monitor *mon);
>  };
>  
>  struct ICSState {

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


[Qemu-devel] [GIT PULL for qemu-pseries] pseries: Update SLOF firmware image

2017-07-23 Thread Alexey Kardashevskiy
The following changes since commit 91939262ffcd3c85ea6a4793d3029326eea1d649:

  configure: Drop ancient Solaris 9 and earlier support (2017-07-21 15:04:05 
+0100)

are available in the git repository at:

  g...@github.com:aik/qemu.git tags/qemu-slof-20170724

for you to fetch changes up to 9f581865da041772eb2e1700c30d75489fe2c971:

  pseries: Update SLOF firmware image (2017-07-24 15:25:04 +1000)


Alexey Kardashevskiy (1):
  pseries: Update SLOF firmware image

 pc-bios/README   |   2 +-
 pc-bios/slof.bin | Bin 902120 -> 905200 bytes
 roms/SLOF|   2 +-
 3 files changed, 2 insertions(+), 2 deletions(-)


*** Note: this is not for master, this is for pseries



Re: [Qemu-devel] [RFC PATCH 04/26] ppc/xive: introduce a skeleton for the XIVE interrupt controller model

2017-07-23 Thread David Gibson
On Mon, Jul 24, 2017 at 03:04:05PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2017-07-24 at 13:28 +1000, David Gibson wrote:
> > > So yes, in PAPR there's an "allocator" because the hypervisor will
> > > create a guest "virtual" (or logical to use PAPR terminology) interrupt
> > > number space, in order to represents the various interrupts into the
> > > guest.
> > 
> > Ok, but are each of those logical irqs bound to a specific device/PHB
> > line/whatever, or can they be configured by the guest?
> 
> So for clarity, let's first establish the terminology :-)
> 
>  - HW number is a HW interrupt number on a "bare metal" system or
> powernv guest. For now we will ignore those, they are effectively a
> side effect of how skiboot configure the XIVE and qemu per-se doesn't
> allocate them.
> 
>  - A logical number is a "guest physical" interrupt number for a PAPR
> guest. These fall into roughly 2 categories at the moment:
> 
> * "interrupts" (or related) properties in the DT, typically
> interrupts for a PCI device, ranges of MSIs etc... that correspond to
> HW sources from a PHB.

Ok, I think this is the one I've mostly been thinking of.

> * "generic IPIs". Those are ranges of "generic" interrupts that the
> hypervisor gives the guest. On a real system, they correspond to chunks
> allocated off a HW facility for generic interrupts. Generic interrupts
> are the same as normal interrupts from the prespective of
> managing/receiving them, but are "triggered" by an MMIO to a certain HW
> page. There's a DT property telling the guest the interrupt number
> ranges for these guys.
> 
> So that logical number above is what a PAPR guest obtains from the DT
> and uses for the various H-call used to manage and configure interrupt
> sources.

Ok.

> In addition, the XIVE supports renumbering the interrupt number that
> you obtain in the queues. Both bare metal linux, KVM and guests make
> use of this. This only changes the number you observe in a queue when
> you receive an interrupt, it has no effect on the HW number or logical
> number used for the various management calls.

Ok.

> This is used by Linux so that:
> 
>   - On bare metal systems or PAPR guest with "exploitation mode" (ie,
> PAPR guest directly using the XIVE), we put the linux interrupt number
> in there as to avoid the reverse-mapping done by linux otherwise when
> receiving an interrupt.
> 
>   - On PARP guests using the legacy hcalls, KVM configures the logical
> number there.

Ok.

> > > Those numbers however are just tokens, they don't have to represent any
> > > real HW concept. So they can be "allocated" in a rather fixed way, for
> > > example, you could have something like a fixed map where you put all
> > > the PCI interrupts at a certain number (a factor of the PHB# with room
> > > or a fix number per PHB, maybe 16K or so, the HW does 4K max). Another
> > > based would have a chunk of "general purpose" IPIs (for use for actual
> > > IPIs and for other things to come). And a range for the virtual device
> > > interrupts for example. Or you can just use an allocator.
> > 
> > Hm.  So what I'm meaning by an "allocator" is something at least
> > partially dynamic.  Something you say "give me an irq" and it gives
> > you the next available or similar.  As opposed to any mapping from
> > devices to (logical) irqs, which the machine will need to supply one
> > way or another.
> 
> For the sake of repeatability/migration etc... I think a mapping is
> better than an allocator.  IE, a fixed number scheme so that the range
> of interrupts for PHB#x is always a fixed function of x.

Yes, I agree.  In fact that's pretty much exactly the point I'm trying
to make.

Can we assign our logical numbers sparsely, or will that cause other
problems?

Note that for PAPR we also have the question of finding logical
interrupts for legacy PAPR VIO devices.

> We can fix the number of "generic" interrupts given to a guest. The
> only requirements from a PAPR perspective is that there should be at
> least as many as there are possible threads in the guest so they can be
> used as IPIs.

Ok.  If we can do things sparsely, allocating these well away from the
hw interrupts would make things easier.

> But we may need more for other things. We can make this a machine
> parameter with a default value of something like 4096. If we call N
> that number of extra generic interrupts, then the number of generic
> interrutps would be #possible-vcpu's + N, or something like that.

That seems reasonable.

> > > But it's fundamentally an allocator that sits in the hypervisor, so in
> > > our case, I would say in the spapr "component" of XIVE, rather than the
> > > XIVE HW model itself.
> > 
> > Maybe..
> 
> You are right in that a mapping is a better term than an allocator
> here.
> 
> > > Now what Cedric did, because XIVE is very complex and we need something
> > > for PAPR quickly, is not a complete HW model, but a somewhat simplified
> > > one that only handles what PAPR 

Re: [Qemu-devel] [PATCH qemu v4 1/3] spapr_iommu: Realloc guest visible TCE table when hot(un)plugging vfio-pci

2017-07-23 Thread David Gibson
On Thu, Jul 20, 2017 at 05:22:29PM +1000, Alexey Kardashevskiy wrote:
> This replaces g_malloc() with spapr_tce_alloc_table() as this is
> the standard way of allocating tables and this allows moving the table
> back to KVM when unplugging a VFIO PCI device and VFIO TCE acceleration
> support is not present in the KVM.

So, I like the idea here, and I think it will work for now, but one
thing concerns me.

AIUI, your future plans include allowing in-kernel accelerated TCE
tables even with VFIO devices.  When that happens, we might have an
in-kernel table both before and after a change in the "need_vfio"
state.

But you won't be able to have two in-kernel tables allocated at once,
whereas this code relies on having both the old and new tables at once
so it can copy the contents over.

How do you intend to handle that?

> Although spapr_tce_alloc_table() is expected to fail with EBUSY
> if called when previous fd is not closed yet, in practice we will not
> see it because cap_spapr_vfio is false at the moment.

I don't follow this.  How would cap_spapr_vfio be changing at any
point?

> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  hw/ppc/spapr_iommu.c | 35 ++-
>  1 file changed, 14 insertions(+), 21 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index e614621a83..307dc3021e 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -275,33 +275,26 @@ static int spapr_tce_table_realize(DeviceState *dev)
>  void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool need_vfio)
>  {
>  size_t table_size = tcet->nb_table * sizeof(uint64_t);
> -void *newtable;
> +uint64_t *oldtable;
> +int newfd = -1;
>  
> -if (need_vfio == tcet->need_vfio) {
> -/* Nothing to do */
> -return;
> -}
> +g_assert(need_vfio != tcet->need_vfio);
>  
> -if (!need_vfio) {
> -/* FIXME: We don't support transition back to KVM accelerated
> - * TCEs yet */
> -return;
> -}
> +tcet->need_vfio = need_vfio;
>  
> -tcet->need_vfio = true;
> +oldtable = tcet->table;
>  
> -if (tcet->fd < 0) {
> -/* Table is already in userspace, nothing to be do */
> -return;
> -}
> +tcet->table = spapr_tce_alloc_table(tcet->liobn,
> +tcet->page_shift,
> +tcet->bus_offset,
> +tcet->nb_table,
> +&newfd,
> +need_vfio);
> +memcpy(tcet->table, oldtable, table_size);
>  
> -newtable = g_malloc(table_size);
> -memcpy(newtable, tcet->table, table_size);
> +spapr_tce_free_table(oldtable, tcet->fd, tcet->nb_table);
>  
> -kvmppc_remove_spapr_tce(tcet->table, tcet->fd, tcet->nb_table);
> -
> -tcet->fd = -1;
> -tcet->table = newtable;
> +tcet->fd = newfd;
>  }
>  
>  sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [Qemu-devel] [RFC PATCH 06/26] ppc/xive: introduce a XIVE interrupt source model

2017-07-23 Thread Alexey Kardashevskiy
On 24/07/17 14:02, David Gibson wrote:
> On Wed, Jul 05, 2017 at 07:13:19PM +0200, Cédric Le Goater wrote:
>> This is very similar to the current ICS_SIMPLE model in XICS. We try
>> to reuse the ICS model because the sPAPR machine is tied to the
>> XICSFabric interface and should be using a common framework to switch
>> from one controller model to another: XICS <-> XIVE.
> 
> Hm.  I'm not entirely concvinced re-using the xics ICSState class in
> this way is a good idea, though maybe it's a reasonable first step.
> With this patch alone some code is shared, but there are some real
> uglies around the edges.


Agree, using the "ICS" term in XIVE is quite confusing as "ICS" is not
mentioned in neither XIVE nor P9 specs.

> 
> Seems to me at least long term you need to either 1) make the XIVE ics
> separate, even if it has similarities to the XICS one or 2) truly
> unify them, with a common base type and methods to handle the
> differences.
> 
> 
>> The next patch will introduce the MMIO handlers to interact with XIVE
>> interrupt sources.
>>
>> Signed-off-by: Cédric Le Goater 
>> ---
>>  hw/intc/xive.c| 110 
>> ++
>>  include/hw/ppc/xive.h |  12 ++
>>  2 files changed, 122 insertions(+)
>>
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 5b14d8155317..9ff14c0da595 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -26,6 +26,115 @@
>>  
>>  #include "xive-internal.h"
>>  
>> +static void xive_icp_irq(XiveICSState *xs, int lisn)
>> +{
>> +
>> +}
>> +
>> +/*
>> + * XIVE Interrupt Source
>> + */
>> +static void xive_ics_set_irq_msi(XiveICSState *xs, int srcno, int val)
>> +{
>> +if (val) {
>> +xive_icp_irq(xs, srcno + ICS_BASE(xs)->offset);
>> +}
>> +}
>> +
>> +static void xive_ics_set_irq_lsi(XiveICSState *xs, int srcno, int val)
>> +{
>> +ICSIRQState *irq = &ICS_BASE(xs)->irqs[srcno];
>> +
>> +if (val) {
>> +irq->status |= XICS_STATUS_ASSERTED;
>> +} else {
>> +irq->status &= ~XICS_STATUS_ASSERTED;
>> +}
>> +
>> +if (irq->status & XICS_STATUS_ASSERTED
>> +&& !(irq->status & XICS_STATUS_SENT)) {
>> +irq->status |= XICS_STATUS_SENT;
>> +xive_icp_irq(xs, srcno + ICS_BASE(xs)->offset);
>> +}
>> +}
>> +
>> +static void xive_ics_set_irq(void *opaque, int srcno, int val)
>> +{
>> +XiveICSState *xs = ICS_XIVE(opaque);
>> +ICSIRQState *irq = &ICS_BASE(xs)->irqs[srcno];
>> +
>> +if (irq->flags & XICS_FLAGS_IRQ_LSI) {
>> +xive_ics_set_irq_lsi(xs, srcno, val);
>> +} else {
>> +xive_ics_set_irq_msi(xs, srcno, val);
>> +}
>> +}
> 
> e.g. you have some code re-use, but still need to more-or-less
> duplicate the set_irq code as above.
> 
>> +static void xive_ics_reset(void *dev)
>> +{
>> +ICSState *ics = ICS_BASE(dev);
>> +int i;
>> +uint8_t flags[ics->nr_irqs];
>> +
>> +for (i = 0; i < ics->nr_irqs; i++) {
>> +flags[i] = ics->irqs[i].flags;
>> +}
>> +
>> +memset(ics->irqs, 0, sizeof(ICSIRQState) * ics->nr_irqs);
>> +
>> +for (i = 0; i < ics->nr_irqs; i++) {
>> +ics->irqs[i].flags = flags[i];
>> +}
> 
> This save, clear, restore is also kind ugly.  I'm also not sure why
> this needs a reset method when I can't find one for the xics ICS.
> 
> Does the xics irqstate structure really cover what you need for xive?
> I had the impression elsewhere that xive had a different priority
> model to xics.  And there's the xics pointer in the icsstate structure
> which is definitely redundant.
> 
>> +}
>> +
>> +static void xive_ics_realize(ICSState *ics, Error **errp)
>> +{
>> +XiveICSState *xs = ICS_XIVE(ics);
>> +Object *obj;
>> +Error *err = NULL;
>> +
>> +obj = object_property_get_link(OBJECT(xs), "xive", &err);
>> +if (!obj) {
>> +error_setg(errp, "%s: required link 'xive' not found: %s",
>> +   __func__, error_get_pretty(err));
>> +return;
>> +}
>> +xs->xive = XIVE(obj);
>> +
>> +if (!ics->nr_irqs) {
>> +error_setg(errp, "Number of interrupts needs to be greater 0");
>> +return;
>> +}
>> +
>> +ics->irqs = g_malloc0(ics->nr_irqs * sizeof(ICSIRQState));
>> +ics->qirqs = qemu_allocate_irqs(xive_ics_set_irq, xs, ics->nr_irqs);
>> +
>> +qemu_register_reset(xive_ics_reset, xs);
>> +}
>> +
>> +static Property xive_ics_properties[] = {
>> +DEFINE_PROP_UINT32("nr-irqs", ICSState, nr_irqs, 0),
>> +DEFINE_PROP_UINT32("irq-base", ICSState, offset, 0),
>> +DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void xive_ics_class_init(ObjectClass *klass, void *data)
>> +{
>> +DeviceClass *dc = DEVICE_CLASS(klass);
>> +ICSStateClass *isc = ICS_BASE_CLASS(klass);
>> +
>> +isc->realize = xive_ics_realize;
>> +
>> +dc->props = xive_ics_properties;
>> +}
>> +
>> +static const TypeInfo xive_ics_info = {
>> +.name = TYPE_ICS_XIVE,
>> +.parent = TYPE_ICS_BASE,
>> +.instance_size = sizeof

[Qemu-devel] [Device passthrough] Is there a way to passthrough PCIE switch/bridge ?

2017-07-23 Thread Bob Chen
- Bob


Re: [Qemu-devel] [RFC PATCH 13/26] ppc/xive: introduce a XIVE interrupt presenter model

2017-07-23 Thread David Gibson
On Wed, Jul 05, 2017 at 07:13:26PM +0200, Cédric Le Goater wrote:
> Just like the interrupt source model, we try to reuse the ICP model
> because the sPAPR machine is tied to the XICSFabric interface and
> should be using a common framework to switch from one controller model
> to another: XICS <-> XIVE.
> 
> The XIVE interrupt presenter exposes a set of Thread Interrupt
> Management Areas, also called rings, one per different level of
> privilege (four in all). We only expose the OS ring for the sPAPR
> support for the moment. This area is used to handle priority
> management and interrupt acknowledgment among other things.
> 
> The next patch will introduce the MMIO handlers to interact with the
> TIMA, OS only.
> 
> Signed-off-by: Cédric Le Goater 

As with the ICS, I'm not really clear where you're going with this.
Is this a first step towards independent xics and xive ICP objects, or
a first step towards fully unified xics/xive ICPs?

> ---
>  hw/intc/xive-internal.h | 84 
> +
>  hw/intc/xive.c  | 43 +
>  include/hw/ppc/xive.h   | 14 +
>  3 files changed, 141 insertions(+)
> 
> diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
> index c06be823aad0..ba5e648a5258 100644
> --- a/hw/intc/xive-internal.h
> +++ b/hw/intc/xive-internal.h
> @@ -24,6 +24,90 @@
>  #define PPC_BITMASK32(bs, be)   ((PPC_BIT32(bs) - PPC_BIT32(be)) | \
>   PPC_BIT32(bs))
>  
> +/*
> + * Thread Management (aka "TM") registers
> + */
> +
> +/* TM register offsets */
> +#define TM_QW0_USER 0x000 /* All rings */
> +#define TM_QW1_OS   0x010 /* Ring 0..2 */
> +#define TM_QW2_HV_POOL  0x020 /* Ring 0..1 */
> +#define TM_QW3_HV_PHYS  0x030 /* Ring 0..1 */
> +
> +/* Byte offsets inside a QW QW0 QW1 QW2 QW3 */
> +#define TM_NSR  0x0  /*  +   +   -   +  */
> +#define TM_CPPR 0x1  /*  -   +   -   +  */
> +#define TM_IPB  0x2  /*  -   +   +   +  */
> +#define TM_LSMFB0x3  /*  -   +   +   +  */
> +#define TM_ACK_CNT  0x4  /*  -   +   -   -  */
> +#define TM_INC  0x5  /*  -   +   -   +  */
> +#define TM_AGE  0x6  /*  -   +   -   +  */
> +#define TM_PIPR 0x7  /*  -   +   -   +  */
> +
> +#define TM_WORD00x0
> +#define TM_WORD10x4
> +
> +/*
> + * QW word 2 contains the valid bit at the top and other fields
> + * depending on the QW.
> + */
> +#define TM_WORD20x8
> +#define   TM_QW0W2_VU   PPC_BIT32(0)
> +#define   TM_QW0W2_LOGIC_SERV   PPC_BITMASK32(1, 31) /* XX 2,31 ? */
> +#define   TM_QW1W2_VO   PPC_BIT32(0)
> +#define   TM_QW1W2_OS_CAM   PPC_BITMASK32(8, 31)
> +#define   TM_QW2W2_VP   PPC_BIT32(0)
> +#define   TM_QW2W2_POOL_CAM PPC_BITMASK32(8, 31)
> +#define   TM_QW3W2_VT   PPC_BIT32(0)
> +#define   TM_QW3W2_LP   PPC_BIT32(6)
> +#define   TM_QW3W2_LE   PPC_BIT32(7)
> +#define   TM_QW3W2_TPPC_BIT32(31)
> +
> +/*
> + * In addition to normal loads to "peek" and writes (only when invalid)
> + * using 4 and 8 bytes accesses, the above registers support these
> + * "special" byte operations:
> + *
> + *   - Byte load from QW0[NSR] - User level NSR (EBB)
> + *   - Byte store to QW0[NSR] - User level NSR (EBB)
> + *   - Byte load/store to QW1[CPPR] and QW3[CPPR] - CPPR access
> + *   - Byte load from QW3[TM_WORD2] - Read VT||0||LP||LE on thrd 0
> + *otherwise VT||000
> + *   - Byte store to QW3[TM_WORD2] - Set VT bit (and LP/LE if present)
> + *
> + * Then we have all these "special" CI ops at these offset that trigger
> + * all sorts of side effects:
> + */
> +#define TM_SPC_ACK_EBB  0x800   /* Load8 ack EBB to reg*/
> +#define TM_SPC_ACK_OS_REG   0x810   /* Load16 ack OS irq to reg */
> +#define TM_SPC_PUSH_USR_CTX 0x808   /* Store32 Push/Validate user 
> context */
> +#define TM_SPC_PULL_USR_CTX 0x808   /* Load32 Pull/Invalidate user
> + * context */
> +#define TM_SPC_SET_OS_PENDING   0x812   /* Store8 Set OS irq pending bit */
> +#define TM_SPC_PULL_OS_CTX  0x818   /* Load32/Load64 Pull/Invalidate OS
> + * context to reg */
> +#define TM_SPC_PULL_POOL_CTX0x828   /* Load32/Load64 Pull/Invalidate Pool
> + * context to reg*/
> +#define TM_SPC_ACK_HV_REG   0x830   /* Load16 ack HV irq to reg */
> +#define TM_SPC_PULL_USR_CTX_OL  0xc08   /* Store8 Pull/Inval usr ctx to odd
> + * line */
> +#define TM_SPC_ACK_OS_EL0xc10   /* Store8 ack OS irq to even line */
> +#define TM_SPC_ACK_HV_POOL_EL   0xc20   /* Store8 ack HV evt pool to even
> + * line */
> +#defin

Re: [Qemu-devel] [RFC PATCH 09/26] ppc/xive: add an overall memory region for the ESBs

2017-07-23 Thread Benjamin Herrenschmidt
On Mon, 2017-07-24 at 14:49 +1000, David Gibson wrote:
> On Wed, Jul 05, 2017 at 07:13:22PM +0200, Cédric Le Goater wrote:
> > Each source adds its own ESB mempry region to the overall ESB memory
> > region of the controller. It will be mapped in the CPU address space
> > when XIVE is activated.
> > 
> > The default mapping address for the ESB memory region is the same one
> > used on baremetal.
> > 
> > Signed-off-by: Cédric Le Goater 
> > ---
> >  hw/intc/xive-internal.h |  5 +
> >  hw/intc/xive.c  | 44 +++-
> >  2 files changed, 48 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
> > index 8e755aa88a14..c06be823aad0 100644
> > --- a/hw/intc/xive-internal.h
> > +++ b/hw/intc/xive-internal.h
> > @@ -98,6 +98,7 @@ struct XIVE {
> >  SysBusDevice parent;
> >  
> >  /* Properties */
> > +uint32_t chip_id;
> 
> So there is a XIVE object per chip.  How does this work on PAPR?  One
> logical chip/XIVE, or something more complex?

One global XIVE for PAPR. For the MMIOs, the way it works is that:

 - For MMIOs pertaining to a specific interrupt or queue, there's an H-
call that will return the proper "guest physical" address. For qemu
with KVM we'll have to probably create a single chunk of qemu address
space (a single mem region) that contains individual pages mapped with
MAP_FIXED originating from the different HW bits, we still need to sort
out how exactly we'll do that in practice.

 - For the TIMA (the presentation MMIOs), those are always at the same
physical address for everybody (so for a guest it's a single memory
region we'll map to that physical address), the HW "knows" which HW
thread is talking to it (and the hypervisor tells the HW which vcpu is
running on a given HW thread at a given point in time). That address is
obtained from the device-tree

> >  uint32_t nr_targets;
> >  
> >  /* IRQ number allocator */
> > @@ -111,6 +112,10 @@ struct XIVE {
> >  void *sbe;
> >  XiveIVE  *ivt;
> >  XiveEQ   *eqdt;
> > +
> > +/* ESB and TIMA memory location */
> > +hwaddr   vc_base;
> > +MemoryRegion esb_iomem;
> >  };
> >  
> >  void xive_reset(void *dev);
> > diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> > index 8f8bb8b787bd..a1cb87a07b76 100644
> > --- a/hw/intc/xive.c
> > +++ b/hw/intc/xive.c
> > @@ -312,6 +312,7 @@ static void xive_ics_realize(ICSState *ics, Error 
> > **errp)
> >  XiveICSState *xs = ICS_XIVE(ics);
> >  Object *obj;
> >  Error *err = NULL;
> > +XIVE *x;
> 
> I don't really like just 'x' for a context variable like this (as
> opposed to a temporary).
> 
> >  
> >  obj = object_property_get_link(OBJECT(xs), "xive", &err);
> >  if (!obj) {
> > @@ -319,7 +320,7 @@ static void xive_ics_realize(ICSState *ics, Error 
> > **errp)
> > __func__, error_get_pretty(err));
> >  return;
> >  }
> > -xs->xive = XIVE(obj);
> > +x = xs->xive = XIVE(obj);
> >  
> >  if (!ics->nr_irqs) {
> >  error_setg(errp, "Number of interrupts needs to be greater 0");
> > @@ -338,6 +339,11 @@ static void xive_ics_realize(ICSState *ics, Error 
> > **errp)
> >"xive.esb",
> >(1ull << xs->esb_shift) * ICS_BASE(xs)->nr_irqs);
> >  
> > +/* Install the ESB memory region in the overall one */
> > +memory_region_add_subregion(&x->esb_iomem,
> > +ICS_BASE(xs)->offset * (1 << 
> > xs->esb_shift),
> > +&xs->esb_iomem);
> > +
> >  qemu_register_reset(xive_ics_reset, xs);
> >  }
> >  
> > @@ -375,6 +381,32 @@ static const TypeInfo xive_ics_info = {
> >   */
> >  #define MAX_HW_IRQS_ENTRIES (8 * 1024)
> >  
> > +/* VC BAR contains set translations for the ESBs and the EQs. */
> > +#define VC_BAR_DEFAULT   0x100ull
> > +#define VC_BAR_SIZE  0x080ull
> > +
> > +#define P9_MMIO_BASE 0x006ull
> > +#define P9_CHIP_BASE(id) (P9_MMIO_BASE | (0x400ull * (uint64_t) 
> > (id)))
> 
> chip-based MMIO addresses leaking into the PAPR model seems like it
> might not be what you want
> 
> > +static uint64_t xive_esb_default_read(void *p, hwaddr offset, unsigned 
> > size)
> > +{
> > +qemu_log_mask(LOG_UNIMP, "%s: 0x%" HWADDR_PRIx " [%u]\n",
> > +  __func__, offset, size);
> > +return 0;
> > +}
> > +
> > +static void xive_esb_default_write(void *opaque, hwaddr offset, uint64_t 
> > value,
> > +unsigned size)
> > +{
> > +qemu_log_mask(LOG_UNIMP, "%s: 0x%" HWADDR_PRIx " <- 0x%" PRIx64 " 
> > [%u]\n",
> > +  __func__, offset, value, size);
> > +}
> > +
> > +static const MemoryRegionOps xive_esb_default_ops = {
> > +.read = xive_esb_default_read,
> > +.write = xive_esb_default_write,
> > +.endianness = DEVICE_BIG_ENDIAN,
> > +};
> >  
> >  void xive_reset(void *dev

Re: [Qemu-devel] Unified Datagram Socket Transport

2017-07-23 Thread Anton Ivanov
That is an option as well. 

I am travelling this week and will get back to this first thing next week.

A.

On 24 July 2017 05:17:25 CEST, Jason Wang  wrote:
>
>
>On 2017年07月21日 12:25, Anton Ivanov wrote:
>>
>>
>> On 21/07/17 04:55, Jason Wang wrote:
>>>
>>>
>>> On 2017年07月21日 03:12, anton.iva...@cambridgegreys.com wrote:
 Hi all,

 This addresses comments so far except Eric's suggestion to use
 InetSocketAddressBase. If I understand correctly its intended use,
 it will not be of help for protocols which have no port (raw
 sockets - GRE, L2TPv3, etc).

 It also includes a port of the original socket.c transport to
 the new UDST backend. The relevant code is ifdef-ed so there
 should be no effect on other systems.
>>>
>>> This looks sub-optimal. If you want to do this, I would rather 
>>> suggest you just extend the socket dgram backend like what udst did
>now.
>>
>> Apologies, do you mean extend it further to handle the tcp form? 
>
>Not that far, since you try to convert net_dgram_socket_info, I'm 
>thinking just do the all udst in net_dgram_socket. For recvmmsg you can
>
>have transport specific callback for this.
>
>Thanks

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [Qemu-devel] [PULL 6/9] Convert error_report() to warn_report()

2017-07-23 Thread Markus Armbruster
Markus Armbruster  writes:

> Kevin Wolf  writes:
>
>> Am 13.07.2017 um 15:27 hat Markus Armbruster geschrieben:
>>> From: Alistair Francis 
>>> 
>>> Convert all uses of error_report("warning:"... to use warn_report()
>>> instead. This helps standardise on a single method of printing warnings
>>> to the user.
>>> 
>>> All of the warnings were changed using these two commands:
>>> find ./* -type f -exec sed -i \
>>>   's|error_report(".*warning[,:] |warn_report("|Ig' {} +
>>> 
>>> Indentation fixed up manually afterwards.
>>> 
>>> The test-qdev-global-props test case was manually updated to ensure that
>>> this patch passes make check (as the test cases are case sensitive).
>>
>> This patch broke qemu-iotests 051 because it neglected to update the
>> reference output. Not sure if a change of the message was even intended,
>> but with a error location prefix, the order changes:
>>
>> -(qemu) QEMU_PROG: -drive if=scsi,media=cdrom: warning: bus=0,unit=0 is 
>> deprecated with this machine type
>> +(qemu) warning: qemu-system-x86_64: -drive if=scsi,media=cdrom: 
>> bus=0,unit=0 is deprecated with this machine type
>>
>> Personally, I would expect the error location or at least the program
>> name to come first even for warnings.
>
> I'll fix it.
>
> While focusing on something other than block, I forget qemu-iotests
> exist.  My fault, but it's a pretty common fault.  I reiterate my plea
> to include (a sensible subset of) it in "make check".

Oh, 051 isn't run by "make check-block".  It could just as well not
exist then.

What's the recommended way to run all iotests a build of QEMU can run?



Re: [Qemu-devel] [RFC PATCH 14/26] ppc/xive: add MMIO handlers to the XIVE interrupt presenter model

2017-07-23 Thread David Gibson
On Wed, Jul 05, 2017 at 07:13:27PM +0200, Cédric Le Goater wrote:
> The Thread Interrupt Management Area for the OS is mostly used to
> acknowledge interrupts and set the CPPR of the CPU.
> 
> The TIMA is mapped at the same address for each CPU. 'current_cpu' is
> used to retrieve the targeted interrupt presenter object.
> 
> Signed-off-by: Cédric Le Goater 

Am I right in thinking that this shoehorns the XIVE TIMA state into
the existing XICS ICP object.  That.. doesn't seem like a good idea.

> ---
>  hw/intc/xive-internal.h |   4 ++
>  hw/intc/xive.c  | 187 
> 
>  2 files changed, 191 insertions(+)
> 
> diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
> index ba5e648a5258..5e8b78a1ea6a 100644
> --- a/hw/intc/xive-internal.h
> +++ b/hw/intc/xive-internal.h
> @@ -200,6 +200,10 @@ struct XIVE {
>  /* ESB and TIMA memory location */
>  hwaddr   vc_base;
>  MemoryRegion esb_iomem;
> +
> +uint32_t tm_shift;
> +hwaddr   tm_base;
> +MemoryRegion tm_iomem;
>  };
>  
>  void xive_reset(void *dev);
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index c08a4f8efb58..82b2f0dcda0b 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -26,6 +26,180 @@
>  
>  #include "xive-internal.h"
>  
> +static uint8_t priority_to_ipb(uint8_t priority)
> +{
> +return priority < XIVE_EQ_PRIORITY_COUNT ? 1 << (7 - priority) : 0;
> +}
> +
> +static uint64_t xive_icp_accept(XiveICPState *xicp)
> +{
> +ICPState *icp = ICP(xicp);
> +uint8_t nsr = xicp->tima_os[TM_NSR];
> +
> +qemu_irq_lower(icp->output);
> +
> +if (xicp->tima_os[TM_NSR] & TM_QW1_NSR_EO) {
> +uint8_t cppr = xicp->tima_os[TM_PIPR];
> +
> +xicp->tima_os[TM_CPPR] = cppr;
> +
> +/* Reset the pending buffer bit */
> +xicp->tima_os[TM_IPB] &= ~priority_to_ipb(cppr);
> +
> +/* Drop Exception bit for OS */
> +xicp->tima_os[TM_NSR] &= ~TM_QW1_NSR_EO;
> +}
> +
> +return (nsr << 8) | xicp->tima_os[TM_CPPR];
> +}
> +
> +static void xive_icp_set_cppr(XiveICPState *xicp, uint8_t cppr)
> +{
> +if (cppr > XIVE_PRIORITY_MAX) {
> +cppr = 0xff;
> +}
> +
> +xicp->tima_os[TM_CPPR] = cppr;
> +}
> +
> +/*
> + * Thread Interrupt Management Area MMIO
> + */
> +static uint64_t xive_tm_read_special(XiveICPState *icp, hwaddr offset,
> + unsigned size)
> +{
> +uint64_t ret = -1;
> +
> +if (offset == TM_SPC_ACK_OS_REG && size == 2) {
> +ret = xive_icp_accept(icp);
> +} else {
> +qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
> +  HWADDR_PRIx" size %d\n", offset, size);
> +}
> +
> +return ret;
> +}
> +
> +static uint64_t xive_tm_read(void *opaque, hwaddr offset, unsigned size)
> +{
> +PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> +XiveICPState *icp = XIVE_ICP(cpu->intc);
> +uint64_t ret = -1;
> +int i;
> +
> +if (offset >= TM_SPC_ACK_EBB) {
> +return xive_tm_read_special(icp, offset, size);
> +}
> +
> +if (offset & TM_QW1_OS) {
> +switch (size) {
> +case 1:
> +case 2:
> +case 4:
> +case 8:
> +if (QEMU_IS_ALIGNED(offset, size)) {
> +ret = 0;
> +for (i = 0; i < size; i++) {
> +ret |= icp->tima[offset + i] << (8 * i);
> +}
> +} else {
> +qemu_log_mask(LOG_GUEST_ERROR,
> +  "XIVE: invalid TIMA read alignment @%"
> +  HWADDR_PRIx" size %d\n", offset, size);
> +}
> +break;
> +default:
> +g_assert_not_reached();
> +}
> +} else {
> +qemu_log_mask(LOG_UNIMP, "XIVE: does handle non-OS TIMA ring @%"
> +  HWADDR_PRIx"\n", offset);
> +}
> +
> +return ret;
> +}
> +
> +static bool xive_tm_is_readonly(uint8_t index)
> +{
> +/* Let's be optimistic and prepare ground for HV mode support */
> +switch (index) {
> +case TM_QW1_OS + TM_CPPR:
> +return false;
> +default:
> +return true;
> +}
> +}
> +
> +static void xive_tm_write_special(XiveICPState *xicp, hwaddr offset,
> +  uint64_t value, unsigned size)
> +{
> +if (offset == TM_SPC_SET_OS_PENDING && size == 1) {
> +xicp->tima_os[TM_IPB] |= priority_to_ipb(value & 0xff);
> +} else {
> +qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA write @%"
> +  HWADDR_PRIx" size %d\n", offset, size);
> +}
> +
> +/* TODO: support TM_SPC_ACK_OS_EL */
> +}
> +
> +static void xive_tm_write(void *opaque, hwaddr offset,
> +   uint64_t value, unsigned size)
> +{
> +PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> +XiveICPState *icp = XIVE_ICP(cpu->intc);
> +int i;
> +
> +if (offset >= TM_SPC_ACK_EBB) {
>

Re: [Qemu-devel] [RFC PATCH 09/26] ppc/xive: add an overall memory region for the ESBs

2017-07-23 Thread David Gibson
On Mon, Jul 24, 2017 at 04:09:31PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2017-07-24 at 14:49 +1000, David Gibson wrote:
> > On Wed, Jul 05, 2017 at 07:13:22PM +0200, Cédric Le Goater wrote:
> > > Each source adds its own ESB mempry region to the overall ESB memory
> > > region of the controller. It will be mapped in the CPU address space
> > > when XIVE is activated.
> > > 
> > > The default mapping address for the ESB memory region is the same one
> > > used on baremetal.
> > > 
> > > Signed-off-by: Cédric Le Goater 
> > > ---
> > >  hw/intc/xive-internal.h |  5 +
> > >  hw/intc/xive.c  | 44 +++-
> > >  2 files changed, 48 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
> > > index 8e755aa88a14..c06be823aad0 100644
> > > --- a/hw/intc/xive-internal.h
> > > +++ b/hw/intc/xive-internal.h
> > > @@ -98,6 +98,7 @@ struct XIVE {
> > >  SysBusDevice parent;
> > >  
> > >  /* Properties */
> > > +uint32_t chip_id;
> > 
> > So there is a XIVE object per chip.  How does this work on PAPR?  One
> > logical chip/XIVE, or something more complex?
> 
> One global XIVE for PAPR. For the MMIOs, the way it works is that:
> 
>  - For MMIOs pertaining to a specific interrupt or queue, there's an H-
> call that will return the proper "guest physical" address. For qemu
> with KVM we'll have to probably create a single chunk of qemu address
> space (a single mem region) that contains individual pages mapped with
> MAP_FIXED originating from the different HW bits, we still need to sort
> out how exactly we'll do that in practice.
> 
>  - For the TIMA (the presentation MMIOs), those are always at the same
> physical address for everybody (so for a guest it's a single memory
> region we'll map to that physical address), the HW "knows" which HW
> thread is talking to it (and the hypervisor tells the HW which vcpu is
> running on a given HW thread at a given point in time). That address is
> obtained from the device-tree

Ok.  That leaves "chip_id" as a rather surprising thing to see in an
object which will appear on PAPR.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [Qemu-devel] [PATCH 3/3] qemu-iotests: require CONFIG_LINUX_AIO for test 087

2017-07-23 Thread Jing Liu

Hi Cleber,


On 2017/7/21 上午11:47, Cleber Rosa wrote:

One of the "sub-"tests of test 087 requires CONFIG_LINUX_AIO.

As a PoC/RFC, this goes the easy route and skips the test as a whole
when that feature is missing.  Other approaches include splitting
the test and adding extra filtering.

Signed-off-by: Cleber Rosa 
---
  tests/qemu-iotests/087 | 1 +
  1 file changed, 1 insertion(+)

diff --git a/tests/qemu-iotests/087 b/tests/qemu-iotests/087
index f8e4903..a2fb7de 100755
--- a/tests/qemu-iotests/087
+++ b/tests/qemu-iotests/087
@@ -34,6 +34,7 @@ status=1  # failure is the default!
  _supported_fmt qcow2
  _supported_proto file
  _supported_os Linux
+_require_feature CONFIG_LINUX_AIO

I tested that CONFIG_NETTLE_KDF is also a necessary for 087.

+_require_feature CONFIG_NETTLE_KDF




  function do_run_qemu()
  {





Re: [Qemu-devel] [PATCH] pc: acpi: force FADT rev1 for old i440fx machine types

2017-07-23 Thread Igor Mammedov
On Sat, 22 Jul 2017 02:40:46 +0300
"Michael S. Tsirkin"  wrote:

> On Fri, Jul 21, 2017 at 12:10:48PM +0200, Igor Mammedov wrote:
> > On Fri, 21 Jul 2017 10:49:55 +0100
> > "Daniel P. Berrange"  wrote:
> >   
> > > On Fri, Jul 21, 2017 at 11:32:11AM +0200, Igor Mammedov wrote:  
> > > > w2k used to boot on QEMU until we bumped revision of FADT to rev3
> > > > (commit 77af8a2b hw/i386: Use Rev3 FADT (ACPI 2.0) instead of Rev1 to 
> > > > improve guest OS support.)
> > > > 
> > > > Considering that w2k is ancient and long time EOLed, leave default
> > > > rev3 but make pc-i440fx-2.9 and older machine types to force rev1
> > > > so old setups won't break (w2k could boot).
> > > 
> > > There needs to be a machine type property added to control this
> > > feature. When provisioning new VMs, management apps need to be
> > > able to set the property explicitly - having them rely on picking
> > > particular machine type name+versions is not viable, because
> > > downstream vendors replace the machine types with their own
> > > names + versions.  
> > having property doesn't really help here and we don't do it for every
> > compat tweak /ex: save_tsc_khz, linuxboot_dma_enabled/.
> > 
> > Management would not benefit much from having property vs machine version
> > as it would have to encode somewhere that for w2k it should set
> > some machine property or pick a particular machine type.  
> 
> I think I'd disagree with that. If
> users might need this for compatibility with some guests,
> then it should be a property not just a machine type.
> 
> But see below - I think we rushed it for the PC anyway.
> 
> > Probably no one would worry about fixing virt-install or something
> > else for the sake of w2k and if they are going to fix it
> > it doesn't matter if they map machine type vs property.
> > 
> > Also with new machine type deprecation policy we would be able
> > easily to phase out rev1 support along with 2.9 machine,
> > but if you expose property then removing it would break
> > CLI not only for 2.9 but possible later machines if it's set there.
> > 
> > So I'm against adding properties/CLI options for unless we have to in this 
> > case,
> > and I'm not convinced that w2k deserves it.  
> 
> If I have to choose, I'd say Mac OSX is way less interesting than old
> windows versions. Lots of people have software that will only run on old
> windows and there's probably good money to be had running it on new
> hardware in VMs. And PC machine is all about compatibility - we have Q35
> for new stuff.  Besides OSX uses q35 anyway I think.
Question is for how long we are going to maintain legacy stuff,
ACPI spec periodically adds new stuff, which someday is going
to break legacy OSes. And maintaining 2 branches of or worse
a mix will cost us in time and future regressions, we need to
have some policy to cut off legacy features that hold us back,
like we are starting to do with machine types.
w2k has been EOLed in 2010 even if we drop its support now,
users still can use it as they have an option to use old QEMU
for that.
(Ladi said that w2k fails to install since 2.7).

> 
> So maybe the right thing to do is to
> - switch default for PC back to rev 1
> - keep default for Q35 at rev 3
> 
> No machinetype hacks.
it's still machine hack pc vs q35 and an extra branch to look
after but it's better than an option, I'll respin patch.

> 
> > > > 
> > > > Signed-off-by: Igor Mammedov 
> > > > ---
> > > > CC: Programmingkid 
> > > > CC: Phil Dennis-Jordan 
> > > > CC: "Michael S. Tsirkin" 
> > > > 
> > > > Only compile test since I don't have w2k to test with
> > > > 
> > > > ---
> > > >  include/hw/i386/pc.h |  1 +
> > > >  hw/i386/acpi-build.c | 26 +++---
> > > >  hw/i386/pc_piix.c|  2 ++
> > > >  3 files changed, 22 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> > > > index d80859b..d6f65dd 100644
> > > > --- a/include/hw/i386/pc.h
> > > > +++ b/include/hw/i386/pc.h
> > > > @@ -122,6 +122,7 @@ struct PCMachineClass {
> > > >  bool rsdp_in_ram;
> > > >  int legacy_acpi_table_size;
> > > >  unsigned acpi_data_size;
> > > > +bool force_rev1_fadt;
> > > >  
> > > >  /* SMBIOS compat: */
> > > >  bool smbios_defaults;
> > > > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > > > index 6b7bade..227f9ad 100644
> > > > --- a/hw/i386/acpi-build.c
> > > > +++ b/hw/i386/acpi-build.c
> > > > @@ -272,7 +272,7 @@ build_facs(GArray *table_data, BIOSLinker *linker)
> > > >  }
> > > >  
> > > >  /* Load chipset information in FADT */
> > > > -static void fadt_setup(AcpiFadtDescriptorRev3 *fadt, AcpiPmInfo *pm)
> > > > +static void fadt_setup(AcpiFadtDescriptorRev3 *fadt, AcpiPmInfo *pm, 
> > > > bool rev1)
> > > >  {
> > > >  fadt->model = 1;
> > > >  fadt->reserved1 = 0;
> > > > @@ -304,6 +304,9 @@ static void fadt_setup(AcpiFadtDescriptorRev3 
> > > > *fadt, AcpiPmInfo *pm)
> > > >  fadt->flags |= cpu_

Re: [Qemu-devel] [RFC PATCH 07/26] ppc/xive: add MMIO handlers to the XIVE interrupt source

2017-07-23 Thread Alexey Kardashevskiy
On 06/07/17 03:13, Cédric Le Goater wrote:
> Each interrupt source is associated with a 2-bit state machine called
> an Event State Buffer (ESB). It is controlled by MMIO to trigger
> events.
> 
> See code for more details on the states.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  hw/intc/xive.c| 230 
> ++
>  include/hw/ppc/xive.h |   3 +
>  2 files changed, 233 insertions(+)
> 
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 9ff14c0da595..816031b8ac81 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -32,6 +32,226 @@ static void xive_icp_irq(XiveICSState *xs, int lisn)
>  }
>  
>  /*
> + * "magic" Event State Buffer (ESB) MMIO offsets.
> + *
> + * Each interrupt source has a 2-bit state machine called ESB
> + * which can be controlled by MMIO. It's made of 2 bits, P and
> + * Q. P indicates that an interrupt is pending (has been sent
> + * to a queue and is waiting for an EOI). Q indicates that the
> + * interrupt has been triggered while pending.
> + *
> + * This acts as a coalescing mechanism in order to guarantee
> + * that a given interrupt only occurs at most once in a queue.
> + *
> + * When doing an EOI, the Q bit will indicate if the interrupt
> + * needs to be re-triggered.
> + *
> + * The following offsets into the ESB MMIO allow to read or
> + * manipulate the PQ bits. They must be used with an 8-bytes
> + * load instruction. They all return the previous state of the
> + * interrupt (atomically).
> + *
> + * Additionally, some ESB pages support doing an EOI via a
> + * store at 0 and some ESBs support doing a trigger via a
> + * separate trigger page.
> + */
> +#define XIVE_ESB_GET0x800
> +#define XIVE_ESB_SET_PQ_00  0xc00
> +#define XIVE_ESB_SET_PQ_01  0xd00
> +#define XIVE_ESB_SET_PQ_10  0xe00
> +#define XIVE_ESB_SET_PQ_11  0xf00
> +
> +#define XIVE_ESB_VAL_P  0x2
> +#define XIVE_ESB_VAL_Q  0x1


These are not used. I'd suggest defining the states below using these two.


> +
> +#define XIVE_ESB_RESET  0x0
> +#define XIVE_ESB_PENDING0x2
> +#define XIVE_ESB_QUEUED 0x3
> +#define XIVE_ESB_OFF0x1
> +
> +static uint8_t xive_pq_get(XIVE *x, uint32_t lisn)
> +{
> +uint32_t idx = lisn;
> +uint32_t byte = idx / 4;
> +uint32_t bit  = (idx % 4) * 2;
> +uint8_t* pqs = (uint8_t *) x->sbe;
> +
> +return (pqs[byte] >> bit) & 0x3;
> +}
> +
> +static void xive_pq_set(XIVE *x, uint32_t lisn, uint8_t pq)
> +{
> +uint32_t idx = lisn;
> +uint32_t byte = idx / 4;
> +uint32_t bit  = (idx % 4) * 2;
> +uint8_t* pqs = (uint8_t *) x->sbe;
> +
> +pqs[byte] &= ~(0x3 << bit);
> +pqs[byte] |= (pq & 0x3) << bit;
> +}
> +
> +static bool xive_pq_eoi(XIVE *x, uint32_t lisn)


Should not it return uint8_t as well (like xive_pq_get() does)? The value
than returned from .read() is uint64_t (a binary value).




-- 
Alexey



Re: [Qemu-devel] Improving QMP test coverage

2017-07-23 Thread Markus Armbruster
Cleber Rosa  writes:

> On 07/21/2017 11:33 AM, Stefan Hajnoczi wrote:
>>> Output testing style delegates checking ouput to diff.  I rather like it
>>> when text output is readily available.  It is when testing QMP.  A
>>> non-trivial example using this style could be useful, as discussing
>>> ideas tends to be more productive when they come with patches.
>> 
>> Yes, I was considering how many of the Python iotests could be rewritten
>> comfortably in shell.  It is nice when the test simply executes commands
>> and the output file shows the entire history of what happened.  Great
>> for debugging.
>> 
>> Stefan
>> 
> I'd like to have a better understanding of the major pain points here.
>
> Although this can be seen as a matter of taste, style preferences and
> even religion, I guess it's safe to say that Python can scale better
> than shell.  The upside of shell based tests is the "automatic" and
> complete logging, right?  Running "bash -x /path/to/test.sh" will give
> much more *useful* information than "python -v /path/to/test.py" will, fact.
>
> I believe this has to do with how *generic* Python code is written, and
> how builtin functions and most of the standard Python libraries work as
> they do.  Now, when writing code aimed at testing, making use of testing
> oriented libraries and tools, one would expect much more useful and
> readily available debug information.
>
> I'm biased, for sure, but that's what you get when you write basic tests
> using the Avocado libraries.  For instance, when using process.run()[1]
> within a test, you can choose to see its command output quite easily
> with a command such as "avocado --show=avocado.test.stdout run test.py".
>
> Using other custom logging channels is also trivial (for instance for
> specific QMP communication)[2][3].
>
> I wonder if such logging capabilities fill in the gap of what you
> describe as "[when the] output file shows the entire history of what
> happened".

Test code language is orthogonal to verification method (with code
vs. with diff).  Except verifying with shell code would be obviously
nuts[*].

The existing iotests written in Python verify with code, and the ones
written in shell verify with diff.  Doesn't mean that we have to port
from Python to shell to gain "verify with diff".

I don't doubt that featureful test frameworks like Avocado provide
adequate tools to debug tests.  The lure of the shell is its perceived
simplicity: everybody knows (well, should know) how to write and debug
simple shell scripts.  Of course, the simplicity evaporates when the
scripts grow beyond "simple".  Scare quotes, because what's simple for
Alice may not be so simple for Bob.

> BTW, I'll defer the discussion of using an external tool to check the
> output and determine test success/failure, because it is IMO a
> complementary topic, and I believe I understand its use cases.

Yes.  Regardless, I want to tell you *now* how tired of writing code to
verify test output I am.  Even of reading it.  Output text is easier to
read than code that tries to verify output text.  Diffs are easier to
read than stack backtraces.

> [1] -
> http://avocado-framework.readthedocs.io/en/52.0/api/utils/avocado.utils.html#avocado.utils.process.run
> [2] -
> http://avocado-framework.readthedocs.io/en/52.0/WritingTests.html#advanced-logging-capabilities
> [3] - https://www.youtube.com/watch?v=htUbOsh8MZI

[*] Very many nutty things are more obviously nuts in shell.  It's an
advantage of sorts ;)