Re: [PATCH] Warn user if the vga flag is passed but no vga device is created

2022-04-24 Thread Markus Armbruster
Gautam Agrawal  writes:

> This patch is in regards to this 
> issue:https://gitlab.com/qemu-project/qemu/-/issues/581#.
> A global boolean variable "vga_interface_created"(declared in 
> softmmu/globals.c)
> has been used to track the creation of vga interface. If the vga flag is 
> passed in the command
> line "default_vga"(declared in softmmu/vl.c) variable is set to 0. To warn 
> user, the condition
> checks if vga_interface_created is false and default_vga is equal to 0.
>
> The warning "No vga device is created" is logged if vga flag is passed
> but no vga device is created. This patch has been tested for
> x86_64, i386, sparc, sparc64 and arm boards.

Suggest to include a reproducer here, e.g.

$ qemu-system-x86_64 -S -display none -M microvm -vga std
qemu-system-x86_64: warning: No vga device is created

See below for my critique of the warning message.

>
> Signed-off-by: Gautam Agrawal 
> ---
>  hw/isa/isa-bus.c| 1 +
>  hw/pci/pci.c| 1 +
>  hw/sparc/sun4m.c| 2 ++
>  hw/sparc64/sun4u.c  | 1 +
>  include/sysemu/sysemu.h | 1 +
>  softmmu/globals.c   | 1 +
>  softmmu/vl.c| 3 +++
>  7 files changed, 10 insertions(+)
>
> diff --git a/hw/isa/isa-bus.c b/hw/isa/isa-bus.c
> index 0ad1c5fd65..cd5ad3687d 100644
> --- a/hw/isa/isa-bus.c
> +++ b/hw/isa/isa-bus.c
> @@ -166,6 +166,7 @@ bool isa_realize_and_unref(ISADevice *dev, ISABus *bus, 
> Error **errp)
>  
>  ISADevice *isa_vga_init(ISABus *bus)
>  {
> +vga_interface_created = true;
>  switch (vga_interface_type) {
>  case VGA_CIRRUS:
>  return isa_create_simple(bus, "isa-cirrus-vga");
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index dae9119bfe..fab9c80f8d 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2038,6 +2038,7 @@ PCIDevice *pci_nic_init_nofail(NICInfo *nd, PCIBus 
> *rootbus,
>  
>  PCIDevice *pci_vga_init(PCIBus *bus)
>  {
> +vga_interface_created = true;
>  switch (vga_interface_type) {
>  case VGA_CIRRUS:
>  return pci_create_simple(bus, -1, "cirrus-vga");
> diff --git a/hw/sparc/sun4m.c b/hw/sparc/sun4m.c
> index 7f3a7c0027..f45e29acc8 100644
> --- a/hw/sparc/sun4m.c
> +++ b/hw/sparc/sun4m.c
> @@ -921,6 +921,7 @@ static void sun4m_hw_init(MachineState *machine)
>  /* sbus irq 5 */
>  cg3_init(hwdef->tcx_base, slavio_irq[11], 0x0010,
>   graphic_width, graphic_height, graphic_depth);
> +vga_interface_created = true;
>  } else {
>  /* If no display specified, default to TCX */
>  if (graphic_depth != 8 && graphic_depth != 24) {
> @@ -936,6 +937,7 @@ static void sun4m_hw_init(MachineState *machine)
>  
>  tcx_init(hwdef->tcx_base, slavio_irq[11], 0x0010,
>   graphic_width, graphic_height, graphic_depth);
> +vga_interface_created = true;
>  }
>  }
>  
> diff --git a/hw/sparc64/sun4u.c b/hw/sparc64/sun4u.c
> index cda7df36e3..75334dba71 100644
> --- a/hw/sparc64/sun4u.c
> +++ b/hw/sparc64/sun4u.c
> @@ -633,6 +633,7 @@ static void sun4uv_init(MemoryRegion *address_space_mem,
>  switch (vga_interface_type) {
>  case VGA_STD:
>  pci_create_simple(pci_busA, PCI_DEVFN(2, 0), "VGA");
> +vga_interface_created = true;
>  break;
>  case VGA_NONE:
>  break;
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index b9421e03ff..a558b895e4 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -32,6 +32,7 @@ typedef enum {
>  } VGAInterfaceType;
>  
>  extern int vga_interface_type;
> +extern bool vga_interface_created;
>  
>  extern int graphic_width;
>  extern int graphic_height;
> diff --git a/softmmu/globals.c b/softmmu/globals.c
> index 3ebd718e35..1a5f8d42ad 100644
> --- a/softmmu/globals.c
> +++ b/softmmu/globals.c
> @@ -40,6 +40,7 @@ int nb_nics;
>  NICInfo nd_table[MAX_NICS];
>  int autostart = 1;
>  int vga_interface_type = VGA_NONE;
> +bool vga_interface_created = false;
>  Chardev *parallel_hds[MAX_PARALLEL_PORTS];
>  int win2k_install_hack;
>  int singlestep;
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 6f646531a0..cb79fa1f42 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -2734,6 +2734,9 @@ static void qemu_machine_creation_done(void)
>  if (foreach_device_config(DEV_GDB, gdbserver_start) < 0) {
>  exit(1);
>  }
> +if (!vga_interface_created && !default_vga) {
> +warn_report("No vga device is created");

True, but this leaves the user guessing why.

Pointing to the option would help:

qemu-system-x86_64: warning: -vga std: No vga device is created

To get this, use loc_save() to save the option's location along
@vga_model, then bracket the warn_report() with loc_push_restore() and
loc_pop().

The option to ask the board to create a video device is spelled -vga for
historical reasons.  Some of its arguments aren't VGA devices, e.g. tcx.
-help is phrased accordingly:


Re: [PATCH v5 0/3] util/thread-pool: Expose minimun and maximum size

2022-04-24 Thread Markus Armbruster
Nicolas Saenz Julienne  writes:

> As discussed on the previous RFC[1] the thread-pool's dynamic thread
> management doesn't play well with real-time and latency sensitive
> systems. This series introduces a set of controls that'll permit
> achieving more deterministic behaviours, for example by fixing the
> pool's size.
>
> We first introduce a new common interface to event loop configuration by
> moving iothread's already available properties into an abstract class
> called 'EventLooopBackend' and have both 'IOThread' and the newly
> created 'MainLoop' inherit the properties from that class.
>
> With this new configuration interface in place it's relatively simple to
> introduce new options to fix the even loop's thread pool sizes. The
> resulting QAPI looks like this:
>
> -object main-loop,id=main-loop,thread-pool-min=1,thread-pool-max=1
>
> Note that all patches are bisect friendly and pass all the tests.
>
> [1] 
> https://patchwork.ozlabs.org/project/qemu-devel/patch/20220202175234.656711-1-nsaen...@redhat.com/
>
> @Stefan I kept your Signed-off-by, since the changes trivial/not
> thread-pool related

With the doc nit in PATCH 2 addressed, QAPI schema
Acked-by: Markus Armbruster 




Re: [PATCH v5 2/3] util/main-loop: Introduce the main loop into QOM

2022-04-24 Thread Markus Armbruster
Nicolas Saenz Julienne  writes:

> 'event-loop-base' provides basic property handling for all 'AioContext'
> based event loops. So let's define a new 'MainLoopClass' that inherits
> from it. This will permit tweaking the main loop's properties through
> qapi as well as through the command line using the '-object' keyword[1].
> Only one instance of 'MainLoopClass' might be created at any time.
>
> 'EventLoopBaseClass' learns a new callback, 'can_be_deleted()' so as to
> mark 'MainLoop' as non-deletable.
>
> [1] For example:
>   -object main-loop,id=main-loop,aio-max-batch=
>
> Signed-off-by: Nicolas Saenz Julienne 
> Reviewed-by: Stefan Hajnoczi 

[...]

> diff --git a/qapi/qom.json b/qapi/qom.json
> index a2439533c5..51f3acaad8 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -540,6 +540,16 @@
>  '*poll-grow': 'int',
>  '*poll-shrink': 'int' } }
>  
> +##
> +# @MainLoopProperties:
> +#
> +# Properties for the main-loop object.
> +#

Please add

   # Since: 7.1

> +##
> +{ 'struct': 'MainLoopProperties',
> +  'base': 'EventLoopBaseProperties',
> +  'data': {} }
> +
>  ##
>  # @MemoryBackendProperties:
>  #
> @@ -830,6 +840,7 @@
>  { 'name': 'input-linux',
>'if': 'CONFIG_LINUX' },
>  'iothread',
> +'main-loop',
>  { 'name': 'memory-backend-epc',
>'if': 'CONFIG_LINUX' },
>  'memory-backend-file',
> @@ -895,6 +906,7 @@
>'input-linux':{ 'type': 'InputLinuxProperties',
>'if': 'CONFIG_LINUX' },
>'iothread':   'IothreadProperties',
> +  'main-loop':  'MainLoopProperties',
>'memory-backend-epc': { 'type': 'MemoryBackendEpcProperties',
>'if': 'CONFIG_LINUX' },
>'memory-backend-file':'MemoryBackendFileProperties',

[...]




[PATCH v8 5/5] hw/acpi/aml-build: Use existing CPU topology to build PPTT table

2022-04-24 Thread Gavin Shan
When the PPTT table is built, the CPU topology is re-calculated, but
it's unnecessary because the CPU topology has been populated in
virt_possible_cpu_arch_ids() on arm/virt machine.

This reworks build_pptt() to avoid that by reusing the existing IDs in
ms->possible_cpus. Currently, the only user of build_pptt() is
arm/virt machine.

Signed-off-by: Gavin Shan 
Tested-by: Yanan Wang 
Reviewed-by: Yanan Wang 
Acked-by: Igor Mammedov 
---
 hw/acpi/aml-build.c | 111 +++-
 1 file changed, 48 insertions(+), 63 deletions(-)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 4086879ebf..e6bfac95c7 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -2002,86 +2002,71 @@ void build_pptt(GArray *table_data, BIOSLinker *linker, 
MachineState *ms,
 const char *oem_id, const char *oem_table_id)
 {
 MachineClass *mc = MACHINE_GET_CLASS(ms);
-GQueue *list = g_queue_new();
-guint pptt_start = table_data->len;
-guint parent_offset;
-guint length, i;
-int uid = 0;
-int socket;
+CPUArchIdList *cpus = ms->possible_cpus;
+int64_t socket_id = -1, cluster_id = -1, core_id = -1;
+uint32_t socket_offset = 0, cluster_offset = 0, core_offset = 0;
+uint32_t pptt_start = table_data->len;
+int n;
 AcpiTable table = { .sig = "PPTT", .rev = 2,
 .oem_id = oem_id, .oem_table_id = oem_table_id };
 
 acpi_table_begin(, table_data);
 
-for (socket = 0; socket < ms->smp.sockets; socket++) {
-g_queue_push_tail(list,
-GUINT_TO_POINTER(table_data->len - pptt_start));
-build_processor_hierarchy_node(
-table_data,
-/*
- * Physical package - represents the boundary
- * of a physical package
- */
-(1 << 0),
-0, socket, NULL, 0);
-}
-
-if (mc->smp_props.clusters_supported) {
-length = g_queue_get_length(list);
-for (i = 0; i < length; i++) {
-int cluster;
-
-parent_offset = GPOINTER_TO_UINT(g_queue_pop_head(list));
-for (cluster = 0; cluster < ms->smp.clusters; cluster++) {
-g_queue_push_tail(list,
-GUINT_TO_POINTER(table_data->len - pptt_start));
-build_processor_hierarchy_node(
-table_data,
-(0 << 0), /* not a physical package */
-parent_offset, cluster, NULL, 0);
-}
+/*
+ * This works with the assumption that cpus[n].props.*_id has been
+ * sorted from top to down levels in mc->possible_cpu_arch_ids().
+ * Otherwise, the unexpected and duplicated containers will be
+ * created.
+ */
+for (n = 0; n < cpus->len; n++) {
+if (cpus->cpus[n].props.socket_id != socket_id) {
+assert(cpus->cpus[n].props.socket_id > socket_id);
+socket_id = cpus->cpus[n].props.socket_id;
+cluster_id = -1;
+core_id = -1;
+socket_offset = table_data->len - pptt_start;
+build_processor_hierarchy_node(table_data,
+(1 << 0), /* Physical package */
+0, socket_id, NULL, 0);
 }
-}
 
-length = g_queue_get_length(list);
-for (i = 0; i < length; i++) {
-int core;
-
-parent_offset = GPOINTER_TO_UINT(g_queue_pop_head(list));
-for (core = 0; core < ms->smp.cores; core++) {
-if (ms->smp.threads > 1) {
-g_queue_push_tail(list,
-GUINT_TO_POINTER(table_data->len - pptt_start));
-build_processor_hierarchy_node(
-table_data,
-(0 << 0), /* not a physical package */
-parent_offset, core, NULL, 0);
-} else {
-build_processor_hierarchy_node(
-table_data,
-(1 << 1) | /* ACPI Processor ID valid */
-(1 << 3),  /* Node is a Leaf */
-parent_offset, uid++, NULL, 0);
+if (mc->smp_props.clusters_supported) {
+if (cpus->cpus[n].props.cluster_id != cluster_id) {
+assert(cpus->cpus[n].props.cluster_id > cluster_id);
+cluster_id = cpus->cpus[n].props.cluster_id;
+core_id = -1;
+cluster_offset = table_data->len - pptt_start;
+build_processor_hierarchy_node(table_data,
+(0 << 0), /* Not a physical package */
+socket_offset, cluster_id, NULL, 0);
 }
+} else {
+cluster_offset = socket_offset;
 }
-}
 
-length = g_queue_get_length(list);
-for (i = 0; i < length; i++) {
-int thread;
+if (ms->smp.threads == 1) {
+build_processor_hierarchy_node(table_data,
+(1 << 1) | /* ACPI Processor ID valid */
+(1 << 3),  /* Node is 

[PATCH v8 4/5] hw/arm/virt: Fix CPU's default NUMA node ID

2022-04-24 Thread Gavin Shan
When CPU-to-NUMA association isn't explicitly provided by users,
the default one is given by mc->get_default_cpu_node_id(). However,
the CPU topology isn't fully considered in the default association
and this causes CPU topology broken warnings on booting Linux guest.

For example, the following warning messages are observed when the
Linux guest is booted with the following command lines.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
  -accel kvm -machine virt,gic-version=host   \
  -cpu host   \
  -smp 6,sockets=2,cores=3,threads=1  \
  -m 1024M,slots=16,maxmem=64G\
  -object memory-backend-ram,id=mem0,size=128M\
  -object memory-backend-ram,id=mem1,size=128M\
  -object memory-backend-ram,id=mem2,size=128M\
  -object memory-backend-ram,id=mem3,size=128M\
  -object memory-backend-ram,id=mem4,size=128M\
  -object memory-backend-ram,id=mem4,size=384M\
  -numa node,nodeid=0,memdev=mem0 \
  -numa node,nodeid=1,memdev=mem1 \
  -numa node,nodeid=2,memdev=mem2 \
  -numa node,nodeid=3,memdev=mem3 \
  -numa node,nodeid=4,memdev=mem4 \
  -numa node,nodeid=5,memdev=mem5
 :
  alternatives: patching kernel code
  BUG: arch topology borken
  the CLS domain not a subset of the MC domain
  
  BUG: arch topology borken
  the DIE domain not a subset of the NODE domain

With current implementation of mc->get_default_cpu_node_id(),
CPU#0 to CPU#5 are associated with NODE#0 to NODE#5 separately.
That's incorrect because CPU#0/1/2 should be associated with same
NUMA node because they're seated in same socket.

This fixes the issue by considering the socket ID when the default
CPU-to-NUMA association is provided in virt_possible_cpu_arch_ids().
With this applied, no more CPU topology broken warnings are seen
from the Linux guest. The 6 CPUs are associated with NODE#0/1, but
there are no CPUs associated with NODE#2/3/4/5.

Signed-off-by: Gavin Shan 
Reviewed-by: Igor Mammedov 
Reviewed-by: Yanan Wang 
---
 hw/arm/virt.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 0fd7f9a6a1..091054662c 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2552,7 +2552,9 @@ virt_cpu_index_to_props(MachineState *ms, unsigned 
cpu_index)
 
 static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
 {
-return idx % ms->numa_state->num_nodes;
+int64_t socket_id = ms->possible_cpus->cpus[idx].props.socket_id;
+
+return socket_id % ms->numa_state->num_nodes;
 }
 
 static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
-- 
2.23.0




[PATCH v8 3/5] hw/arm/virt: Consider SMP configuration in CPU topology

2022-04-24 Thread Gavin Shan
Currently, the SMP configuration isn't considered when the CPU
topology is populated. In this case, it's impossible to provide
the default CPU-to-NUMA mapping or association based on the socket
ID of the given CPU.

This takes account of SMP configuration when the CPU topology
is populated. The die ID for the given CPU isn't assigned since
it's not supported on arm/virt machine. Besides, the used SMP
configuration in qtest/numa-test/aarch64_numa_cpu() is corrected
to avoid test failure.

Signed-off-by: Gavin Shan 
Reviewed-by: Yanan Wang 
---
 hw/arm/virt.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 5bdd98e4a1..0fd7f9a6a1 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2560,6 +2560,7 @@ static const CPUArchIdList 
*virt_possible_cpu_arch_ids(MachineState *ms)
 int n;
 unsigned int max_cpus = ms->smp.max_cpus;
 VirtMachineState *vms = VIRT_MACHINE(ms);
+MachineClass *mc = MACHINE_GET_CLASS(vms);
 
 if (ms->possible_cpus) {
 assert(ms->possible_cpus->len == max_cpus);
@@ -2573,8 +2574,20 @@ static const CPUArchIdList 
*virt_possible_cpu_arch_ids(MachineState *ms)
 ms->possible_cpus->cpus[n].type = ms->cpu_type;
 ms->possible_cpus->cpus[n].arch_id =
 virt_cpu_mp_affinity(vms, n);
+
+assert(!mc->smp_props.dies_supported);
+ms->possible_cpus->cpus[n].props.has_socket_id = true;
+ms->possible_cpus->cpus[n].props.socket_id =
+n / (ms->smp.clusters * ms->smp.cores * ms->smp.threads);
+ms->possible_cpus->cpus[n].props.has_cluster_id = true;
+ms->possible_cpus->cpus[n].props.cluster_id =
+(n / (ms->smp.cores * ms->smp.threads)) % ms->smp.clusters;
+ms->possible_cpus->cpus[n].props.has_core_id = true;
+ms->possible_cpus->cpus[n].props.core_id =
+(n / ms->smp.threads) % ms->smp.cores;
 ms->possible_cpus->cpus[n].props.has_thread_id = true;
-ms->possible_cpus->cpus[n].props.thread_id = n;
+ms->possible_cpus->cpus[n].props.thread_id =
+n % ms->smp.threads;
 }
 return ms->possible_cpus;
 }
-- 
2.23.0




[PATCH v8 2/5] qtest/numa-test: Specify CPU topology in aarch64_numa_cpu()

2022-04-24 Thread Gavin Shan
The CPU topology isn't enabled on arm/virt machine yet, but we're
going to do it in next patch. After the CPU topology is enabled by
next patch, "thread-id=1" becomes invalid because the CPU core is
preferred on arm/virt machine. It means these two CPUs have 0/1
as their core IDs, but their thread IDs are all 0. It will trigger
test failure as the following message indicates:

  [14/21 qemu:qtest+qtest-aarch64 / qtest-aarch64/numa-test  ERROR
  1.48s   killed by signal 6 SIGABRT
  >>> 
G_TEST_DBUS_DAEMON=/home/gavin/sandbox/qemu.main/tests/dbus-vmstate-daemon.sh \
  QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon 
\
  QTEST_QEMU_BINARY=./qemu-system-aarch64   
\
  QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=83  
\
  /home/gavin/sandbox/qemu.main/build/tests/qtest/numa-test --tap -k
  ――
  stderr:
  qemu-system-aarch64: -numa cpu,node-id=0,thread-id=1: no match found

This fixes the issue by providing comprehensive SMP configurations
in aarch64_numa_cpu(). The SMP configurations aren't used before
the CPU topology is enabled in next patch.

Signed-off-by: Gavin Shan 
Reviewed-by: Yanan Wang 
---
 tests/qtest/numa-test.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/numa-test.c b/tests/qtest/numa-test.c
index 90bf68a5b3..aeda8c774c 100644
--- a/tests/qtest/numa-test.c
+++ b/tests/qtest/numa-test.c
@@ -223,7 +223,8 @@ static void aarch64_numa_cpu(const void *data)
 QTestState *qts;
 g_autofree char *cli = NULL;
 
-cli = make_cli(data, "-machine smp.cpus=2 "
+cli = make_cli(data, "-machine "
+"smp.cpus=2,smp.sockets=1,smp.clusters=1,smp.cores=1,smp.threads=2 "
 "-numa node,nodeid=0,memdev=ram -numa node,nodeid=1 "
 "-numa cpu,node-id=1,thread-id=0 "
 "-numa cpu,node-id=0,thread-id=1");
-- 
2.23.0




[PATCH v8 1/5] qapi/machine.json: Add cluster-id

2022-04-24 Thread Gavin Shan
This adds cluster-id in CPU instance properties, which will be used
by arm/virt machine. Besides, the cluster-id is also verified or
dumped in various spots:

  * hw/core/machine.c::machine_set_cpu_numa_node() to associate
CPU with its NUMA node.

  * hw/core/machine.c::machine_numa_finish_cpu_init() to record
CPU slots with no NUMA mapping set.

  * hw/core/machine-hmp-cmds.c::hmp_hotpluggable_cpus() to dump
cluster-id.

Signed-off-by: Gavin Shan 
Reviewed-by: Yanan Wang 
---
 hw/core/machine-hmp-cmds.c |  4 
 hw/core/machine.c  | 16 
 qapi/machine.json  |  6 --
 3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c
index 4e2f319aeb..5cb5eecbfc 100644
--- a/hw/core/machine-hmp-cmds.c
+++ b/hw/core/machine-hmp-cmds.c
@@ -77,6 +77,10 @@ void hmp_hotpluggable_cpus(Monitor *mon, const QDict *qdict)
 if (c->has_die_id) {
 monitor_printf(mon, "die-id: \"%" PRIu64 "\"\n", c->die_id);
 }
+if (c->has_cluster_id) {
+monitor_printf(mon, "cluster-id: \"%" PRIu64 "\"\n",
+   c->cluster_id);
+}
 if (c->has_core_id) {
 monitor_printf(mon, "core-id: \"%" PRIu64 "\"\n", c->core_id);
 }
diff --git a/hw/core/machine.c b/hw/core/machine.c
index cb9bbc844d..700c1e76b8 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -682,6 +682,11 @@ void machine_set_cpu_numa_node(MachineState *machine,
 return;
 }
 
+if (props->has_cluster_id && !slot->props.has_cluster_id) {
+error_setg(errp, "cluster-id is not supported");
+return;
+}
+
 if (props->has_socket_id && !slot->props.has_socket_id) {
 error_setg(errp, "socket-id is not supported");
 return;
@@ -701,6 +706,11 @@ void machine_set_cpu_numa_node(MachineState *machine,
 continue;
 }
 
+if (props->has_cluster_id &&
+props->cluster_id != slot->props.cluster_id) {
+continue;
+}
+
 if (props->has_die_id && props->die_id != slot->props.die_id) {
 continue;
 }
@@ -995,6 +1005,12 @@ static char *cpu_slot_to_string(const CPUArchId *cpu)
 }
 g_string_append_printf(s, "die-id: %"PRId64, cpu->props.die_id);
 }
+if (cpu->props.has_cluster_id) {
+if (s->len) {
+g_string_append_printf(s, ", ");
+}
+g_string_append_printf(s, "cluster-id: %"PRId64, 
cpu->props.cluster_id);
+}
 if (cpu->props.has_core_id) {
 if (s->len) {
 g_string_append_printf(s, ", ");
diff --git a/qapi/machine.json b/qapi/machine.json
index d25a481ce4..4c417e32a5 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -868,10 +868,11 @@
 # @node-id: NUMA node ID the CPU belongs to
 # @socket-id: socket number within node/board the CPU belongs to
 # @die-id: die number within socket the CPU belongs to (since 4.1)
-# @core-id: core number within die the CPU belongs to
+# @cluster-id: cluster number within die the CPU belongs to (since 7.1)
+# @core-id: core number within cluster the CPU belongs to
 # @thread-id: thread number within core the CPU belongs to
 #
-# Note: currently there are 5 properties that could be present
+# Note: currently there are 6 properties that could be present
 #   but management should be prepared to pass through other
 #   properties with device_add command to allow for future
 #   interface extension. This also requires the filed names to be kept in
@@ -883,6 +884,7 @@
   'data': { '*node-id': 'int',
 '*socket-id': 'int',
 '*die-id': 'int',
+'*cluster-id': 'int',
 '*core-id': 'int',
 '*thread-id': 'int'
   }
-- 
2.23.0




[PATCH v8 0/5] hw/arm/virt: Fix CPU's default NUMA node ID

2022-04-24 Thread Gavin Shan
When the CPU-to-NUMA association isn't provided by user, the default NUMA
node ID for the specific CPU is returned from virt_get_default_cpu_node_id().
Unfortunately, the default NUMA node ID breaks socket boundary and leads to
the broken CPU topology warning message in Linux guest. This series intends
to fix the issue.

  PATCH[1/5] Add cluster-id to CPU instance property
  PATCH[2/5] Fixes test failure in qtest/numa-test/aarch64_numa_cpu()
  PATCH[3/5] Uses SMP configuration to populate CPU topology
  PATCH[4/5] Fixes the broken CPU topology by considering the socket boundary
 when the default NUMA node ID is given
  PATCH[5/5] Uses the populated CPU topology to build PPTT table, instead of
 calculate it again

Changelog
=
v8:
   * Separate PATCH[v8 2/5] to fix test failure in qtest/
 numa-test/aarch64_numa_cpu()   (Igor)
   * Improvements to coding style, changelog and comments (Yanan)
v6/v7:
   * Fixed description for 'cluster-id' and 'core-id'   (Yanan)
   * Remove '% ms->smp.sockets' in socket ID calculation(Yanan)
   * Fixed tests/qtest/numa-test/aarch64_numa_cpu() (Yanan)
   * Initialized offset variables in build_pptt()   (Jonathan)
   * Added comments about the expected and sorted layout of
 cpus[n].props.*_id and assert() on the exceptional cases   (Igor)
v4/v5:
   * Split PATCH[v3 1/3] to PATCH[v5 1/4] and PATCH[v5 2/4].
 Verify or dump 'cluster-id' in various spots   (Yanan)
   * s/within cluster/within cluster\/die/ for 'core-id' in
 qapi/machine.json  (Igor)
   * Apply '% ms->smp.{sockets, clusters, cores, threads} in
 virt_possible_cpu_arch_ids() as x86 does   (Igor)
   * Use [0 - possible_cpus->len] as ACPI processor UID to
 build PPTT table and PATCH[v3 4/4] is dropped  (Igor)
   * Simplified build_pptt() to add all entries in one loop
 on ms->possible_cpus   (Igor)
v3:
   * Split PATCH[v2 1/3] to PATCH[v3 1/4] and PATCH[v3 2/4] (Yanan)
   * Don't take account of die ID in CPU topology population
 and added assert(!mc->smp_props.dies_supported)(Yanan/Igor)
   * Assign cluster_id and use it when building PPTT table  (Yanan/Igor)
v2:
   * Populate the CPU topology in virt_possible_cpu_arch_ids()
 so that it can be reused in virt_get_default_cpu_node_id() (Igor)
   * Added PATCH[2/3] to use the existing CPU topology when the
 PPTT table is built(Igor)
   * Added PATCH[3/3] to take thread ID as ACPI processor ID
 in MADT and SRAT table (Gavin)

Gavin Shan (5):
  qapi/machine.json: Add cluster-id
  qtest/numa-test: Specify CPU topology in aarch64_numa_cpu()
  hw/arm/virt: Consider SMP configuration in CPU topology
  hw/arm/virt: Fix CPU's default NUMA node ID
  hw/acpi/aml-build: Use existing CPU topology to build PPTT table

 hw/acpi/aml-build.c| 111 -
 hw/arm/virt.c  |  19 ++-
 hw/core/machine-hmp-cmds.c |   4 ++
 hw/core/machine.c  |  16 ++
 qapi/machine.json  |   6 +-
 tests/qtest/numa-test.c|   3 +-
 6 files changed, 91 insertions(+), 68 deletions(-)

-- 
2.23.0




答复: [PATCH] hw/sd/sdhci: Block Size Register bits [14:12] is lost

2022-04-24 Thread Gao, Lu
ping

https://patchew.org/QEMU/20220321055618.4026-1-lu@verisilicon.com/

Please help review the patch.
Thanks.
B.R.

-邮件原件-
发件人: Gao, Lu 
发送时间: Monday, March 21, 2022 1:56 PM
收件人: qemu-devel@nongnu.org
抄送: Gao, Lu; Wen, Jianxian; Philippe Mathieu-Daudé; Bin Meng; open list:SD 
(Secure Card)
主题: [PATCH] hw/sd/sdhci: Block Size Register bits [14:12] is lost

Block Size Register bits [14:12] is SDMA Buffer Boundary, it is missed
in register write, but it is needed in SDMA transfer. e.g. it will be
used in sdhci_sdma_transfer_multi_blocks to calculate boundary_ variables.

Missing this field will cause wrong operation for different SDMA Buffer
Boundary settings.

Signed-off-by: Lu Gao 
Signed-off-by: Jianxian Wen 
---
 hw/sd/sdhci.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/hw/sd/sdhci.c b/hw/sd/sdhci.c
index e0bbc90344..350ceb487d 100644
--- a/hw/sd/sdhci.c
+++ b/hw/sd/sdhci.c
@@ -321,6 +321,8 @@ static void sdhci_poweron_reset(DeviceState *dev)
 
 static void sdhci_data_transfer(void *opaque);
 
+#define BLOCK_SIZE_MASK (4 * KiB - 1)
+
 static void sdhci_send_command(SDHCIState *s)
 {
 SDRequest request;
@@ -371,7 +373,8 @@ static void sdhci_send_command(SDHCIState *s)
 
 sdhci_update_irq(s);
 
-if (!timeout && s->blksize && (s->cmdreg & SDHC_CMD_DATA_PRESENT)) {
+if (!timeout && (s->blksize & BLOCK_SIZE_MASK) &&
+(s->cmdreg & SDHC_CMD_DATA_PRESENT)) {
 s->data_count = 0;
 sdhci_data_transfer(s);
 }
@@ -406,7 +409,6 @@ static void sdhci_end_transfer(SDHCIState *s)
 /*
  * Programmed i/o data transfer
  */
-#define BLOCK_SIZE_MASK (4 * KiB - 1)
 
 /* Fill host controller's read buffer with BLKSIZE bytes of data from card */
 static void sdhci_read_block_from_card(SDHCIState *s)
@@ -1137,7 +1139,8 @@ sdhci_write(void *opaque, hwaddr offset, uint64_t val, 
unsigned size)
 s->sdmasysad = (s->sdmasysad & mask) | value;
 MASKED_WRITE(s->sdmasysad, mask, value);
 /* Writing to last byte of sdmasysad might trigger transfer */
-if (!(mask & 0xFF00) && s->blkcnt && s->blksize &&
+if (!(mask & 0xFF00) && s->blkcnt &&
+(s->blksize & BLOCK_SIZE_MASK) &&
 SDHC_DMA_TYPE(s->hostctl1) == SDHC_CTRL_SDMA) {
 if (s->trnmod & SDHC_TRNS_MULTI) {
 sdhci_sdma_transfer_multi_blocks(s);
@@ -1151,7 +1154,11 @@ sdhci_write(void *opaque, hwaddr offset, uint64_t val, 
unsigned size)
 if (!TRANSFERRING_DATA(s->prnsts)) {
 uint16_t blksize = s->blksize;
 
-MASKED_WRITE(s->blksize, mask, extract32(value, 0, 12));
+/*
+ * [14:12] SDMA Buffer Boundary
+ * [11:00] Transfer Block Size
+ */
+MASKED_WRITE(s->blksize, mask, extract32(value, 0, 15));
 MASKED_WRITE(s->blkcnt, mask >> 16, value >> 16);
 
 /* Limit block size to the maximum buffer size */
-- 
2.17.1



[PATCH v2 19/42] i386: Rewrite blendv helpers

2022-04-24 Thread Paul Brook
Rewrite the blendv helpers so that they can easily be extended to support
the AVX encodings, which make all 4 arguments explicit.

No functional changes to the existing helpers

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h | 119 +-
 1 file changed, 60 insertions(+), 59 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 3202c00572..9f388b02b9 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -2141,73 +2141,74 @@ void glue(helper_palignr, SUFFIX)(CPUX86State *env, Reg 
*d, Reg *s,
 }
 }
 
-#define XMM0 (env->xmm_regs[0])
+#if SHIFT >= 1
+
+#define BLEND_V128(elem, num, F, b) do {\
+d->elem(b + 0) = F(v->elem(b + 0), s->elem(b + 0), m->elem(b + 0)); \
+d->elem(b + 1) = F(v->elem(b + 1), s->elem(b + 1), m->elem(b + 1)); \
+if (num > 2) {  \
+d->elem(b + 2) = F(v->elem(b + 2), s->elem(b + 2), m->elem(b + 2)); \
+d->elem(b + 3) = F(v->elem(b + 3), s->elem(b + 3), m->elem(b + 3)); \
+}   \
+if (num > 4) {  \
+d->elem(b + 4) = F(v->elem(b + 4), s->elem(b + 4), m->elem(b + 4)); \
+d->elem(b + 5) = F(v->elem(b + 5), s->elem(b + 5), m->elem(b + 5)); \
+d->elem(b + 6) = F(v->elem(b + 6), s->elem(b + 6), m->elem(b + 6)); \
+d->elem(b + 7) = F(v->elem(b + 7), s->elem(b + 7), m->elem(b + 7)); \
+}   \
+if (num > 8) {  \
+d->elem(b + 8) = F(v->elem(b + 8), s->elem(b + 8), m->elem(b + 8)); \
+d->elem(b + 9) = F(v->elem(b + 9), s->elem(b + 9), m->elem(b + 9)); \
+d->elem(b + 10) = F(v->elem(b + 10), s->elem(b + 10), m->elem(b + 
10));\
+d->elem(b + 11) = F(v->elem(b + 11), s->elem(b + 11), m->elem(b + 
11));\
+d->elem(b + 12) = F(v->elem(b + 12), s->elem(b + 12), m->elem(b + 
12));\
+d->elem(b + 13) = F(v->elem(b + 13), s->elem(b + 13), m->elem(b + 
13));\
+d->elem(b + 14) = F(v->elem(b + 14), s->elem(b + 14), m->elem(b + 
14));\
+d->elem(b + 15) = F(v->elem(b + 15), s->elem(b + 15), m->elem(b + 
15));\
+}   \
+} while (0)
 
-#if SHIFT == 1
 #define SSE_HELPER_V(name, elem, num, F)\
-void glue(name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)   \
+void glue(name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)   \
 {   \
-d->elem(0) = F(d->elem(0), s->elem(0), XMM0.elem(0));   \
-d->elem(1) = F(d->elem(1), s->elem(1), XMM0.elem(1));   \
-if (num > 2) {  \
-d->elem(2) = F(d->elem(2), s->elem(2), XMM0.elem(2));   \
-d->elem(3) = F(d->elem(3), s->elem(3), XMM0.elem(3));   \
-if (num > 4) {  \
-d->elem(4) = F(d->elem(4), s->elem(4), XMM0.elem(4));   \
-d->elem(5) = F(d->elem(5), s->elem(5), XMM0.elem(5));   \
-d->elem(6) = F(d->elem(6), s->elem(6), XMM0.elem(6));   \
-d->elem(7) = F(d->elem(7), s->elem(7), XMM0.elem(7));   \
-if (num > 8) {  \
-d->elem(8) = F(d->elem(8), s->elem(8), XMM0.elem(8)); \
-d->elem(9) = F(d->elem(9), s->elem(9), XMM0.elem(9)); \
-d->elem(10) = F(d->elem(10), s->elem(10), XMM0.elem(10)); \
-d->elem(11) = F(d->elem(11), s->elem(11), XMM0.elem(11)); \
-d->elem(12) = F(d->elem(12), s->elem(12), XMM0.elem(12)); \
-d->elem(13) = F(d->elem(13), s->elem(13), XMM0.elem(13)); \
-d->elem(14) = F(d->elem(14), s->elem(14), XMM0.elem(14)); \
-d->elem(15) = F(d->elem(15), s->elem(15), XMM0.elem(15)); \
-}   \
-}   \
-}   \
-}
+Reg *v = d; \
+Reg *m = >xmm_regs[0]; \
+BLEND_V128(elem, num, F, 0);\
+YMM_ONLY(BLEND_V128(elem, num, F, num);)\
+}
+
+#define BLEND_I128(elem, num, F, b) do {\
+d->elem(b + 0) = F(v->elem(b + 0), s->elem(b + 0), ((imm >> 0) & 1));   \
+d->elem(b + 1) = F(v->elem(b + 

[PATCH v2 31/42] i386: Implement AVX variable shifts

2022-04-24 Thread Paul Brook
These use the W bit to encode the operand width, but otherwise fairly
straightforward.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 17 +
 target/i386/ops_sse_header.h |  6 ++
 target/i386/tcg/translate.c  | 17 +
 3 files changed, 40 insertions(+)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 9b92b9790a..8f2bd48394 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -3195,6 +3195,23 @@ void glue(helper_vpermilps_imm, SUFFIX)(CPUX86State *env,
 #endif
 }
 
+#if SHIFT == 1
+#define FPSRLVD(x, c) (c < 32 ? ((x) >> c) : 0)
+#define FPSRLVQ(x, c) (c < 64 ? ((x) >> c) : 0)
+#define FPSRAVD(x, c) ((int32_t)(x) >> (c < 64 ? c : 31))
+#define FPSRAVQ(x, c) ((int64_t)(x) >> (c < 64 ? c : 63))
+#define FPSLLVD(x, c) (c < 32 ? ((x) << c) : 0)
+#define FPSLLVQ(x, c) (c < 64 ? ((x) << c) : 0)
+#endif
+
+SSE_HELPER_L(helper_vpsrlvd, FPSRLVD)
+SSE_HELPER_L(helper_vpsravd, FPSRAVD)
+SSE_HELPER_L(helper_vpsllvd, FPSLLVD)
+
+SSE_HELPER_Q(helper_vpsrlvq, FPSRLVQ)
+SSE_HELPER_Q(helper_vpsravq, FPSRAVQ)
+SSE_HELPER_Q(helper_vpsllvq, FPSLLVQ)
+
 #if SHIFT == 2
 void glue(helper_vbroadcastdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index c52169a030..20db6c4240 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -421,6 +421,12 @@ DEF_HELPER_4(glue(vpermilpd, SUFFIX), void, env, Reg, Reg, 
Reg)
 DEF_HELPER_4(glue(vpermilps, SUFFIX), void, env, Reg, Reg, Reg)
 DEF_HELPER_4(glue(vpermilpd_imm, SUFFIX), void, env, Reg, Reg, i32)
 DEF_HELPER_4(glue(vpermilps_imm, SUFFIX), void, env, Reg, Reg, i32)
+DEF_HELPER_4(glue(vpsrlvd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpsravd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpsllvd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpsrlvq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpsravq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpsllvq, SUFFIX), void, env, Reg, Reg, Reg)
 #if SHIFT == 2
 DEF_HELPER_3(glue(vbroadcastdq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_1(vzeroall, void, env)
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 358c3ecb0b..4990470083 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3293,6 +3293,9 @@ static const struct SSEOpHelper_table6 sse_op_table6[256] 
= {
 [0x40] = BINARY_OP(pmulld, SSE41, SSE_OPF_MMX),
 #define gen_helper_phminposuw_ymm NULL
 [0x41] = UNARY_OP(phminposuw, SSE41, 0),
+[0x45] = BINARY_OP(vpsrlvd, AVX, SSE_OPF_AVX2),
+[0x46] = BINARY_OP(vpsravd, AVX, SSE_OPF_AVX2),
+[0x47] = BINARY_OP(vpsllvd, AVX, SSE_OPF_AVX2),
 /* vpbroadcastd */
 [0x58] = UNARY_OP(vbroadcastl, AVX, SSE_OPF_SCALAR | SSE_OPF_MMX),
 /* vpbroadcastq */
@@ -3357,6 +3360,15 @@ static const struct SSEOpHelper_table7 
sse_op_table7[256] = {
 #undef BLENDV_OP
 #undef SPECIAL_OP
 
+#define SSE_OP(name) \
+{gen_helper_ ## name ##_xmm, gen_helper_ ## name ##_ymm}
+static const SSEFunc_0_eppp sse_op_table8[3][2] = {
+SSE_OP(vpsrlvq),
+SSE_OP(vpsravq),
+SSE_OP(vpsllvq),
+};
+#undef SSE_OP
+
 /* VEX prefix not allowed */
 #define CHECK_NO_VEX(s) do { \
 if (s->prefix & PREFIX_VEX) \
@@ -4439,6 +4451,11 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 tcg_temp_free_ptr(mask);
 } else {
 SSEFunc_0_eppp fn = op6.fn[b1].op2;
+if (REX_W(s)) {
+if (b >= 0x45 && b <= 0x47) {
+fn = sse_op_table8[b - 0x45][b1 - 1];
+}
+}
 fn(cpu_env, s->ptr0, s->ptr2, s->ptr1);
 }
 }
-- 
2.36.0




[PATCH v2 12/42] i386: Misc integer AVX helper prep

2022-04-24 Thread Paul Brook
More preparatory work for AVX support in various integer vector helpers

No functional changes to existing helpers.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h | 133 +-
 1 file changed, 104 insertions(+), 29 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index bb9cbf9ead..d0424140d9 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -557,19 +557,25 @@ SSE_HELPER_W(helper_pavgw, FAVG)
 
 void glue(helper_pmuludq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
-d->Q(0) = (uint64_t)s->L(0) * (uint64_t)d->L(0);
-#if SHIFT == 1
-d->Q(1) = (uint64_t)s->L(2) * (uint64_t)d->L(2);
+Reg *v = d;
+d->Q(0) = (uint64_t)s->L(0) * (uint64_t)v->L(0);
+#if SHIFT >= 1
+d->Q(1) = (uint64_t)s->L(2) * (uint64_t)v->L(2);
+#if SHIFT == 2
+d->Q(2) = (uint64_t)s->L(4) * (uint64_t)v->L(4);
+d->Q(3) = (uint64_t)s->L(6) * (uint64_t)v->L(6);
+#endif
 #endif
 }
 
 void glue(helper_pmaddwd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
+Reg *v = d;
 int i;
 
 for (i = 0; i < (2 << SHIFT); i++) {
-d->L(i) = (int16_t)s->W(2 * i) * (int16_t)d->W(2 * i) +
-(int16_t)s->W(2 * i + 1) * (int16_t)d->W(2 * i + 1);
+d->L(i) = (int16_t)s->W(2 * i) * (int16_t)v->W(2 * i) +
+(int16_t)s->W(2 * i + 1) * (int16_t)v->W(2 * i + 1);
 }
 }
 
@@ -583,31 +589,55 @@ static inline int abs1(int a)
 }
 }
 #endif
+
 void glue(helper_psadbw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
+Reg *v = d;
 unsigned int val;
 
 val = 0;
-val += abs1(d->B(0) - s->B(0));
-val += abs1(d->B(1) - s->B(1));
-val += abs1(d->B(2) - s->B(2));
-val += abs1(d->B(3) - s->B(3));
-val += abs1(d->B(4) - s->B(4));
-val += abs1(d->B(5) - s->B(5));
-val += abs1(d->B(6) - s->B(6));
-val += abs1(d->B(7) - s->B(7));
+val += abs1(v->B(0) - s->B(0));
+val += abs1(v->B(1) - s->B(1));
+val += abs1(v->B(2) - s->B(2));
+val += abs1(v->B(3) - s->B(3));
+val += abs1(v->B(4) - s->B(4));
+val += abs1(v->B(5) - s->B(5));
+val += abs1(v->B(6) - s->B(6));
+val += abs1(v->B(7) - s->B(7));
 d->Q(0) = val;
-#if SHIFT == 1
+#if SHIFT >= 1
 val = 0;
-val += abs1(d->B(8) - s->B(8));
-val += abs1(d->B(9) - s->B(9));
-val += abs1(d->B(10) - s->B(10));
-val += abs1(d->B(11) - s->B(11));
-val += abs1(d->B(12) - s->B(12));
-val += abs1(d->B(13) - s->B(13));
-val += abs1(d->B(14) - s->B(14));
-val += abs1(d->B(15) - s->B(15));
+val += abs1(v->B(8) - s->B(8));
+val += abs1(v->B(9) - s->B(9));
+val += abs1(v->B(10) - s->B(10));
+val += abs1(v->B(11) - s->B(11));
+val += abs1(v->B(12) - s->B(12));
+val += abs1(v->B(13) - s->B(13));
+val += abs1(v->B(14) - s->B(14));
+val += abs1(v->B(15) - s->B(15));
 d->Q(1) = val;
+#if SHIFT == 2
+val = 0;
+val += abs1(v->B(16) - s->B(16));
+val += abs1(v->B(17) - s->B(17));
+val += abs1(v->B(18) - s->B(18));
+val += abs1(v->B(19) - s->B(19));
+val += abs1(v->B(20) - s->B(20));
+val += abs1(v->B(21) - s->B(21));
+val += abs1(v->B(22) - s->B(22));
+val += abs1(v->B(23) - s->B(23));
+d->Q(2) = val;
+val = 0;
+val += abs1(v->B(24) - s->B(24));
+val += abs1(v->B(25) - s->B(25));
+val += abs1(v->B(26) - s->B(26));
+val += abs1(v->B(27) - s->B(27));
+val += abs1(v->B(28) - s->B(28));
+val += abs1(v->B(29) - s->B(29));
+val += abs1(v->B(30) - s->B(30));
+val += abs1(v->B(31) - s->B(31));
+d->Q(3) = val;
+#endif
 #endif
 }
 
@@ -627,8 +657,12 @@ void glue(helper_movl_mm_T0, SUFFIX)(Reg *d, uint32_t val)
 {
 d->L(0) = val;
 d->L(1) = 0;
-#if SHIFT == 1
+#if SHIFT >= 1
 d->Q(1) = 0;
+#if SHIFT == 2
+d->Q(2) = 0;
+d->Q(3) = 0;
+#endif
 #endif
 }
 
@@ -636,8 +670,12 @@ void glue(helper_movl_mm_T0, SUFFIX)(Reg *d, uint32_t val)
 void glue(helper_movq_mm_T0, SUFFIX)(Reg *d, uint64_t val)
 {
 d->Q(0) = val;
-#if SHIFT == 1
+#if SHIFT >= 1
 d->Q(1) = 0;
+#if SHIFT == 2
+d->Q(2) = 0;
+d->Q(3) = 0;
+#endif
 #endif
 }
 #endif
@@ -1251,7 +1289,7 @@ uint32_t glue(helper_pmovmskb, SUFFIX)(CPUX86State *env, 
Reg *s)
 val |= (s->B(5) >> 2) & 0x20;
 val |= (s->B(6) >> 1) & 0x40;
 val |= (s->B(7)) & 0x80;
-#if SHIFT == 1
+#if SHIFT >= 1
 val |= (s->B(8) << 1) & 0x0100;
 val |= (s->B(9) << 2) & 0x0200;
 val |= (s->B(10) << 3) & 0x0400;
@@ -1260,6 +1298,24 @@ uint32_t glue(helper_pmovmskb, SUFFIX)(CPUX86State *env, 
Reg *s)
 val |= (s->B(13) << 6) & 0x2000;
 val |= (s->B(14) << 7) & 0x4000;
 val |= (s->B(15) << 8) & 0x8000;
+#if SHIFT == 2
+val |= ((uint32_t)s->B(16) << 9) & 0x0001;
+val |= ((uint32_t)s->B(17) << 10) & 0x0002;
+val |= ((uint32_t)s->B(18) << 11) & 0x0004;
+val |= ((uint32_t)s->B(19) << 12) & 0x0008;
+val |= ((uint32_t)s->B(20) << 13) & 0x0010;
+val |= ((uint32_t)s->B(21) << 14) & 

[PATCH v2 20/42] i386: AVX pclmulqdq

2022-04-24 Thread Paul Brook
Make the pclmulqdq helper AVX ready

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h | 31 ---
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 9f388b02b9..b7100fdce1 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -2885,14 +2885,14 @@ target_ulong helper_crc32(uint32_t crc1, target_ulong 
msg, uint32_t len)
 
 #endif
 
-void glue(helper_pclmulqdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
-uint32_t ctrl)
+#if SHIFT == 1
+static void clmulq(uint64_t *dest_l, uint64_t *dest_h,
+  uint64_t a, uint64_t b)
 {
-uint64_t ah, al, b, resh, resl;
+uint64_t al, ah, resh, resl;
 
 ah = 0;
-al = d->Q((ctrl & 1) != 0);
-b = s->Q((ctrl & 16) != 0);
+al = a;
 resh = resl = 0;
 
 while (b) {
@@ -2905,8 +2905,25 @@ void glue(helper_pclmulqdq, SUFFIX)(CPUX86State *env, 
Reg *d, Reg *s,
 b >>= 1;
 }
 
-d->Q(0) = resl;
-d->Q(1) = resh;
+*dest_l = resl;
+*dest_h = resh;
+}
+#endif
+
+void glue(helper_pclmulqdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
+uint32_t ctrl)
+{
+Reg *v = d;
+uint64_t a, b;
+
+a = v->Q((ctrl & 1) != 0);
+b = s->Q((ctrl & 16) != 0);
+clmulq(>Q(0), >Q(1), a, b);
+#if SHIFT == 2
+a = v->Q(((ctrl & 1) != 0) + 2);
+b = s->Q(((ctrl & 16) != 0) + 2);
+clmulq(>Q(2), >Q(3), a, b);
+#endif
 }
 
 void glue(helper_aesdec, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
-- 
2.36.0




[PATCH v2 32/42] i386: Implement VTEST

2022-04-24 Thread Paul Brook
Nothing special here

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 28 
 target/i386/ops_sse_header.h |  2 ++
 target/i386/tcg/translate.c  |  2 ++
 3 files changed, 32 insertions(+)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 8f2bd48394..edf14a25d7 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -3212,6 +3212,34 @@ SSE_HELPER_Q(helper_vpsrlvq, FPSRLVQ)
 SSE_HELPER_Q(helper_vpsravq, FPSRAVQ)
 SSE_HELPER_Q(helper_vpsllvq, FPSLLVQ)
 
+void glue(helper_vtestps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+{
+uint32_t zf = (s->L(0) &  d->L(0)) | (s->L(1) &  d->L(1));
+uint32_t cf = (s->L(0) & ~d->L(0)) | (s->L(1) & ~d->L(1));
+
+zf |= (s->L(2) &  d->L(2)) | (s->L(3) &  d->L(3));
+cf |= (s->L(2) & ~d->L(2)) | (s->L(3) & ~d->L(3));
+#if SHIFT == 2
+zf |= (s->L(4) &  d->L(4)) | (s->L(5) &  d->L(5));
+cf |= (s->L(4) & ~d->L(4)) | (s->L(5) & ~d->L(5));
+zf |= (s->L(6) &  d->L(6)) | (s->L(7) &  d->L(7));
+cf |= (s->L(6) & ~d->L(6)) | (s->L(7) & ~d->L(7));
+#endif
+CC_SRC = ((zf >> 31) ? 0 : CC_Z) | ((cf >> 31) ? 0 : CC_C);
+}
+
+void glue(helper_vtestpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+{
+uint64_t zf = (s->Q(0) &  d->Q(0)) | (s->Q(1) &  d->Q(1));
+uint64_t cf = (s->Q(0) & ~d->Q(0)) | (s->Q(1) & ~d->Q(1));
+
+#if SHIFT == 2
+zf |= (s->Q(2) &  d->Q(2)) | (s->Q(3) &  d->Q(3));
+cf |= (s->Q(2) & ~d->Q(2)) | (s->Q(3) & ~d->Q(3));
+#endif
+CC_SRC = ((zf >> 63) ? 0 : CC_Z) | ((cf >> 63) ? 0 : CC_C);
+}
+
 #if SHIFT == 2
 void glue(helper_vbroadcastdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index 20db6c4240..8b93b8e6d6 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -427,6 +427,8 @@ DEF_HELPER_4(glue(vpsllvd, SUFFIX), void, env, Reg, Reg, 
Reg)
 DEF_HELPER_4(glue(vpsrlvq, SUFFIX), void, env, Reg, Reg, Reg)
 DEF_HELPER_4(glue(vpsravq, SUFFIX), void, env, Reg, Reg, Reg)
 DEF_HELPER_4(glue(vpsllvq, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_3(glue(vtestps, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_3(glue(vtestpd, SUFFIX), void, env, Reg, Reg)
 #if SHIFT == 2
 DEF_HELPER_3(glue(vbroadcastdq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_1(vzeroall, void, env)
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 4990470083..2fbb7bfcad 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3253,6 +3253,8 @@ static const struct SSEOpHelper_table6 sse_op_table6[256] 
= {
 [0x0b] = BINARY_OP_MMX(pmulhrsw, SSSE3),
 [0x0c] = BINARY_OP(vpermilps, AVX, 0),
 [0x0d] = BINARY_OP(vpermilpd, AVX, 0),
+[0x0e] = CMP_OP(vtestps, AVX),
+[0x0f] = CMP_OP(vtestpd, AVX),
 [0x10] = BLENDV_OP(pblendvb, SSE41, SSE_OPF_MMX),
 [0x14] = BLENDV_OP(blendvps, SSE41, 0),
 [0x15] = BLENDV_OP(blendvpd, SSE41, 0),
-- 
2.36.0




[PATCH v2 14/42] i386: Add size suffix to vector FP helpers

2022-04-24 Thread Paul Brook
For AVX we're going to need both 128 bit (xmm) and 256 bit (ymm) variants of
floating point helpers. Add the register type suffix to the existing
*PS and *PD helpers (SS and SD variants are only valid on 128 bit vectors)

No functional changes.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 48 ++--
 target/i386/ops_sse_header.h | 48 ++--
 target/i386/tcg/translate.c  | 37 +--
 3 files changed, 67 insertions(+), 66 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index c645d2ddbf..fc8fd57aa5 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -699,7 +699,7 @@ void glue(helper_pshufw, SUFFIX)(Reg *d, Reg *s, int order)
 SHUFFLE4(W, s, s, 0);
 }
 #else
-void helper_shufps(Reg *d, Reg *s, int order)
+void glue(helper_shufps, SUFFIX)(Reg *d, Reg *s, int order)
 {
 Reg *v = d;
 uint32_t r0, r1, r2, r3;
@@ -710,7 +710,7 @@ void helper_shufps(Reg *d, Reg *s, int order)
 #endif
 }
 
-void helper_shufpd(Reg *d, Reg *s, int order)
+void glue(helper_shufpd, SUFFIX)(Reg *d, Reg *s, int order)
 {
 Reg *v = d;
 uint64_t r0, r1;
@@ -767,7 +767,7 @@ void glue(helper_pshufhw, SUFFIX)(Reg *d, Reg *s, int order)
 /* XXX: not accurate */
 
 #define SSE_HELPER_S(name, F)   \
-void helper_ ## name ## ps(CPUX86State *env, Reg *d, Reg *s)\
+void glue(helper_ ## name ## ps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)\
 {   \
 d->ZMM_S(0) = F(32, d->ZMM_S(0), s->ZMM_S(0));  \
 d->ZMM_S(1) = F(32, d->ZMM_S(1), s->ZMM_S(1));  \
@@ -780,7 +780,7 @@ void glue(helper_pshufhw, SUFFIX)(Reg *d, Reg *s, int order)
 d->ZMM_S(0) = F(32, d->ZMM_S(0), s->ZMM_S(0));  \
 }   \
 \
-void helper_ ## name ## pd(CPUX86State *env, Reg *d, Reg *s)\
+void glue(helper_ ## name ## pd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)\
 {   \
 d->ZMM_D(0) = F(64, d->ZMM_D(0), s->ZMM_D(0));  \
 d->ZMM_D(1) = F(64, d->ZMM_D(1), s->ZMM_D(1));  \
@@ -816,7 +816,7 @@ SSE_HELPER_S(sqrt, FPU_SQRT)
 
 
 /* float to float conversions */
-void helper_cvtps2pd(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_cvtps2pd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
 float32 s0, s1;
 
@@ -826,7 +826,7 @@ void helper_cvtps2pd(CPUX86State *env, Reg *d, Reg *s)
 d->ZMM_D(1) = float32_to_float64(s1, >sse_status);
 }
 
-void helper_cvtpd2ps(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_cvtpd2ps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
 d->ZMM_S(0) = float64_to_float32(s->ZMM_D(0), >sse_status);
 d->ZMM_S(1) = float64_to_float32(s->ZMM_D(1), >sse_status);
@@ -844,7 +844,7 @@ void helper_cvtsd2ss(CPUX86State *env, Reg *d, Reg *s)
 }
 
 /* integer to float */
-void helper_cvtdq2ps(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_cvtdq2ps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
 d->ZMM_S(0) = int32_to_float32(s->ZMM_L(0), >sse_status);
 d->ZMM_S(1) = int32_to_float32(s->ZMM_L(1), >sse_status);
@@ -852,7 +852,7 @@ void helper_cvtdq2ps(CPUX86State *env, Reg *d, Reg *s)
 d->ZMM_S(3) = int32_to_float32(s->ZMM_L(3), >sse_status);
 }
 
-void helper_cvtdq2pd(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_cvtdq2pd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
 int32_t l0, l1;
 
@@ -929,7 +929,7 @@ WRAP_FLOATCONV(int64_t, float32_to_int64_round_to_zero, 
float32, INT64_MIN)
 WRAP_FLOATCONV(int64_t, float64_to_int64, float64, INT64_MIN)
 WRAP_FLOATCONV(int64_t, float64_to_int64_round_to_zero, float64, INT64_MIN)
 
-void helper_cvtps2dq(CPUX86State *env, ZMMReg *d, ZMMReg *s)
+void glue(helper_cvtps2dq, SUFFIX)(CPUX86State *env, ZMMReg *d, ZMMReg *s)
 {
 d->ZMM_L(0) = x86_float32_to_int32(s->ZMM_S(0), >sse_status);
 d->ZMM_L(1) = x86_float32_to_int32(s->ZMM_S(1), >sse_status);
@@ -937,7 +937,7 @@ void helper_cvtps2dq(CPUX86State *env, ZMMReg *d, ZMMReg *s)
 d->ZMM_L(3) = x86_float32_to_int32(s->ZMM_S(3), >sse_status);
 }
 
-void helper_cvtpd2dq(CPUX86State *env, ZMMReg *d, ZMMReg *s)
+void glue(helper_cvtpd2dq, SUFFIX)(CPUX86State *env, ZMMReg *d, ZMMReg *s)
 {
 d->ZMM_L(0) = x86_float64_to_int32(s->ZMM_D(0), >sse_status);
 d->ZMM_L(1) = x86_float64_to_int32(s->ZMM_D(1), >sse_status);
@@ -979,7 +979,7 @@ int64_t helper_cvtsd2sq(CPUX86State *env, ZMMReg *s)
 #endif
 
 /* float to integer truncated */
-void helper_cvttps2dq(CPUX86State *env, ZMMReg *d, ZMMReg *s)
+void glue(helper_cvttps2dq, SUFFIX)(CPUX86State *env, ZMMReg *d, ZMMReg *s)
 {
 d->ZMM_L(0) = 

[PATCH v2 27/42] i386: Translate 256 bit AVX instructions

2022-04-24 Thread Paul Brook
All the work for the helper functions is already done, we just need to build
them, and a few macro tweaks to populate the lookup tables.

For sse_op_table6 and sse_op_table7 we use #defines to fill in the entries
where an opcode only supports one vector size, rather than complicating the
main table.

Several of the open-coded mov type instructions need special handling, but most
of the rest falls out from the infrastructure we already added.

Also clear the top half of the register after 128 bit VEX register writes.
In the current code this correlates with VEX.L == 0, but there are exceptions
later.

Signed-off-by: Paul Brook 
---
 target/i386/helper.h |   2 +
 target/i386/tcg/fpu_helper.c |   3 +
 target/i386/tcg/translate.c  | 370 +--
 3 files changed, 319 insertions(+), 56 deletions(-)

diff --git a/target/i386/helper.h b/target/i386/helper.h
index ac3b4d1ee3..3da5df98b9 100644
--- a/target/i386/helper.h
+++ b/target/i386/helper.h
@@ -218,6 +218,8 @@ DEF_HELPER_3(movq, void, env, ptr, ptr)
 #include "ops_sse_header.h"
 #define SHIFT 1
 #include "ops_sse_header.h"
+#define SHIFT 2
+#include "ops_sse_header.h"
 
 DEF_HELPER_3(rclb, tl, env, tl, tl)
 DEF_HELPER_3(rclw, tl, env, tl, tl)
diff --git a/target/i386/tcg/fpu_helper.c b/target/i386/tcg/fpu_helper.c
index b391b69635..74cf86c986 100644
--- a/target/i386/tcg/fpu_helper.c
+++ b/target/i386/tcg/fpu_helper.c
@@ -3053,3 +3053,6 @@ void helper_movq(CPUX86State *env, void *d, void *s)
 
 #define SHIFT 1
 #include "ops_sse.h"
+
+#define SHIFT 2
+#include "ops_sse.h"
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 278ed8ed1c..bcd6d47fd0 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2742,6 +2742,29 @@ static inline void gen_ldo_env_A0(DisasContext *s, int 
offset)
 tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(1)));
 }
 
+static inline void gen_ldo_env_A0_ymmh(DisasContext *s, int offset)
+{
+int mem_index = s->mem_index;
+tcg_gen_qemu_ld_i64(s->tmp1_i64, s->A0, mem_index, MO_LEUQ);
+tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(2)));
+tcg_gen_addi_tl(s->tmp0, s->A0, 8);
+tcg_gen_qemu_ld_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
+tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(3)));
+}
+
+/* Load 256-bit ymm register value */
+static inline void gen_ldy_env_A0(DisasContext *s, int offset)
+{
+int mem_index = s->mem_index;
+gen_ldo_env_A0(s, offset);
+tcg_gen_addi_tl(s->tmp0, s->A0, 16);
+tcg_gen_qemu_ld_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
+tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(2)));
+tcg_gen_addi_tl(s->tmp0, s->A0, 24);
+tcg_gen_qemu_ld_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
+tcg_gen_st_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(3)));
+}
+
 static inline void gen_sto_env_A0(DisasContext *s, int offset)
 {
 int mem_index = s->mem_index;
@@ -2752,6 +2775,29 @@ static inline void gen_sto_env_A0(DisasContext *s, int 
offset)
 tcg_gen_qemu_st_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
 }
 
+static inline void gen_sto_env_A0_ymmh(DisasContext *s, int offset)
+{
+int mem_index = s->mem_index;
+tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(2)));
+tcg_gen_qemu_st_i64(s->tmp1_i64, s->A0, mem_index, MO_LEUQ);
+tcg_gen_addi_tl(s->tmp0, s->A0, 8);
+tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(3)));
+tcg_gen_qemu_st_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
+}
+
+/* Store 256-bit ymm register value */
+static inline void gen_sty_env_A0(DisasContext *s, int offset)
+{
+int mem_index = s->mem_index;
+gen_sto_env_A0(s, offset);
+tcg_gen_addi_tl(s->tmp0, s->A0, 16);
+tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(2)));
+tcg_gen_qemu_st_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
+tcg_gen_addi_tl(s->tmp0, s->A0, 24);
+tcg_gen_ld_i64(s->tmp1_i64, cpu_env, offset + offsetof(ZMMReg, ZMM_Q(3)));
+tcg_gen_qemu_st_i64(s->tmp1_i64, s->tmp0, mem_index, MO_LEUQ);
+}
+
 static inline void gen_op_movo(DisasContext *s, int d_offset, int s_offset)
 {
 tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset + offsetof(ZMMReg, 
ZMM_Q(0)));
@@ -2760,6 +2806,14 @@ static inline void gen_op_movo(DisasContext *s, int 
d_offset, int s_offset)
 tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset + offsetof(ZMMReg, 
ZMM_Q(1)));
 }
 
+static inline void gen_op_movo_ymmh(DisasContext *s, int d_offset, int 
s_offset)
+{
+tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset + offsetof(ZMMReg, 
ZMM_Q(2)));
+tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset + offsetof(ZMMReg, 
ZMM_Q(2)));
+tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset + offsetof(ZMMReg, 
ZMM_Q(3)));
+tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset + offsetof(ZMMReg, 
ZMM_Q(3)));
+}
+
 static inline void 

[PATCH v2 15/42] i386: Floating point atithmetic helper AVX prep

2022-04-24 Thread Paul Brook
Prepare the "easy" floating point vector helpers for AVX

No functional changes to existing helpers.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h | 144 ++
 1 file changed, 119 insertions(+), 25 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index fc8fd57aa5..d308a1ec40 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -762,40 +762,66 @@ void glue(helper_pshufhw, SUFFIX)(Reg *d, Reg *s, int 
order)
 }
 #endif
 
-#if SHIFT == 1
+#if SHIFT >= 1
 /* FPU ops */
 /* XXX: not accurate */
 
-#define SSE_HELPER_S(name, F)   \
-void glue(helper_ ## name ## ps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)\
+#define SSE_HELPER_P(name, F)   \
+void glue(helper_ ## name ## ps, SUFFIX)(CPUX86State *env,  \
+Reg *d, Reg *s) \
 {   \
-d->ZMM_S(0) = F(32, d->ZMM_S(0), s->ZMM_S(0));  \
-d->ZMM_S(1) = F(32, d->ZMM_S(1), s->ZMM_S(1));  \
-d->ZMM_S(2) = F(32, d->ZMM_S(2), s->ZMM_S(2));  \
-d->ZMM_S(3) = F(32, d->ZMM_S(3), s->ZMM_S(3));  \
+Reg *v = d; \
+d->ZMM_S(0) = F(32, v->ZMM_S(0), s->ZMM_S(0));  \
+d->ZMM_S(1) = F(32, v->ZMM_S(1), s->ZMM_S(1));  \
+d->ZMM_S(2) = F(32, v->ZMM_S(2), s->ZMM_S(2));  \
+d->ZMM_S(3) = F(32, v->ZMM_S(3), s->ZMM_S(3));  \
+YMM_ONLY(   \
+d->ZMM_S(4) = F(32, v->ZMM_S(4), s->ZMM_S(4));  \
+d->ZMM_S(5) = F(32, v->ZMM_S(5), s->ZMM_S(5));  \
+d->ZMM_S(6) = F(32, v->ZMM_S(6), s->ZMM_S(6));  \
+d->ZMM_S(7) = F(32, v->ZMM_S(7), s->ZMM_S(7));  \
+)   \
 }   \
 \
-void helper_ ## name ## ss(CPUX86State *env, Reg *d, Reg *s)\
+void glue(helper_ ## name ## pd, SUFFIX)(CPUX86State *env,  \
+Reg *d, Reg *s) \
 {   \
-d->ZMM_S(0) = F(32, d->ZMM_S(0), s->ZMM_S(0));  \
-}   \
+Reg *v = d; \
+d->ZMM_D(0) = F(64, v->ZMM_D(0), s->ZMM_D(0));  \
+d->ZMM_D(1) = F(64, v->ZMM_D(1), s->ZMM_D(1));  \
+YMM_ONLY(   \
+d->ZMM_D(2) = F(64, v->ZMM_D(2), s->ZMM_D(2));  \
+d->ZMM_D(3) = F(64, v->ZMM_D(3), s->ZMM_D(3));  \
+)   \
+}
+
+#if SHIFT == 1
+
+#define SSE_HELPER_S(name, F)   \
+SSE_HELPER_P(name, F)   \
 \
-void glue(helper_ ## name ## pd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)\
+void helper_ ## name ## ss(CPUX86State *env, Reg *d, Reg *s)\
 {   \
-d->ZMM_D(0) = F(64, d->ZMM_D(0), s->ZMM_D(0));  \
-d->ZMM_D(1) = F(64, d->ZMM_D(1), s->ZMM_D(1));  \
+Reg *v = d; \
+d->ZMM_S(0) = F(32, v->ZMM_S(0), s->ZMM_S(0));  \
 }   \
 \
-void helper_ ## name ## sd(CPUX86State *env, Reg *d, Reg *s)\
+void helper_ ## name ## sd(CPUX86State *env, Reg *d, Reg *s)\
 {   \
-d->ZMM_D(0) = F(64, d->ZMM_D(0), s->ZMM_D(0));  \
+Reg *v = d; \
+d->ZMM_D(0) = F(64, v->ZMM_D(0), s->ZMM_D(0));  \
 }
 
+#else
+
+#define SSE_HELPER_S(name, F) SSE_HELPER_P(name, F)
+
+#endif
+
 #define FPU_ADD(size, a, b) float ## size ## _add(a, b, >sse_status)
 #define FPU_SUB(size, a, b) float ## size ## _sub(a, b, >sse_status)
 #define FPU_MUL(size, a, b) float ## size ## _mul(a, b, >sse_status)
 #define FPU_DIV(size, a, b) float ## size ## _div(a, b, 

[PATCH v2 26/42] i386: Utility function for 128 bit AVX

2022-04-24 Thread Paul Brook
VEX encoded instructions that write to a (128 bit) xmm register clear the
rest (upper half) of the corresponding (256 bit) ymm register.
When legacy SSE encodings are used the rest of the ymm register is left
unchanged.

Add a utility function so that we don't have to keep duplicating this logic.

Signed-off-by: Paul Brook 
---
 target/i386/tcg/translate.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index d148a2319d..278ed8ed1c 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2780,6 +2780,18 @@ static inline void gen_op_movq_env_0(DisasContext *s, 
int d_offset)
 
 #define ZMM_OFFSET(reg) offsetof(CPUX86State, xmm_regs[reg])
 
+/*
+ * Clear the top half of the ymm register after a VEX.128 instruction
+ * This could be optimized by tracking this in env->hflags
+ */
+static void gen_clear_ymmh(DisasContext *s, int reg)
+{
+if (s->prefix & PREFIX_VEX) {
+gen_op_movq_env_0(s, offsetof(CPUX86State, xmm_regs[reg].ZMM_Q(2)));
+gen_op_movq_env_0(s, offsetof(CPUX86State, xmm_regs[reg].ZMM_Q(3)));
+}
+}
+
 typedef void (*SSEFunc_i_ep)(TCGv_i32 val, TCGv_ptr env, TCGv_ptr reg);
 typedef void (*SSEFunc_l_ep)(TCGv_i64 val, TCGv_ptr env, TCGv_ptr reg);
 typedef void (*SSEFunc_0_epi)(TCGv_ptr env, TCGv_ptr reg, TCGv_i32 val);
-- 
2.36.0




[PATCH v2 25/42] i386: VEX.V encodings (3 operand)

2022-04-24 Thread Paul Brook
Enable translation of VEX encoded AVX instructions.

The big change is the addition of an additional register operand in the VEX.V
field.  This is usually (but not always!) used to explicitly encode the
first source operand.

The changes to ops_sse.h and ops_sse_header.h are purely mechanical, with
previous changes ensuring that the relevant helper functions are ready to
handle the non-destructive source operand.

We now have a greater variety of operand patterns for the vector helper
functions. The SSE_OPF_* flags we added to the opcode lookup tables are used
to select between these. This includes e.g. pshufX and cmpX instructions
which were previously overridden by opcode.

One gotcha is the "scalar" vector instructions. The SSE encodings write a
single element to the destination and leave the remainder of the register
unchanged.  The VEX encodings copy the remainder of the destination from
the first source operand. If the operation only has a single source value,
then VEX.V encodes an additional operand which is copied to the
remainder of the destination.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 214 +--
 target/i386/ops_sse_header.h | 149 ++---
 target/i386/tcg/translate.c  | 399 +--
 3 files changed, 463 insertions(+), 299 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index e48dfc2fc5..ad3312d353 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -97,9 +97,8 @@
 #define FPSLL(x, c) ((x) << shift)
 #endif
 
-void glue(helper_psrlw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psrlw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-Reg *s = d;
 int shift;
 if (c->Q(0) > 15) {
 d->Q(0) = 0;
@@ -114,9 +113,8 @@ void glue(helper_psrlw, SUFFIX)(CPUX86State *env, Reg *d, 
Reg *c)
 }
 }
 
-void glue(helper_psllw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psllw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-Reg *s = d;
 int shift;
 if (c->Q(0) > 15) {
 d->Q(0) = 0;
@@ -131,9 +129,8 @@ void glue(helper_psllw, SUFFIX)(CPUX86State *env, Reg *d, 
Reg *c)
 }
 }
 
-void glue(helper_psraw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psraw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-Reg *s = d;
 int shift;
 if (c->Q(0) > 15) {
 shift = 15;
@@ -143,9 +140,8 @@ void glue(helper_psraw, SUFFIX)(CPUX86State *env, Reg *d, 
Reg *c)
 SHIFT_HELPER_BODY(4 << SHIFT, W, FPSRAW);
 }
 
-void glue(helper_psrld, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psrld, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-Reg *s = d;
 int shift;
 if (c->Q(0) > 31) {
 d->Q(0) = 0;
@@ -160,9 +156,8 @@ void glue(helper_psrld, SUFFIX)(CPUX86State *env, Reg *d, 
Reg *c)
 }
 }
 
-void glue(helper_pslld, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_pslld, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-Reg *s = d;
 int shift;
 if (c->Q(0) > 31) {
 d->Q(0) = 0;
@@ -177,9 +172,8 @@ void glue(helper_pslld, SUFFIX)(CPUX86State *env, Reg *d, 
Reg *c)
 }
 }
 
-void glue(helper_psrad, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psrad, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-Reg *s = d;
 int shift;
 if (c->Q(0) > 31) {
 shift = 31;
@@ -189,9 +183,8 @@ void glue(helper_psrad, SUFFIX)(CPUX86State *env, Reg *d, 
Reg *c)
 SHIFT_HELPER_BODY(2 << SHIFT, L, FPSRAL);
 }
 
-void glue(helper_psrlq, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psrlq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-Reg *s = d;
 int shift;
 if (c->Q(0) > 63) {
 d->Q(0) = 0;
@@ -206,9 +199,8 @@ void glue(helper_psrlq, SUFFIX)(CPUX86State *env, Reg *d, 
Reg *c)
 }
 }
 
-void glue(helper_psllq, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psllq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-Reg *s = d;
 int shift;
 if (c->Q(0) > 63) {
 d->Q(0) = 0;
@@ -224,9 +216,8 @@ void glue(helper_psllq, SUFFIX)(CPUX86State *env, Reg *d, 
Reg *c)
 }
 
 #if SHIFT >= 1
-void glue(helper_psrldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_psrldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-Reg *s = d;
 int shift, i;
 
 shift = c->L(0);
@@ -249,9 +240,8 @@ void glue(helper_psrldq, SUFFIX)(CPUX86State *env, Reg *d, 
Reg *c)
 #endif
 }
 
-void glue(helper_pslldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
+void glue(helper_pslldq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, Reg *c)
 {
-Reg *s = d;
 int shift, i;
 
 shift = c->L(0);
@@ -321,9 +311,8 @@ void glue(helper_pslldq, SUFFIX)(CPUX86State *env, Reg *d, 
Reg *c)
 }
 
 #define SSE_HELPER_B(name, F)   \
-void glue(name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s) \
+void 

[PATCH v2 36/42] i386: Implement VINSERT128/VEXTRACT128

2022-04-24 Thread Paul Brook
128-bit vinsert/vextract instructions. The integer and floating point variants
have the same semantics.

This is where we encounter an instruction encoded with VEX.L == 1 and
a 128 bit (xmm) destination operand.

Signed-off-by: Paul Brook 
---
 target/i386/tcg/translate.c | 78 +
 1 file changed, 78 insertions(+)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 5a11d3c083..4072fa28d3 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2814,6 +2814,24 @@ static inline void gen_op_movo_ymmh(DisasContext *s, int 
d_offset, int s_offset)
 tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset + offsetof(ZMMReg, 
ZMM_Q(3)));
 }
 
+static inline void gen_op_movo_ymm_l2h(DisasContext *s,
+   int d_offset, int s_offset)
+{
+tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset + offsetof(ZMMReg, 
ZMM_Q(0)));
+tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset + offsetof(ZMMReg, 
ZMM_Q(2)));
+tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset + offsetof(ZMMReg, 
ZMM_Q(1)));
+tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset + offsetof(ZMMReg, 
ZMM_Q(3)));
+}
+
+static inline void gen_op_movo_ymm_h2l(DisasContext *s,
+   int d_offset, int s_offset)
+{
+tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset + offsetof(ZMMReg, 
ZMM_Q(2)));
+tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset + offsetof(ZMMReg, 
ZMM_Q(0)));
+tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset + offsetof(ZMMReg, 
ZMM_Q(3)));
+tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset + offsetof(ZMMReg, 
ZMM_Q(1)));
+}
+
 static inline void gen_op_movq(DisasContext *s, int d_offset, int s_offset)
 {
 tcg_gen_ld_i64(s->tmp1_i64, cpu_env, s_offset);
@@ -3353,9 +3371,13 @@ static const struct SSEOpHelper_table7 
sse_op_table7[256] = {
 [0x15] = SPECIAL_OP(SSE41), /* pextrw */
 [0x16] = SPECIAL_OP(SSE41), /* pextrd/pextrq */
 [0x17] = SPECIAL_OP(SSE41), /* extractps */
+[0x18] = SPECIAL_OP(AVX), /* vinsertf128 */
+[0x19] = SPECIAL_OP(AVX), /* vextractf128 */
 [0x20] = SPECIAL_OP(SSE41), /* pinsrb */
 [0x21] = SPECIAL_OP(SSE41), /* insertps */
 [0x22] = SPECIAL_OP(SSE41), /* pinsrd/pinsrq */
+[0x38] = SPECIAL_OP(AVX), /* vinserti128 */
+[0x39] = SPECIAL_OP(AVX), /* vextracti128 */
 [0x40] = BINARY_OP(dpps, SSE41, 0),
 #define gen_helper_dppd_ymm NULL
 [0x41] = BINARY_OP(dppd, SSE41, 0),
@@ -5145,6 +5167,62 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 }
 gen_clear_ymmh(s, reg);
 break;
+case 0x38: /* vinserti128 */
+CHECK_AVX2_256(s);
+/* fall through */
+case 0x18: /* vinsertf128 */
+CHECK_AVX(s);
+if ((s->prefix & PREFIX_VEX) == 0 || s->vex_l == 0) {
+goto illegal_op;
+}
+if (mod == 3) {
+if (val & 1) {
+gen_op_movo_ymm_l2h(s, ZMM_OFFSET(reg),
+ZMM_OFFSET(rm));
+} else {
+gen_op_movo(s, ZMM_OFFSET(reg), ZMM_OFFSET(rm));
+}
+} else {
+if (val & 1) {
+gen_ldo_env_A0_ymmh(s, ZMM_OFFSET(reg));
+} else {
+gen_ldo_env_A0(s, ZMM_OFFSET(reg));
+}
+}
+if (reg != reg_v) {
+if (val & 1) {
+gen_op_movo(s, ZMM_OFFSET(reg), ZMM_OFFSET(reg_v));
+} else {
+gen_op_movo_ymmh(s, ZMM_OFFSET(reg),
+ ZMM_OFFSET(reg_v));
+}
+}
+break;
+case 0x39: /* vextracti128 */
+CHECK_AVX2_256(s);
+/* fall through */
+case 0x19: /* vextractf128 */
+CHECK_AVX_V0(s);
+if ((s->prefix & PREFIX_VEX) == 0 || s->vex_l == 0) {
+goto illegal_op;
+}
+if (mod == 3) {
+op1_offset = ZMM_OFFSET(rm);
+if (val & 1) {
+gen_op_movo_ymm_h2l(s, ZMM_OFFSET(rm),
+ZMM_OFFSET(reg));
+} else {
+gen_op_movo(s, ZMM_OFFSET(rm), ZMM_OFFSET(reg));
+}
+gen_clear_ymmh(s, rm);
+} else{
+if (val & 1) {
+gen_sto_env_A0_ymmh(s, ZMM_OFFSET(reg));
+   

[PATCH v2 21/42] i386: AVX+AES helpers

2022-04-24 Thread Paul Brook
Make the AES vector helpers AVX ready

No functional changes to existing helpers

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 63 ++--
 target/i386/ops_sse_header.h | 55 ++-
 2 files changed, 85 insertions(+), 33 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index b7100fdce1..48cec40074 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -2929,64 +2929,92 @@ void glue(helper_pclmulqdq, SUFFIX)(CPUX86State *env, 
Reg *d, Reg *s,
 void glue(helper_aesdec, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
 int i;
-Reg st = *d;
+Reg st = *d; // v
 Reg rk = *s;
 
 for (i = 0 ; i < 4 ; i++) {
-d->L(i) = rk.L(i) ^ bswap32(AES_Td0[st.B(AES_ishifts[4*i+0])] ^
-AES_Td1[st.B(AES_ishifts[4*i+1])] ^
-AES_Td2[st.B(AES_ishifts[4*i+2])] ^
-AES_Td3[st.B(AES_ishifts[4*i+3])]);
+d->L(i) = rk.L(i) ^ bswap32(AES_Td0[st.B(AES_ishifts[4 * i + 0])] ^
+AES_Td1[st.B(AES_ishifts[4 * i + 1])] ^
+AES_Td2[st.B(AES_ishifts[4 * i + 2])] ^
+AES_Td3[st.B(AES_ishifts[4 * i + 3])]);
 }
+#if SHIFT == 2
+for (i = 0 ; i < 4 ; i++) {
+d->L(i + 4) = rk.L(i + 4) ^ bswap32(
+AES_Td0[st.B(AES_ishifts[4 * i + 0] + 16)] ^
+AES_Td1[st.B(AES_ishifts[4 * i + 1] + 16)] ^
+AES_Td2[st.B(AES_ishifts[4 * i + 2] + 16)] ^
+AES_Td3[st.B(AES_ishifts[4 * i + 3] + 16)]);
+}
+#endif
 }
 
 void glue(helper_aesdeclast, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
 int i;
-Reg st = *d;
+Reg st = *d; // v
 Reg rk = *s;
 
 for (i = 0; i < 16; i++) {
 d->B(i) = rk.B(i) ^ (AES_isbox[st.B(AES_ishifts[i])]);
 }
+#if SHIFT == 2
+for (i = 0; i < 16; i++) {
+d->B(i + 16) = rk.B(i + 16) ^ (AES_isbox[st.B(AES_ishifts[i] + 16)]);
+}
+#endif
 }
 
 void glue(helper_aesenc, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
 int i;
-Reg st = *d;
+Reg st = *d; // v
 Reg rk = *s;
 
 for (i = 0 ; i < 4 ; i++) {
-d->L(i) = rk.L(i) ^ bswap32(AES_Te0[st.B(AES_shifts[4*i+0])] ^
-AES_Te1[st.B(AES_shifts[4*i+1])] ^
-AES_Te2[st.B(AES_shifts[4*i+2])] ^
-AES_Te3[st.B(AES_shifts[4*i+3])]);
+d->L(i) = rk.L(i) ^ bswap32(AES_Te0[st.B(AES_shifts[4 * i + 0])] ^
+AES_Te1[st.B(AES_shifts[4 * i + 1])] ^
+AES_Te2[st.B(AES_shifts[4 * i + 2])] ^
+AES_Te3[st.B(AES_shifts[4 * i + 3])]);
 }
+#if SHIFT == 2
+for (i = 0 ; i < 4 ; i++) {
+d->L(i + 4) = rk.L(i + 4) ^ bswap32(
+AES_Te0[st.B(AES_shifts[4 * i + 0] + 16)] ^
+AES_Te1[st.B(AES_shifts[4 * i + 1] + 16)] ^
+AES_Te2[st.B(AES_shifts[4 * i + 2] + 16)] ^
+AES_Te3[st.B(AES_shifts[4 * i + 3] + 16)]);
+}
+#endif
 }
 
 void glue(helper_aesenclast, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
 int i;
-Reg st = *d;
+Reg st = *d; // v
 Reg rk = *s;
 
 for (i = 0; i < 16; i++) {
 d->B(i) = rk.B(i) ^ (AES_sbox[st.B(AES_shifts[i])]);
 }
-
+#if SHIFT == 2
+for (i = 0; i < 16; i++) {
+d->B(i + 16) = rk.B(i + 16) ^ (AES_sbox[st.B(AES_shifts[i] + 16)]);
+}
+#endif
 }
 
+#if SHIFT == 1
 void glue(helper_aesimc, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
 int i;
 Reg tmp = *s;
 
 for (i = 0 ; i < 4 ; i++) {
-d->L(i) = bswap32(AES_imc[tmp.B(4*i+0)][0] ^
-  AES_imc[tmp.B(4*i+1)][1] ^
-  AES_imc[tmp.B(4*i+2)][2] ^
-  AES_imc[tmp.B(4*i+3)][3]);
+d->L(i) = bswap32(AES_imc[tmp.B(4 * i + 0)][0] ^
+  AES_imc[tmp.B(4 * i + 1)][1] ^
+  AES_imc[tmp.B(4 * i + 2)][2] ^
+  AES_imc[tmp.B(4 * i + 3)][3]);
 }
 }
 
@@ -3004,6 +3032,7 @@ void glue(helper_aeskeygenassist, SUFFIX)(CPUX86State 
*env, Reg *d, Reg *s,
 d->L(3) = (d->L(2) << 24 | d->L(2) >> 8) ^ ctrl;
 }
 #endif
+#endif
 
 #undef SSE_HELPER_S
 
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index b8b0666f61..203afbb5a1 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -47,7 +47,7 @@ DEF_HELPER_3(glue(pslld, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(psrlq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(psllq, SUFFIX), void, env, Reg, Reg)
 
-#if SHIFT == 1
+#if SHIFT >= 1
 DEF_HELPER_3(glue(psrldq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(pslldq, SUFFIX), void, env, Reg, Reg)
 #endif
@@ -105,7 +105,7 @@ SSE_HELPER_L(pcmpeql, 

[PATCH v2 35/42] i386: Implement VPERM

2022-04-24 Thread Paul Brook
A set of shuffle operations that operate on complete 256 bit registers.
The integer and floating point variants have identical semantics.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 73 
 target/i386/ops_sse_header.h |  3 ++
 target/i386/tcg/translate.c  |  9 +
 3 files changed, 85 insertions(+)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 14a2d1bf78..04d2006cd8 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -3407,6 +3407,79 @@ void helper_vzeroupper_hi8(CPUX86State *env)
 }
 }
 #endif
+
+void helper_vpermdq_ymm(CPUX86State *env,
+Reg *d, Reg *v, Reg *s, uint32_t order)
+{
+uint64_t r0, r1, r2, r3;
+
+switch (order & 3) {
+case 0:
+r0 = v->Q(0);
+r1 = v->Q(1);
+break;
+case 1:
+r0 = v->Q(2);
+r1 = v->Q(3);
+break;
+case 2:
+r0 = s->Q(0);
+r1 = s->Q(1);
+break;
+case 3:
+r0 = s->Q(2);
+r1 = s->Q(3);
+break;
+}
+switch ((order >> 4) & 3) {
+case 0:
+r2 = v->Q(0);
+r3 = v->Q(1);
+break;
+case 1:
+r2 = v->Q(2);
+r3 = v->Q(3);
+break;
+case 2:
+r2 = s->Q(0);
+r3 = s->Q(1);
+break;
+case 3:
+r2 = s->Q(2);
+r3 = s->Q(3);
+break;
+}
+d->Q(0) = r0;
+d->Q(1) = r1;
+d->Q(2) = r2;
+d->Q(3) = r3;
+}
+
+void helper_vpermq_ymm(CPUX86State *env, Reg *d, Reg *s, uint32_t order)
+{
+uint64_t r0, r1, r2, r3;
+r0 = s->Q(order & 3);
+r1 = s->Q((order >> 2) & 3);
+r2 = s->Q((order >> 4) & 3);
+r3 = s->Q((order >> 6) & 3);
+d->Q(0) = r0;
+d->Q(1) = r1;
+d->Q(2) = r2;
+d->Q(3) = r3;
+}
+
+void helper_vpermd_ymm(CPUX86State *env, Reg *d, Reg *v, Reg *s)
+{
+uint32_t r[8];
+int i;
+
+for (i = 0; i < 8; i++) {
+r[i] = s->L(v->L(i) & 7);
+}
+for (i = 0; i < 8; i++) {
+d->L(i) = r[i];
+}
+}
 #endif
 #endif
 
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index e5d8ea9bb7..099e6e8ffc 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -457,6 +457,9 @@ DEF_HELPER_1(vzeroupper, void, env)
 DEF_HELPER_1(vzeroall_hi8, void, env)
 DEF_HELPER_1(vzeroupper_hi8, void, env)
 #endif
+DEF_HELPER_5(vpermdq_ymm, void, env, Reg, Reg, Reg, i32)
+DEF_HELPER_4(vpermq_ymm, void, env, Reg, Reg, i32)
+DEF_HELPER_4(vpermd_ymm, void, env, Reg, Reg, Reg)
 #endif
 #endif
 
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index fe1ab58d07..5a11d3c083 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3258,6 +3258,8 @@ static const struct SSEOpHelper_table6 sse_op_table6[256] 
= {
 [0x10] = BLENDV_OP(pblendvb, SSE41, SSE_OPF_MMX),
 [0x14] = BLENDV_OP(blendvps, SSE41, 0),
 [0x15] = BLENDV_OP(blendvpd, SSE41, 0),
+#define gen_helper_vpermd_xmm NULL
+[0x16] = BINARY_OP(vpermd, AVX, SSE_OPF_AVX2), /* vpermps */
 [0x17] = CMP_OP(ptest, SSE41),
 /* TODO:Some vbroadcast variants require AVX2 */
 [0x18] = UNARY_OP(vbroadcastl, AVX, SSE_OPF_SCALAR), /* vbroadcastss */
@@ -3287,6 +3289,7 @@ static const struct SSEOpHelper_table6 sse_op_table6[256] 
= {
 [0x33] = UNARY_OP(pmovzxwd, SSE41, SSE_OPF_MMX),
 [0x34] = UNARY_OP(pmovzxwq, SSE41, SSE_OPF_MMX),
 [0x35] = UNARY_OP(pmovzxdq, SSE41, SSE_OPF_MMX),
+[0x36] = BINARY_OP(vpermd, AVX, SSE_OPF_AVX2), /* vpermd */
 [0x37] = BINARY_OP(pcmpgtq, SSE41, SSE_OPF_MMX),
 [0x38] = BINARY_OP(pminsb, SSE41, SSE_OPF_MMX),
 [0x39] = BINARY_OP(pminsd, SSE41, SSE_OPF_MMX),
@@ -3329,8 +3332,13 @@ static const struct SSEOpHelper_table6 
sse_op_table6[256] = {
 
 /* prefix [66] 0f 3a */
 static const struct SSEOpHelper_table7 sse_op_table7[256] = {
+#define gen_helper_vpermq_xmm NULL
+[0x00] = UNARY_OP(vpermq, AVX, SSE_OPF_AVX2),
+[0x01] = UNARY_OP(vpermq, AVX, SSE_OPF_AVX2), /* vpermpd */
 [0x04] = UNARY_OP(vpermilps_imm, AVX, 0),
 [0x05] = UNARY_OP(vpermilpd_imm, AVX, 0),
+#define gen_helper_vpermdq_xmm NULL
+[0x06] = BINARY_OP(vpermdq, AVX, 0), /* vperm2f128 */
 [0x08] = UNARY_OP(roundps, SSE41, 0),
 [0x09] = UNARY_OP(roundpd, SSE41, 0),
 #define gen_helper_roundss_ymm NULL
@@ -3353,6 +3361,7 @@ static const struct SSEOpHelper_table7 sse_op_table7[256] 
= {
 [0x41] = BINARY_OP(dppd, SSE41, 0),
 [0x42] = BINARY_OP(mpsadbw, SSE41, SSE_OPF_MMX),
 [0x44] = BINARY_OP(pclmulqdq, PCLMULQDQ, 0),
+[0x46] = BINARY_OP(vpermdq, AVX, SSE_OPF_AVX2), /* vperm2i128 */
 #define gen_helper_pcmpestrm_ymm NULL
 [0x60] = CMP_OP(pcmpestrm, SSE42),
 #define gen_helper_pcmpestri_ymm NULL
-- 
2.36.0




[PATCH v2 39/42] i386: Enable AVX cpuid bits when using TCG

2022-04-24 Thread Paul Brook
Include AVX and AVX2 in the guest cpuid features supported by TCG

Signed-off-by: Paul Brook 
---
 target/i386/cpu.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 99343be926..bd35233d5b 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -625,12 +625,12 @@ void x86_cpu_vendor_words2str(char *dst, uint32_t vendor1,
   CPUID_EXT_SSE41 | CPUID_EXT_SSE42 | CPUID_EXT_POPCNT | \
   CPUID_EXT_XSAVE | /* CPUID_EXT_OSXSAVE is dynamic */   \
   CPUID_EXT_MOVBE | CPUID_EXT_AES | CPUID_EXT_HYPERVISOR | \
-  CPUID_EXT_RDRAND)
+  CPUID_EXT_RDRAND | CPUID_EXT_AVX)
   /* missing:
   CPUID_EXT_DTES64, CPUID_EXT_DSCPL, CPUID_EXT_VMX, CPUID_EXT_SMX,
   CPUID_EXT_EST, CPUID_EXT_TM2, CPUID_EXT_CID, CPUID_EXT_FMA,
   CPUID_EXT_XTPR, CPUID_EXT_PDCM, CPUID_EXT_PCID, CPUID_EXT_DCA,
-  CPUID_EXT_X2APIC, CPUID_EXT_TSC_DEADLINE_TIMER, CPUID_EXT_AVX,
+  CPUID_EXT_X2APIC, CPUID_EXT_TSC_DEADLINE_TIMER,
   CPUID_EXT_F16C */
 
 #ifdef TARGET_X86_64
@@ -653,9 +653,9 @@ void x86_cpu_vendor_words2str(char *dst, uint32_t vendor1,
   CPUID_7_0_EBX_BMI1 | CPUID_7_0_EBX_BMI2 | CPUID_7_0_EBX_ADX | \
   CPUID_7_0_EBX_PCOMMIT | CPUID_7_0_EBX_CLFLUSHOPT |\
   CPUID_7_0_EBX_CLWB | CPUID_7_0_EBX_MPX | CPUID_7_0_EBX_FSGSBASE | \
-  CPUID_7_0_EBX_ERMS)
+  CPUID_7_0_EBX_ERMS | CPUID_7_0_EBX_AVX2)
   /* missing:
-  CPUID_7_0_EBX_HLE, CPUID_7_0_EBX_AVX2,
+  CPUID_7_0_EBX_HLE
   CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
   CPUID_7_0_EBX_RDSEED */
 #define TCG_7_0_ECX_FEATURES (CPUID_7_0_ECX_UMIP | CPUID_7_0_ECX_PKU | \
-- 
2.36.0




[PATCH v2 42/42] i386: Add sha512-avx test

2022-04-24 Thread Paul Brook
Include sha512 built with avx[2] in the tcg tests.

Signed-off-by: Paul Brook 
---
 tests/tcg/i386/Makefile.target | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/tests/tcg/i386/Makefile.target b/tests/tcg/i386/Makefile.target
index eb06f7eb89..a0335fff6d 100644
--- a/tests/tcg/i386/Makefile.target
+++ b/tests/tcg/i386/Makefile.target
@@ -79,7 +79,14 @@ sha512-sse: sha512.c
 run-sha512-sse: QEMU_OPTS+=-cpu max
 run-plugin-sha512-sse-with-%: QEMU_OPTS+=-cpu max
 
-TESTS+=sha512-sse
+sha512-avx: CFLAGS=-mavx2 -mavx -O3
+sha512-avx: sha512.c
+   $(CC) $(CFLAGS) $(EXTRA_CFLAGS) $< -o $@ $(LDFLAGS)
+
+run-sha512-avx: QEMU_OPTS+=-cpu max
+run-plugin-sha512-avx-with-%: QEMU_OPTS+=-cpu max
+
+TESTS+=sha512-sse sha512-avx
 
 test-avx.h: test-avx.py x86.csv
$(PYTHON) $(I386_SRC)/test-avx.py $(I386_SRC)/x86.csv $@
-- 
2.36.0




[PATCH v2 16/42] i386: Dot product AVX helper prep

2022-04-24 Thread Paul Brook
Make the dpps and dppd helpers AVX-ready

I can't see any obvious reason why dppd shouldn't work on 256 bit ymm
registers, but both AMD and Intel agree that it's xmm only.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h | 54 ---
 1 file changed, 46 insertions(+), 8 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index d308a1ec40..4137e6e1fa 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -2366,8 +2366,10 @@ SSE_HELPER_I(helper_blendps, L, 4, FBLENDP)
 SSE_HELPER_I(helper_blendpd, Q, 2, FBLENDP)
 SSE_HELPER_I(helper_pblendw, W, 8, FBLENDP)
 
-void glue(helper_dpps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, uint32_t mask)
+void glue(helper_dpps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
+   uint32_t mask)
 {
+Reg *v = d;
 float32 prod, iresult, iresult2;
 
 /*
@@ -2375,23 +2377,23 @@ void glue(helper_dpps, SUFFIX)(CPUX86State *env, Reg 
*d, Reg *s, uint32_t mask)
  * to correctly round the intermediate results
  */
 if (mask & (1 << 4)) {
-iresult = float32_mul(d->ZMM_S(0), s->ZMM_S(0), >sse_status);
+iresult = float32_mul(v->ZMM_S(0), s->ZMM_S(0), >sse_status);
 } else {
 iresult = float32_zero;
 }
 if (mask & (1 << 5)) {
-prod = float32_mul(d->ZMM_S(1), s->ZMM_S(1), >sse_status);
+prod = float32_mul(v->ZMM_S(1), s->ZMM_S(1), >sse_status);
 } else {
 prod = float32_zero;
 }
 iresult = float32_add(iresult, prod, >sse_status);
 if (mask & (1 << 6)) {
-iresult2 = float32_mul(d->ZMM_S(2), s->ZMM_S(2), >sse_status);
+iresult2 = float32_mul(v->ZMM_S(2), s->ZMM_S(2), >sse_status);
 } else {
 iresult2 = float32_zero;
 }
 if (mask & (1 << 7)) {
-prod = float32_mul(d->ZMM_S(3), s->ZMM_S(3), >sse_status);
+prod = float32_mul(v->ZMM_S(3), s->ZMM_S(3), >sse_status);
 } else {
 prod = float32_zero;
 }
@@ -2402,26 +2404,62 @@ void glue(helper_dpps, SUFFIX)(CPUX86State *env, Reg 
*d, Reg *s, uint32_t mask)
 d->ZMM_S(1) = (mask & (1 << 1)) ? iresult : float32_zero;
 d->ZMM_S(2) = (mask & (1 << 2)) ? iresult : float32_zero;
 d->ZMM_S(3) = (mask & (1 << 3)) ? iresult : float32_zero;
+#if SHIFT == 2
+if (mask & (1 << 4)) {
+iresult = float32_mul(v->ZMM_S(4), s->ZMM_S(4), >sse_status);
+} else {
+iresult = float32_zero;
+}
+if (mask & (1 << 5)) {
+prod = float32_mul(v->ZMM_S(5), s->ZMM_S(5), >sse_status);
+} else {
+prod = float32_zero;
+}
+iresult = float32_add(iresult, prod, >sse_status);
+if (mask & (1 << 6)) {
+iresult2 = float32_mul(v->ZMM_S(6), s->ZMM_S(6), >sse_status);
+} else {
+iresult2 = float32_zero;
+}
+if (mask & (1 << 7)) {
+prod = float32_mul(v->ZMM_S(7), s->ZMM_S(7), >sse_status);
+} else {
+prod = float32_zero;
+}
+iresult2 = float32_add(iresult2, prod, >sse_status);
+iresult = float32_add(iresult, iresult2, >sse_status);
+
+d->ZMM_S(4) = (mask & (1 << 0)) ? iresult : float32_zero;
+d->ZMM_S(5) = (mask & (1 << 1)) ? iresult : float32_zero;
+d->ZMM_S(6) = (mask & (1 << 2)) ? iresult : float32_zero;
+d->ZMM_S(7) = (mask & (1 << 3)) ? iresult : float32_zero;
+#endif
 }
 
-void glue(helper_dppd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, uint32_t mask)
+#if SHIFT == 1
+/* Oddly, there is no ymm version of dppd */
+void glue(helper_dppd, SUFFIX)(CPUX86State *env,
+   Reg *d, Reg *s, uint32_t mask)
 {
+Reg *v = d;
 float64 iresult;
 
 if (mask & (1 << 4)) {
-iresult = float64_mul(d->ZMM_D(0), s->ZMM_D(0), >sse_status);
+iresult = float64_mul(v->ZMM_D(0), s->ZMM_D(0), >sse_status);
 } else {
 iresult = float64_zero;
 }
+
 if (mask & (1 << 5)) {
 iresult = float64_add(iresult,
-  float64_mul(d->ZMM_D(1), s->ZMM_D(1),
+  float64_mul(v->ZMM_D(1), s->ZMM_D(1),
   >sse_status),
   >sse_status);
 }
 d->ZMM_D(0) = (mask & (1 << 0)) ? iresult : float64_zero;
 d->ZMM_D(1) = (mask & (1 << 1)) ? iresult : float64_zero;
 }
+#endif
 
 void glue(helper_mpsadbw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
   uint32_t offset)
-- 
2.36.0




[PATCH v2 34/42] i386: Implement VGATHER

2022-04-24 Thread Paul Brook
These are gather load instructions that need to introduce a new "Vector SIB"
encoding.  Also a bit of hair to handle different index sizes and scaling
factors, but overall the combinatorial explosion doesn't end up too bad.

The other thing of note is probably that these also modify the mask operand.
Thankfully the operands may not overlap, and we do not have to make the whole
thing appear atomic.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 65 +++
 target/i386/ops_sse_header.h | 16 
 target/i386/tcg/translate.c  | 74 
 3 files changed, 155 insertions(+)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index ffcba3d02c..14a2d1bf78 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -3288,6 +3288,71 @@ void glue(helper_vpmaskmovq, SUFFIX)(CPUX86State *env, 
Reg *d, Reg *v, Reg *s)
 #endif
 }
 
+#define VGATHER_HELPER(scale)   \
+void glue(helper_vpgatherdd ## scale, SUFFIX)(CPUX86State *env, \
+Reg *d, Reg *v, Reg *s, target_ulong a0)\
+{   \
+int i;  \
+for (i = 0; i < (2 << SHIFT); i++) {\
+if (v->L(i) >> 31) {\
+target_ulong addr = a0  \
++ ((target_ulong)(int32_t)s->L(i) << scale);\
+d->L(i) = cpu_ldl_data_ra(env, addr, GETPC());  \
+}   \
+v->L(i) = 0;\
+}   \
+}   \
+void glue(helper_vpgatherdq ## scale, SUFFIX)(CPUX86State *env, \
+Reg *d, Reg *v, Reg *s, target_ulong a0)\
+{   \
+int i;  \
+for (i = 0; i < (1 << SHIFT); i++) {\
+if (v->Q(i) >> 63) {\
+target_ulong addr = a0  \
++ ((target_ulong)(int32_t)s->L(i) << scale);\
+d->Q(i) = cpu_ldq_data_ra(env, addr, GETPC());  \
+}   \
+v->Q(i) = 0;\
+}   \
+}   \
+void glue(helper_vpgatherqd ## scale, SUFFIX)(CPUX86State *env, \
+Reg *d, Reg *v, Reg *s, target_ulong a0)\
+{   \
+int i;  \
+for (i = 0; i < (1 << SHIFT); i++) {\
+if (v->L(i) >> 31) {\
+target_ulong addr = a0  \
++ ((target_ulong)(int64_t)s->Q(i) << scale);\
+d->L(i) = cpu_ldl_data_ra(env, addr, GETPC());  \
+}   \
+v->L(i) = 0;\
+}   \
+d->Q(SHIFT) = 0;\
+v->Q(SHIFT) = 0;\
+YMM_ONLY(   \
+d->Q(3) = 0;\
+v->Q(3) = 0;\
+)   \
+}   \
+void glue(helper_vpgatherqq ## scale, SUFFIX)(CPUX86State *env, \
+Reg *d, Reg *v, Reg *s, target_ulong a0)\
+{   \
+int i;  \
+for (i = 0; i < (1 << SHIFT); i++) {\
+if (v->Q(i) >> 63) {\
+target_ulong addr = a0  \
++ ((target_ulong)(int64_t)s->Q(i) << scale);\
+d->Q(i) = cpu_ldq_data_ra(env, addr, GETPC());  \
+}   \
+v->Q(i) = 0;\
+}

[PATCH v2 23/42] i386: AVX comparison helpers

2022-04-24 Thread Paul Brook
AVX includes a more extensive set of comparison predicates,
some of which our softfloat implementation does not expose directly.
Rewrite the helpers in terms of floatN_compare

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 149 ---
 target/i386/ops_sse_header.h |  47 ---
 target/i386/tcg/translate.c  |  49 +---
 3 files changed, 177 insertions(+), 68 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 48cec40074..e48dfc2fc5 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -1394,57 +1394,112 @@ void glue(helper_addsubpd, SUFFIX)(CPUX86State *env, 
Reg *d, Reg *s)
 #endif
 }
 
-/* XXX: unordered */
-#define SSE_HELPER_CMP(name, F) \
-void glue(helper_ ## name ## ps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)\
-{   \
-d->ZMM_L(0) = F(32, d->ZMM_S(0), s->ZMM_S(0));  \
-d->ZMM_L(1) = F(32, d->ZMM_S(1), s->ZMM_S(1));  \
-d->ZMM_L(2) = F(32, d->ZMM_S(2), s->ZMM_S(2));  \
-d->ZMM_L(3) = F(32, d->ZMM_S(3), s->ZMM_S(3));  \
-}   \
-\
-void helper_ ## name ## ss(CPUX86State *env, Reg *d, Reg *s)\
-{   \
-d->ZMM_L(0) = F(32, d->ZMM_S(0), s->ZMM_S(0));  \
-}   \
-\
-void glue(helper_ ## name ## pd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)\
+#define SSE_HELPER_CMP_P(name, F, C)\
+void glue(helper_ ## name ## ps, SUFFIX)(CPUX86State *env,  \
+ Reg *d, Reg *s)\
 {   \
-d->ZMM_Q(0) = F(64, d->ZMM_D(0), s->ZMM_D(0));  \
-d->ZMM_Q(1) = F(64, d->ZMM_D(1), s->ZMM_D(1));  \
+Reg *v = d; \
+d->ZMM_L(0) = F(32, C, v->ZMM_S(0), s->ZMM_S(0));   \
+d->ZMM_L(1) = F(32, C, v->ZMM_S(1), s->ZMM_S(1));   \
+d->ZMM_L(2) = F(32, C, v->ZMM_S(2), s->ZMM_S(2));   \
+d->ZMM_L(3) = F(32, C, v->ZMM_S(3), s->ZMM_S(3));   \
+YMM_ONLY(   \
+d->ZMM_L(4) = F(32, C, v->ZMM_S(4), s->ZMM_S(4));   \
+d->ZMM_L(5) = F(32, C, v->ZMM_S(5), s->ZMM_S(5));   \
+d->ZMM_L(6) = F(32, C, v->ZMM_S(6), s->ZMM_S(6));   \
+d->ZMM_L(7) = F(32, C, v->ZMM_S(7), s->ZMM_S(7));   \
+)   \
 }   \
 \
-void helper_ ## name ## sd(CPUX86State *env, Reg *d, Reg *s)\
+void glue(helper_ ## name ## pd, SUFFIX)(CPUX86State *env,  \
+ Reg *d, Reg *s)\
 {   \
-d->ZMM_Q(0) = F(64, d->ZMM_D(0), s->ZMM_D(0));  \
-}
-
-#define FPU_CMPEQ(size, a, b)   \
-(float ## size ## _eq_quiet(a, b, >sse_status) ? -1 : 0)
-#define FPU_CMPLT(size, a, b)   \
-(float ## size ## _lt(a, b, >sse_status) ? -1 : 0)
-#define FPU_CMPLE(size, a, b)   \
-(float ## size ## _le(a, b, >sse_status) ? -1 : 0)
-#define FPU_CMPUNORD(size, a, b)\
-(float ## size ## _unordered_quiet(a, b, >sse_status) ? -1 : 0)
-#define FPU_CMPNEQ(size, a, b)  \
-(float ## size ## _eq_quiet(a, b, >sse_status) ? 0 : -1)
-#define FPU_CMPNLT(size, a, b)  \
-(float ## size ## _lt(a, b, >sse_status) ? 0 : -1)
-#define FPU_CMPNLE(size, a, b)  \
-(float ## size ## _le(a, b, >sse_status) ? 0 : -1)
-#define FPU_CMPORD(size, a, b)  \
-(float ## size ## _unordered_quiet(a, b, >sse_status) ? 0 : -1)
-
-SSE_HELPER_CMP(cmpeq, FPU_CMPEQ)
-SSE_HELPER_CMP(cmplt, FPU_CMPLT)
-SSE_HELPER_CMP(cmple, FPU_CMPLE)
-SSE_HELPER_CMP(cmpunord, FPU_CMPUNORD)
-SSE_HELPER_CMP(cmpneq, FPU_CMPNEQ)
-SSE_HELPER_CMP(cmpnlt, FPU_CMPNLT)
-SSE_HELPER_CMP(cmpnle, FPU_CMPNLE)

[PATCH v2 24/42] i386: Move 3DNOW decoder

2022-04-24 Thread Paul Brook
Handle 3DNOW instructions early to avoid complicating the AVX logic.

Signed-off-by: Paul Brook 
---
 target/i386/tcg/translate.c | 30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 64f026c0af..6c40df61d4 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3297,6 +3297,11 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 is_xmm = 1;
 }
 }
+if (sse_op.flags & SSE_OPF_3DNOW) {
+if (!(s->cpuid_ext2_features & CPUID_EXT2_3DNOW)) {
+goto illegal_op;
+}
+}
 /* simple MMX/SSE operation */
 if (s->flags & HF_TS_MASK) {
 gen_exception(s, EXCP07_PREX, pc_start - s->cs_base);
@@ -4761,21 +4766,20 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 rm = (modrm & 7);
 op2_offset = offsetof(CPUX86State,fpregs[rm].mmx);
 }
+if (sse_op.flags & SSE_OPF_3DNOW) {
+/* 3DNow! data insns */
+val = x86_ldub_code(env, s);
+SSEFunc_0_epp op_3dnow = sse_op_table5[val];
+if (!op_3dnow) {
+goto unknown_op;
+}
+tcg_gen_addi_ptr(s->ptr0, cpu_env, op1_offset);
+tcg_gen_addi_ptr(s->ptr1, cpu_env, op2_offset);
+op_3dnow(cpu_env, s->ptr0, s->ptr1);
+return;
+}
 }
 switch(b) {
-case 0x0f: /* 3DNow! data insns */
-val = x86_ldub_code(env, s);
-sse_fn_epp = sse_op_table5[val];
-if (!sse_fn_epp) {
-goto unknown_op;
-}
-if (!(s->cpuid_ext2_features & CPUID_EXT2_3DNOW)) {
-goto illegal_op;
-}
-tcg_gen_addi_ptr(s->ptr0, cpu_env, op1_offset);
-tcg_gen_addi_ptr(s->ptr1, cpu_env, op2_offset);
-sse_fn_epp(cpu_env, s->ptr0, s->ptr1);
-break;
 case 0x70: /* pshufx insn */
 case 0xc6: /* pshufx insn */
 val = x86_ldub_code(env, s);
-- 
2.36.0




[PATCH v2 22/42] i386: Update ops_sse_header.h ready for 256 bit AVX

2022-04-24 Thread Paul Brook
Update ops_sse_header.h ready for 256 bit AVX helpers

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse_header.h | 67 +---
 1 file changed, 40 insertions(+), 27 deletions(-)

diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index 203afbb5a1..63b63eb532 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -105,7 +105,7 @@ SSE_HELPER_L(pcmpeql, FCMPEQ)
 
 SSE_HELPER_W(pmullw, FMULLW)
 #if SHIFT == 0
-DEF_HELPER_3(glue(pmulhrw, SUFFIX), FMULHRW)
+DEF_HELPER_3(glue(pmulhrw, SUFFIX), void, env, Reg, Reg)
 #endif
 SSE_HELPER_W(pmulhuw, FMULHUW)
 SSE_HELPER_W(pmulhw, FMULHW)
@@ -137,23 +137,39 @@ DEF_HELPER_3(glue(pshufhw, SUFFIX), void, Reg, Reg, int)
 /* FPU ops */
 /* XXX: not accurate */
 
-DEF_HELPER_3(glue(shufps, SUFFIX), void, Reg, Reg, int)
-DEF_HELPER_3(glue(shufpd, SUFFIX), void, Reg, Reg, int)
+#define SSE_HELPER_P4(name) \
+DEF_HELPER_3(glue(name ## ps, SUFFIX), void, env, Reg, Reg) \
+DEF_HELPER_3(glue(name ## pd, SUFFIX), void, env, Reg, Reg)
+
+#define SSE_HELPER_P3(name, ...)\
+DEF_HELPER_3(glue(name ## ps, SUFFIX), void, env, Reg, Reg) \
+DEF_HELPER_3(glue(name ## pd, SUFFIX), void, env, Reg, Reg)
 
-#define SSE_HELPER_S(name, F)\
-DEF_HELPER_3(glue(name ## ps, SUFFIX), void, env, Reg, Reg)\
-DEF_HELPER_3(name ## ss, void, env, Reg, Reg)\
-DEF_HELPER_3(glue(name ## pd, SUFFIX), void, env, Reg, Reg)\
+#if SHIFT == 1
+#define SSE_HELPER_S4(name) \
+SSE_HELPER_P4(name) \
+DEF_HELPER_3(name ## ss, void, env, Reg, Reg)   \
 DEF_HELPER_3(name ## sd, void, env, Reg, Reg)
+#define SSE_HELPER_S3(name) \
+SSE_HELPER_P3(name) \
+DEF_HELPER_3(name ## ss, void, env, Reg, Reg)   \
+DEF_HELPER_3(name ## sd, void, env, Reg, Reg)
+#else
+#define SSE_HELPER_S4(name, ...) SSE_HELPER_P4(name)
+#define SSE_HELPER_S3(name, ...) SSE_HELPER_P3(name)
+#endif
+
+DEF_HELPER_3(glue(shufps, SUFFIX), void, Reg, Reg, int)
+DEF_HELPER_3(glue(shufpd, SUFFIX), void, Reg, Reg, int)
 
-SSE_HELPER_S(add, FPU_ADD)
-SSE_HELPER_S(sub, FPU_SUB)
-SSE_HELPER_S(mul, FPU_MUL)
-SSE_HELPER_S(div, FPU_DIV)
-SSE_HELPER_S(min, FPU_MIN)
-SSE_HELPER_S(max, FPU_MAX)
-SSE_HELPER_S(sqrt, FPU_SQRT)
+SSE_HELPER_S4(add)
+SSE_HELPER_S4(sub)
+SSE_HELPER_S4(mul)
+SSE_HELPER_S4(div)
+SSE_HELPER_S4(min)
+SSE_HELPER_S4(max)
 
+SSE_HELPER_S3(sqrt)
 
 DEF_HELPER_3(glue(cvtps2pd, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(cvtpd2ps, SUFFIX), void, env, Reg, Reg)
@@ -208,18 +224,12 @@ DEF_HELPER_4(extrq_i, void, env, ZMMReg, int, int)
 DEF_HELPER_3(insertq_r, void, env, ZMMReg, ZMMReg)
 DEF_HELPER_4(insertq_i, void, env, ZMMReg, int, int)
 #endif
-DEF_HELPER_3(glue(haddps, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(glue(haddpd, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(glue(hsubps, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(glue(hsubpd, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(glue(addsubps, SUFFIX), void, env, ZMMReg, ZMMReg)
-DEF_HELPER_3(glue(addsubpd, SUFFIX), void, env, ZMMReg, ZMMReg)
-
-#define SSE_HELPER_CMP(name, F)   \
-DEF_HELPER_3(glue(name ## ps, SUFFIX), void, env, Reg, Reg) \
-DEF_HELPER_3(name ## ss, void, env, Reg, Reg) \
-DEF_HELPER_3(glue(name ## pd, SUFFIX), void, env, Reg, Reg) \
-DEF_HELPER_3(name ## sd, void, env, Reg, Reg)
+
+SSE_HELPER_P4(hadd)
+SSE_HELPER_P4(hsub)
+SSE_HELPER_P4(addsub)
+
+#define SSE_HELPER_CMP(name, F) SSE_HELPER_S4(name)
 
 SSE_HELPER_CMP(cmpeq, FPU_CMPEQ)
 SSE_HELPER_CMP(cmplt, FPU_CMPLT)
@@ -381,6 +391,9 @@ DEF_HELPER_4(glue(pclmulqdq, SUFFIX), void, env, Reg, Reg, 
i32)
 #undef SSE_HELPER_W
 #undef SSE_HELPER_L
 #undef SSE_HELPER_Q
-#undef SSE_HELPER_S
+#undef SSE_HELPER_S3
+#undef SSE_HELPER_S4
+#undef SSE_HELPER_P3
+#undef SSE_HELPER_P4
 #undef SSE_HELPER_CMP
 #undef UNPCK_OP
-- 
2.36.0




[PATCH v2 18/42] i386: Misc AVX helper prep

2022-04-24 Thread Paul Brook
Fixup various vector helpers that either trivially extend to 256 bit,
or don't have 256 bit variants.

No functional changes to existing helpers

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h | 159 --
 1 file changed, 139 insertions(+), 20 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index d128af6cc8..3202c00572 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -641,6 +641,7 @@ void glue(helper_psadbw, SUFFIX)(CPUX86State *env, Reg *d, 
Reg *s)
 #endif
 }
 
+#if SHIFT < 2
 void glue(helper_maskmov, SUFFIX)(CPUX86State *env, Reg *d, Reg *s,
   target_ulong a0)
 {
@@ -652,6 +653,7 @@ void glue(helper_maskmov, SUFFIX)(CPUX86State *env, Reg *d, 
Reg *s,
 }
 }
 }
+#endif
 
 void glue(helper_movl_mm_T0, SUFFIX)(Reg *d, uint32_t val)
 {
@@ -882,6 +884,13 @@ void glue(helper_cvtps2pd, SUFFIX)(CPUX86State *env, Reg 
*d, Reg *s)
 
 s0 = s->ZMM_S(0);
 s1 = s->ZMM_S(1);
+#if SHIFT == 2
+float32 s2, s3;
+s2 = s->ZMM_S(2);
+s3 = s->ZMM_S(3);
+d->ZMM_D(2) = float32_to_float64(s2, >sse_status);
+d->ZMM_D(3) = float32_to_float64(s3, >sse_status);
+#endif
 d->ZMM_D(0) = float32_to_float64(s0, >sse_status);
 d->ZMM_D(1) = float32_to_float64(s1, >sse_status);
 }
@@ -890,9 +899,17 @@ void glue(helper_cvtpd2ps, SUFFIX)(CPUX86State *env, Reg 
*d, Reg *s)
 {
 d->ZMM_S(0) = float64_to_float32(s->ZMM_D(0), >sse_status);
 d->ZMM_S(1) = float64_to_float32(s->ZMM_D(1), >sse_status);
+#if SHIFT == 2
+d->ZMM_S(2) = float64_to_float32(s->ZMM_D(2), >sse_status);
+d->ZMM_S(3) = float64_to_float32(s->ZMM_D(3), >sse_status);
+d->Q(2) = 0;
+d->Q(3) = 0;
+#else
 d->Q(1) = 0;
+#endif
 }
 
+#if SHIFT == 1
 void helper_cvtss2sd(CPUX86State *env, Reg *d, Reg *s)
 {
 d->ZMM_D(0) = float32_to_float64(s->ZMM_S(0), >sse_status);
@@ -902,6 +919,7 @@ void helper_cvtsd2ss(CPUX86State *env, Reg *d, Reg *s)
 {
 d->ZMM_S(0) = float64_to_float32(s->ZMM_D(0), >sse_status);
 }
+#endif
 
 /* integer to float */
 void glue(helper_cvtdq2ps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
@@ -910,6 +928,12 @@ void glue(helper_cvtdq2ps, SUFFIX)(CPUX86State *env, Reg 
*d, Reg *s)
 d->ZMM_S(1) = int32_to_float32(s->ZMM_L(1), >sse_status);
 d->ZMM_S(2) = int32_to_float32(s->ZMM_L(2), >sse_status);
 d->ZMM_S(3) = int32_to_float32(s->ZMM_L(3), >sse_status);
+#if SHIFT == 2
+d->ZMM_S(4) = int32_to_float32(s->ZMM_L(4), >sse_status);
+d->ZMM_S(5) = int32_to_float32(s->ZMM_L(5), >sse_status);
+d->ZMM_S(6) = int32_to_float32(s->ZMM_L(6), >sse_status);
+d->ZMM_S(7) = int32_to_float32(s->ZMM_L(7), >sse_status);
+#endif
 }
 
 void glue(helper_cvtdq2pd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
@@ -918,10 +942,18 @@ void glue(helper_cvtdq2pd, SUFFIX)(CPUX86State *env, Reg 
*d, Reg *s)
 
 l0 = (int32_t)s->ZMM_L(0);
 l1 = (int32_t)s->ZMM_L(1);
+#if SHIFT == 2
+int32_t l2, l3;
+l2 = (int32_t)s->ZMM_L(2);
+l3 = (int32_t)s->ZMM_L(3);
+d->ZMM_D(2) = int32_to_float64(l2, >sse_status);
+d->ZMM_D(3) = int32_to_float64(l3, >sse_status);
+#endif
 d->ZMM_D(0) = int32_to_float64(l0, >sse_status);
 d->ZMM_D(1) = int32_to_float64(l1, >sse_status);
 }
 
+#if SHIFT == 1
 void helper_cvtpi2ps(CPUX86State *env, ZMMReg *d, MMXReg *s)
 {
 d->ZMM_S(0) = int32_to_float32(s->MMX_L(0), >sse_status);
@@ -956,8 +988,11 @@ void helper_cvtsq2sd(CPUX86State *env, ZMMReg *d, uint64_t 
val)
 }
 #endif
 
+#endif
+
 /* float to integer */
 
+#if SHIFT == 1
 /*
  * x86 mandates that we return the indefinite integer value for the result
  * of any float-to-integer conversion that raises the 'invalid' exception.
@@ -988,6 +1023,7 @@ WRAP_FLOATCONV(int64_t, float32_to_int64, float32, 
INT64_MIN)
 WRAP_FLOATCONV(int64_t, float32_to_int64_round_to_zero, float32, INT64_MIN)
 WRAP_FLOATCONV(int64_t, float64_to_int64, float64, INT64_MIN)
 WRAP_FLOATCONV(int64_t, float64_to_int64_round_to_zero, float64, INT64_MIN)
+#endif
 
 void glue(helper_cvtps2dq, SUFFIX)(CPUX86State *env, ZMMReg *d, ZMMReg *s)
 {
@@ -995,15 +1031,29 @@ void glue(helper_cvtps2dq, SUFFIX)(CPUX86State *env, 
ZMMReg *d, ZMMReg *s)
 d->ZMM_L(1) = x86_float32_to_int32(s->ZMM_S(1), >sse_status);
 d->ZMM_L(2) = x86_float32_to_int32(s->ZMM_S(2), >sse_status);
 d->ZMM_L(3) = x86_float32_to_int32(s->ZMM_S(3), >sse_status);
+#if SHIFT == 2
+d->ZMM_L(4) = x86_float32_to_int32(s->ZMM_S(4), >sse_status);
+d->ZMM_L(5) = x86_float32_to_int32(s->ZMM_S(5), >sse_status);
+d->ZMM_L(6) = x86_float32_to_int32(s->ZMM_S(6), >sse_status);
+d->ZMM_L(7) = x86_float32_to_int32(s->ZMM_S(7), >sse_status);
+#endif
 }
 
 void glue(helper_cvtpd2dq, SUFFIX)(CPUX86State *env, ZMMReg *d, ZMMReg *s)
 {
 d->ZMM_L(0) = x86_float64_to_int32(s->ZMM_D(0), >sse_status);
 d->ZMM_L(1) = x86_float64_to_int32(s->ZMM_D(1), >sse_status);
+#if SHIFT == 2
+d->ZMM_L(2) = 

[PATCH v2 28/42] i386: Implement VZEROALL and VZEROUPPER

2022-04-24 Thread Paul Brook
They use the same opcode as EMMS, which I guess makes some sort of sense.
Fairly straightforward other than that.

If we were wanting to optimize out gen_clear_ymmh then this would be one of
the starting points.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 48 
 target/i386/ops_sse_header.h |  9 +++
 target/i386/tcg/translate.c  | 26 ---
 3 files changed, 80 insertions(+), 3 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index ad3312d353..a1f50f0c8b 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -3071,6 +3071,54 @@ void glue(helper_aeskeygenassist, SUFFIX)(CPUX86State 
*env, Reg *d, Reg *s,
 #endif
 #endif
 
+#if SHIFT == 2
+void helper_vzeroall(CPUX86State *env)
+{
+int i;
+
+for (i = 0; i < 8; i++) {
+env->xmm_regs[i].ZMM_Q(0) = 0;
+env->xmm_regs[i].ZMM_Q(1) = 0;
+env->xmm_regs[i].ZMM_Q(2) = 0;
+env->xmm_regs[i].ZMM_Q(3) = 0;
+}
+}
+
+void helper_vzeroupper(CPUX86State *env)
+{
+int i;
+
+for (i = 0; i < 8; i++) {
+env->xmm_regs[i].ZMM_Q(2) = 0;
+env->xmm_regs[i].ZMM_Q(3) = 0;
+}
+}
+
+#ifdef TARGET_X86_64
+void helper_vzeroall_hi8(CPUX86State *env)
+{
+int i;
+
+for (i = 8; i < 16; i++) {
+env->xmm_regs[i].ZMM_Q(0) = 0;
+env->xmm_regs[i].ZMM_Q(1) = 0;
+env->xmm_regs[i].ZMM_Q(2) = 0;
+env->xmm_regs[i].ZMM_Q(3) = 0;
+}
+}
+
+void helper_vzeroupper_hi8(CPUX86State *env)
+{
+int i;
+
+for (i = 8; i < 16; i++) {
+env->xmm_regs[i].ZMM_Q(2) = 0;
+env->xmm_regs[i].ZMM_Q(3) = 0;
+}
+}
+#endif
+#endif
+
 #undef SSE_HELPER_S
 
 #undef SHIFT
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index cfcfba154b..48f0945917 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -411,6 +411,15 @@ DEF_HELPER_4(glue(aeskeygenassist, SUFFIX), void, env, 
Reg, Reg, i32)
 DEF_HELPER_5(glue(pclmulqdq, SUFFIX), void, env, Reg, Reg, Reg, i32)
 #endif
 
+#if SHIFT == 2
+DEF_HELPER_1(vzeroall, void, env)
+DEF_HELPER_1(vzeroupper, void, env)
+#ifdef TARGET_X86_64
+DEF_HELPER_1(vzeroall_hi8, void, env)
+DEF_HELPER_1(vzeroupper_hi8, void, env)
+#endif
+#endif
+
 #undef SHIFT
 #undef Reg
 #undef SUFFIX
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index bcd6d47fd0..ba70aeb039 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3455,9 +3455,29 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 return;
 }
 if (b == 0x77) {
-/* emms */
-gen_helper_emms(cpu_env);
-return;
+if (s->prefix & PREFIX_VEX) {
+CHECK_AVX(s);
+if (s->vex_l) {
+gen_helper_vzeroall(cpu_env);
+#ifdef TARGET_X86_64
+if (CODE64(s)) {
+gen_helper_vzeroall_hi8(cpu_env);
+}
+#endif
+} else {
+gen_helper_vzeroupper(cpu_env);
+#ifdef TARGET_X86_64
+if (CODE64(s)) {
+gen_helper_vzeroupper_hi8(cpu_env);
+}
+#endif
+}
+return;
+} else {
+/* emms */
+gen_helper_emms(cpu_env);
+return;
+}
 }
 /* prepare MMX state (XXX: optimize by storing fptt and fptags in
the static cpu state) */
-- 
2.36.0




[PATCH v2 13/42] i386: Destructive vector helpers for AVX

2022-04-24 Thread Paul Brook
These helpers need to take special care to avoid overwriting source values
before the whole result has been calculated.  Currently they use a dummy
Reg typed variable to store the result then assign the whole register.
This will cause 128 bit operations to corrupt the upper half of the register,
so replace it with explicit temporaries and element assignments.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h | 707 ++
 1 file changed, 437 insertions(+), 270 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index d0424140d9..c645d2ddbf 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -680,71 +680,85 @@ void glue(helper_movq_mm_T0, SUFFIX)(Reg *d, uint64_t val)
 }
 #endif
 
+#define SHUFFLE4(F, a, b, offset) do {  \
+r0 = a->F((order & 3) + offset);\
+r1 = a->F(((order >> 2) & 3) + offset); \
+r2 = b->F(((order >> 4) & 3) + offset); \
+r3 = b->F(((order >> 6) & 3) + offset); \
+d->F(offset) = r0;  \
+d->F(offset + 1) = r1;  \
+d->F(offset + 2) = r2;  \
+d->F(offset + 3) = r3;  \
+} while (0)
+
 #if SHIFT == 0
 void glue(helper_pshufw, SUFFIX)(Reg *d, Reg *s, int order)
 {
-Reg r;
+uint16_t r0, r1, r2, r3;
 
-r.W(0) = s->W(order & 3);
-r.W(1) = s->W((order >> 2) & 3);
-r.W(2) = s->W((order >> 4) & 3);
-r.W(3) = s->W((order >> 6) & 3);
-MOVE(*d, r);
+SHUFFLE4(W, s, s, 0);
 }
 #else
 void helper_shufps(Reg *d, Reg *s, int order)
 {
-Reg r;
+Reg *v = d;
+uint32_t r0, r1, r2, r3;
 
-r.L(0) = d->L(order & 3);
-r.L(1) = d->L((order >> 2) & 3);
-r.L(2) = s->L((order >> 4) & 3);
-r.L(3) = s->L((order >> 6) & 3);
-MOVE(*d, r);
+SHUFFLE4(L, v, s, 0);
+#if SHIFT == 2
+SHUFFLE4(L, v, s, 4);
+#endif
 }
 
 void helper_shufpd(Reg *d, Reg *s, int order)
 {
-Reg r;
+Reg *v = d;
+uint64_t r0, r1;
 
-r.Q(0) = d->Q(order & 1);
-r.Q(1) = s->Q((order >> 1) & 1);
-MOVE(*d, r);
+r0 = v->Q(order & 1);
+r1 = s->Q((order >> 1) & 1);
+d->Q(0) = r0;
+d->Q(1) = r1;
+#if SHIFT == 2
+r0 = v->Q(((order >> 2) & 1) + 2);
+r1 = s->Q(((order >> 3) & 1) + 2);
+d->Q(2) = r0;
+d->Q(3) = r1;
+#endif
 }
 
 void glue(helper_pshufd, SUFFIX)(Reg *d, Reg *s, int order)
 {
-Reg r;
+uint32_t r0, r1, r2, r3;
 
-r.L(0) = s->L(order & 3);
-r.L(1) = s->L((order >> 2) & 3);
-r.L(2) = s->L((order >> 4) & 3);
-r.L(3) = s->L((order >> 6) & 3);
-MOVE(*d, r);
+SHUFFLE4(L, s, s, 0);
+#if SHIFT ==  2
+SHUFFLE4(L, s, s, 4);
+#endif
 }
 
 void glue(helper_pshuflw, SUFFIX)(Reg *d, Reg *s, int order)
 {
-Reg r;
+uint16_t r0, r1, r2, r3;
 
-r.W(0) = s->W(order & 3);
-r.W(1) = s->W((order >> 2) & 3);
-r.W(2) = s->W((order >> 4) & 3);
-r.W(3) = s->W((order >> 6) & 3);
-r.Q(1) = s->Q(1);
-MOVE(*d, r);
+SHUFFLE4(W, s, s, 0);
+d->Q(1) = s->Q(1);
+#if SHIFT == 2
+SHUFFLE4(W, s, s, 8);
+d->Q(3) = s->Q(3);
+#endif
 }
 
 void glue(helper_pshufhw, SUFFIX)(Reg *d, Reg *s, int order)
 {
-Reg r;
+uint16_t r0, r1, r2, r3;
 
-r.Q(0) = s->Q(0);
-r.W(4) = s->W(4 + (order & 3));
-r.W(5) = s->W(4 + ((order >> 2) & 3));
-r.W(6) = s->W(4 + ((order >> 4) & 3));
-r.W(7) = s->W(4 + ((order >> 6) & 3));
-MOVE(*d, r);
+d->Q(0) = s->Q(0);
+SHUFFLE4(W, s, s, 4);
+#if SHIFT == 2
+d->Q(2) = s->Q(2);
+SHUFFLE4(W, s, s, 12);
+#endif
 }
 #endif
 
@@ -1320,156 +1334,190 @@ uint32_t glue(helper_pmovmskb, SUFFIX)(CPUX86State 
*env, Reg *s)
 return val;
 }
 
-void glue(helper_packsswb, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
-{
-Reg r;
-
-r.B(0) = satsb((int16_t)d->W(0));
-r.B(1) = satsb((int16_t)d->W(1));
-r.B(2) = satsb((int16_t)d->W(2));
-r.B(3) = satsb((int16_t)d->W(3));
-#if SHIFT == 1
-r.B(4) = satsb((int16_t)d->W(4));
-r.B(5) = satsb((int16_t)d->W(5));
-r.B(6) = satsb((int16_t)d->W(6));
-r.B(7) = satsb((int16_t)d->W(7));
-#endif
-r.B((4 << SHIFT) + 0) = satsb((int16_t)s->W(0));
-r.B((4 << SHIFT) + 1) = satsb((int16_t)s->W(1));
-r.B((4 << SHIFT) + 2) = satsb((int16_t)s->W(2));
-r.B((4 << SHIFT) + 3) = satsb((int16_t)s->W(3));
-#if SHIFT == 1
-r.B(12) = satsb((int16_t)s->W(4));
-r.B(13) = satsb((int16_t)s->W(5));
-r.B(14) = satsb((int16_t)s->W(6));
-r.B(15) = satsb((int16_t)s->W(7));
-#endif
-MOVE(*d, r);
-}
-
-void glue(helper_packuswb, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
-{
-Reg r;
-
-r.B(0) = satub((int16_t)d->W(0));
-r.B(1) = satub((int16_t)d->W(1));
-r.B(2) = satub((int16_t)d->W(2));
-r.B(3) = satub((int16_t)d->W(3));
-#if SHIFT == 1
-r.B(4) = satub((int16_t)d->W(4));
-r.B(5) = satub((int16_t)d->W(5));
-r.B(6) = satub((int16_t)d->W(6));
-r.B(7) = satub((int16_t)d->W(7));
-#endif
-r.B((4 << SHIFT) + 0) = 

[PATCH v2 40/42] Enable all x86-64 cpu features in user mode

2022-04-24 Thread Paul Brook
We don't have any migration concerns for usermode emulation, so we may
as well enable all available CPU features by default.

Signed-off-by: Paul Brook 
---
 linux-user/x86_64/target_elf.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/linux-user/x86_64/target_elf.h b/linux-user/x86_64/target_elf.h
index 7b76a90de8..3f628f8d66 100644
--- a/linux-user/x86_64/target_elf.h
+++ b/linux-user/x86_64/target_elf.h
@@ -9,6 +9,6 @@
 #define X86_64_TARGET_ELF_H
 static inline const char *cpu_get_model(uint32_t eflags)
 {
-return "qemu64";
+return "max";
 }
 #endif
-- 
2.36.0




[PATCH v2 29/42] i386: Implement VBROADCAST

2022-04-24 Thread Paul Brook
The catch here is that these are whole vector operations (not independent 128
bit lanes). We abuse the SSE_OPF_SCALAR flag to select the memory operand
width appropriately.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 51 
 target/i386/ops_sse_header.h |  8 ++
 target/i386/tcg/translate.c  | 42 -
 3 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index a1f50f0c8b..4115c9a257 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -3071,7 +3071,57 @@ void glue(helper_aeskeygenassist, SUFFIX)(CPUX86State 
*env, Reg *d, Reg *s,
 #endif
 #endif
 
+#if SHIFT >= 1
+void glue(helper_vbroadcastb, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+{
+uint8_t val = s->B(0);
+int i;
+
+for (i = 0; i < 16 * SHIFT; i++) {
+d->B(i) = val;
+}
+}
+
+void glue(helper_vbroadcastw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+{
+uint16_t val = s->W(0);
+int i;
+
+for (i = 0; i < 8 * SHIFT; i++) {
+d->W(i) = val;
+}
+}
+
+void glue(helper_vbroadcastl, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+{
+uint32_t val = s->L(0);
+int i;
+
+for (i = 0; i < 8 * SHIFT; i++) {
+d->L(i) = val;
+}
+}
+
+void glue(helper_vbroadcastq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+{
+uint64_t val = s->Q(0);
+d->Q(0) = val;
+d->Q(1) = val;
 #if SHIFT == 2
+d->Q(2) = val;
+d->Q(3) = val;
+#endif
+}
+
+#if SHIFT == 2
+void glue(helper_vbroadcastdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+{
+d->Q(0) = s->Q(0);
+d->Q(1) = s->Q(1);
+d->Q(2) = s->Q(0);
+d->Q(3) = s->Q(1);
+}
+
 void helper_vzeroall(CPUX86State *env)
 {
 int i;
@@ -3118,6 +3168,7 @@ void helper_vzeroupper_hi8(CPUX86State *env)
 }
 #endif
 #endif
+#endif
 
 #undef SSE_HELPER_S
 
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index 48f0945917..51e02cd4fa 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -411,7 +411,14 @@ DEF_HELPER_4(glue(aeskeygenassist, SUFFIX), void, env, 
Reg, Reg, i32)
 DEF_HELPER_5(glue(pclmulqdq, SUFFIX), void, env, Reg, Reg, Reg, i32)
 #endif
 
+/* AVX helpers */
+#if SHIFT >= 1
+DEF_HELPER_3(glue(vbroadcastb, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_3(glue(vbroadcastw, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_3(glue(vbroadcastl, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_3(glue(vbroadcastq, SUFFIX), void, env, Reg, Reg)
 #if SHIFT == 2
+DEF_HELPER_3(glue(vbroadcastdq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_1(vzeroall, void, env)
 DEF_HELPER_1(vzeroupper, void, env)
 #ifdef TARGET_X86_64
@@ -419,6 +426,7 @@ DEF_HELPER_1(vzeroall_hi8, void, env)
 DEF_HELPER_1(vzeroupper_hi8, void, env)
 #endif
 #endif
+#endif
 
 #undef SHIFT
 #undef Reg
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index ba70aeb039..59ab1dc562 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3255,6 +3255,11 @@ static const struct SSEOpHelper_table6 
sse_op_table6[256] = {
 [0x14] = BLENDV_OP(blendvps, SSE41, 0),
 [0x15] = BLENDV_OP(blendvpd, SSE41, 0),
 [0x17] = CMP_OP(ptest, SSE41),
+/* TODO:Some vbroadcast variants require AVX2 */
+[0x18] = UNARY_OP(vbroadcastl, AVX, SSE_OPF_SCALAR), /* vbroadcastss */
+[0x19] = UNARY_OP(vbroadcastq, AVX, SSE_OPF_SCALAR), /* vbroadcastsd */
+#define gen_helper_vbroadcastdq_xmm NULL
+[0x1a] = UNARY_OP(vbroadcastdq, AVX, SSE_OPF_SCALAR), /* vbroadcastf128 */
 [0x1c] = UNARY_OP_MMX(pabsb, SSSE3),
 [0x1d] = UNARY_OP_MMX(pabsw, SSSE3),
 [0x1e] = UNARY_OP_MMX(pabsd, SSSE3),
@@ -3286,6 +3291,16 @@ static const struct SSEOpHelper_table6 
sse_op_table6[256] = {
 [0x40] = BINARY_OP(pmulld, SSE41, SSE_OPF_MMX),
 #define gen_helper_phminposuw_ymm NULL
 [0x41] = UNARY_OP(phminposuw, SSE41, 0),
+/* vpbroadcastd */
+[0x58] = UNARY_OP(vbroadcastl, AVX, SSE_OPF_SCALAR | SSE_OPF_MMX),
+/* vpbroadcastq */
+[0x59] = UNARY_OP(vbroadcastq, AVX, SSE_OPF_SCALAR | SSE_OPF_MMX),
+/* vbroadcasti128 */
+[0x5a] = UNARY_OP(vbroadcastdq, AVX, SSE_OPF_SCALAR | SSE_OPF_MMX),
+/* vpbroadcastb */
+[0x78] = UNARY_OP(vbroadcastb, AVX, SSE_OPF_SCALAR | SSE_OPF_MMX),
+/* vpbroadcastw */
+[0x79] = UNARY_OP(vbroadcastw, AVX, SSE_OPF_SCALAR | SSE_OPF_MMX),
 #define gen_helper_aesimc_ymm NULL
 [0xdb] = UNARY_OP(aesimc, AES, 0),
 [0xdc] = BINARY_OP(aesenc, AES, 0),
@@ -4323,6 +4338,24 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 op2_offset = offsetof(CPUX86State, xmm_t0);
 gen_lea_modrm(env, s, modrm);
 switch (b) {
+case 0x78: /* vpbroadcastb */
+size = 8;
+break;
+case 0x79: /* vpbroadcastw */
+size = 16;
+break;
+  

[PATCH v2 11/42] i386: Rewrite simple integer vector helpers

2022-04-24 Thread Paul Brook
Rewrite the "simple" vector integer helpers in preparation for AVX support.

While the current code is able to use the same prototype for unary
(a = F(b)) and binary (a = F(b, c)) operations, future changes will cause
them to diverge.

No functional changes to existing helpers

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h | 180 --
 1 file changed, 137 insertions(+), 43 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 9297c96d04..bb9cbf9ead 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -275,61 +275,148 @@ void glue(helper_pslldq, SUFFIX)(CPUX86State *env, Reg 
*d, Reg *c)
 }
 #endif
 
-#define SSE_HELPER_B(name, F)   \
+#define SSE_HELPER_1(name, elem, num, F)   \
 void glue(name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)   \
 {   \
-d->B(0) = F(d->B(0), s->B(0));  \
-d->B(1) = F(d->B(1), s->B(1));  \
-d->B(2) = F(d->B(2), s->B(2));  \
-d->B(3) = F(d->B(3), s->B(3));  \
-d->B(4) = F(d->B(4), s->B(4));  \
-d->B(5) = F(d->B(5), s->B(5));  \
-d->B(6) = F(d->B(6), s->B(6));  \
-d->B(7) = F(d->B(7), s->B(7));  \
+d->elem(0) = F(s->elem(0)); \
+d->elem(1) = F(s->elem(1)); \
+if ((num << SHIFT) > 2) {   \
+d->elem(2) = F(s->elem(2)); \
+d->elem(3) = F(s->elem(3)); \
+}   \
+if ((num << SHIFT) > 4) {   \
+d->elem(4) = F(s->elem(4)); \
+d->elem(5) = F(s->elem(5)); \
+d->elem(6) = F(s->elem(6)); \
+d->elem(7) = F(s->elem(7)); \
+}   \
+if ((num << SHIFT) > 8) {   \
+d->elem(8) = F(s->elem(8)); \
+d->elem(9) = F(s->elem(9)); \
+d->elem(10) = F(s->elem(10));   \
+d->elem(11) = F(s->elem(11));   \
+d->elem(12) = F(s->elem(12));   \
+d->elem(13) = F(s->elem(13));   \
+d->elem(14) = F(s->elem(14));   \
+d->elem(15) = F(s->elem(15));   \
+}   \
+if ((num << SHIFT) > 16) {  \
+d->elem(16) = F(s->elem(16));   \
+d->elem(17) = F(s->elem(17));   \
+d->elem(18) = F(s->elem(18));   \
+d->elem(19) = F(s->elem(19));   \
+d->elem(20) = F(s->elem(20));   \
+d->elem(21) = F(s->elem(21));   \
+d->elem(22) = F(s->elem(22));   \
+d->elem(23) = F(s->elem(23));   \
+d->elem(24) = F(s->elem(24));   \
+d->elem(25) = F(s->elem(25));   \
+d->elem(26) = F(s->elem(26));   \
+d->elem(27) = F(s->elem(27));   \
+d->elem(28) = F(s->elem(28));   \
+d->elem(29) = F(s->elem(29));   \
+d->elem(30) = F(s->elem(30));   \
+d->elem(31) = F(s->elem(31));   \
+}   \
+}
+
+#define SSE_HELPER_B(name, F)   \
+void glue(name, SUFFIX)(CPUX86State *env, Reg *d, Reg *s) \
+{   \
+Reg *v = d; \
+d->B(0) = F(v->B(0), s->B(0));  \
+d->B(1) = F(v->B(1), s->B(1));  \
+d->B(2) = F(v->B(2), s->B(2));  \
+d->B(3) = F(v->B(3), s->B(3));  \
+d->B(4) = F(v->B(4), s->B(4));  \
+d->B(5) = F(v->B(5), s->B(5));  \
+d->B(6) = F(v->B(6), s->B(6));  \
+d->B(7) = F(v->B(7), s->B(7));  \
   

[PATCH v2 38/42] i386: Implement VPBLENDD

2022-04-24 Thread Paul Brook
This is semantically equivalent to VBLENDPS.

Signed-off-by: Paul Brook 
---
 target/i386/tcg/translate.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 95ecdea8fe..73f3842c36 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3353,6 +3353,7 @@ static const struct SSEOpHelper_table7 sse_op_table7[256] 
= {
 #define gen_helper_vpermq_xmm NULL
 [0x00] = UNARY_OP(vpermq, AVX, SSE_OPF_AVX2),
 [0x01] = UNARY_OP(vpermq, AVX, SSE_OPF_AVX2), /* vpermpd */
+[0x02] = BINARY_OP(blendps, AVX, SSE_OPF_AVX2), /* vpblendd */
 [0x04] = UNARY_OP(vpermilps_imm, AVX, 0),
 [0x05] = UNARY_OP(vpermilpd_imm, AVX, 0),
 #define gen_helper_vpermdq_xmm NULL
-- 
2.36.0




[PATCH v2 17/42] i386: Destructive FP helpers for AVX

2022-04-24 Thread Paul Brook
Prepare the horizontal arithmetic vector helpers for AVX.
These currently use a dummy Reg typed variable to store the result then
assign the whole register.  This will cause 128 bit operations to corrupt
the upper half of the register, so replace it with explicit temporaries
and element assignments.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h | 96 +++
 1 file changed, 70 insertions(+), 26 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 4137e6e1fa..d128af6cc8 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -1196,44 +1196,88 @@ void helper_insertq_i(CPUX86State *env, ZMMReg *d, int 
index, int length)
 d->ZMM_Q(0) = helper_insertq(d->ZMM_Q(0), index, length);
 }
 
-void glue(helper_haddps, SUFFIX)(CPUX86State *env, ZMMReg *d, ZMMReg *s)
+void glue(helper_haddps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
-ZMMReg r;
-
-r.ZMM_S(0) = float32_add(d->ZMM_S(0), d->ZMM_S(1), >sse_status);
-r.ZMM_S(1) = float32_add(d->ZMM_S(2), d->ZMM_S(3), >sse_status);
-r.ZMM_S(2) = float32_add(s->ZMM_S(0), s->ZMM_S(1), >sse_status);
-r.ZMM_S(3) = float32_add(s->ZMM_S(2), s->ZMM_S(3), >sse_status);
-MOVE(*d, r);
+Reg *v = d;
+float32 r0, r1, r2, r3;
+
+r0 = float32_add(v->ZMM_S(0), v->ZMM_S(1), >sse_status);
+r1 = float32_add(v->ZMM_S(2), v->ZMM_S(3), >sse_status);
+r2 = float32_add(s->ZMM_S(0), s->ZMM_S(1), >sse_status);
+r3 = float32_add(s->ZMM_S(2), s->ZMM_S(3), >sse_status);
+d->ZMM_S(0) = r0;
+d->ZMM_S(1) = r1;
+d->ZMM_S(2) = r2;
+d->ZMM_S(3) = r3;
+#if SHIFT == 2
+r0 = float32_add(v->ZMM_S(4), v->ZMM_S(5), >sse_status);
+r1 = float32_add(v->ZMM_S(6), v->ZMM_S(7), >sse_status);
+r2 = float32_add(s->ZMM_S(4), s->ZMM_S(5), >sse_status);
+r3 = float32_add(s->ZMM_S(6), s->ZMM_S(7), >sse_status);
+d->ZMM_S(4) = r0;
+d->ZMM_S(5) = r1;
+d->ZMM_S(6) = r2;
+d->ZMM_S(7) = r3;
+#endif
 }
 
-void glue(helper_haddpd, SUFFIX)(CPUX86State *env, ZMMReg *d, ZMMReg *s)
+void glue(helper_haddpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
-ZMMReg r;
+Reg *v = d;
+float64 r0, r1;
 
-r.ZMM_D(0) = float64_add(d->ZMM_D(0), d->ZMM_D(1), >sse_status);
-r.ZMM_D(1) = float64_add(s->ZMM_D(0), s->ZMM_D(1), >sse_status);
-MOVE(*d, r);
+r0 = float64_add(v->ZMM_D(0), v->ZMM_D(1), >sse_status);
+r1 = float64_add(s->ZMM_D(0), s->ZMM_D(1), >sse_status);
+d->ZMM_D(0) = r0;
+d->ZMM_D(1) = r1;
+#if SHIFT == 2
+r0 = float64_add(v->ZMM_D(2), v->ZMM_D(3), >sse_status);
+r1 = float64_add(s->ZMM_D(2), s->ZMM_D(3), >sse_status);
+d->ZMM_D(2) = r0;
+d->ZMM_D(3) = r1;
+#endif
 }
 
-void glue(helper_hsubps, SUFFIX)(CPUX86State *env, ZMMReg *d, ZMMReg *s)
+void glue(helper_hsubps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
-ZMMReg r;
-
-r.ZMM_S(0) = float32_sub(d->ZMM_S(0), d->ZMM_S(1), >sse_status);
-r.ZMM_S(1) = float32_sub(d->ZMM_S(2), d->ZMM_S(3), >sse_status);
-r.ZMM_S(2) = float32_sub(s->ZMM_S(0), s->ZMM_S(1), >sse_status);
-r.ZMM_S(3) = float32_sub(s->ZMM_S(2), s->ZMM_S(3), >sse_status);
-MOVE(*d, r);
+Reg *v = d;
+float32 r0, r1, r2, r3;
+
+r0 = float32_sub(v->ZMM_S(0), v->ZMM_S(1), >sse_status);
+r1 = float32_sub(v->ZMM_S(2), v->ZMM_S(3), >sse_status);
+r2 = float32_sub(s->ZMM_S(0), s->ZMM_S(1), >sse_status);
+r3 = float32_sub(s->ZMM_S(2), s->ZMM_S(3), >sse_status);
+d->ZMM_S(0) = r0;
+d->ZMM_S(1) = r1;
+d->ZMM_S(2) = r2;
+d->ZMM_S(3) = r3;
+#if SHIFT == 2
+r0 = float32_sub(v->ZMM_S(4), v->ZMM_S(5), >sse_status);
+r1 = float32_sub(v->ZMM_S(6), v->ZMM_S(7), >sse_status);
+r2 = float32_sub(s->ZMM_S(4), s->ZMM_S(5), >sse_status);
+r3 = float32_sub(s->ZMM_S(6), s->ZMM_S(7), >sse_status);
+d->ZMM_S(4) = r0;
+d->ZMM_S(5) = r1;
+d->ZMM_S(6) = r2;
+d->ZMM_S(7) = r3;
+#endif
 }
 
-void glue(helper_hsubpd, SUFFIX)(CPUX86State *env, ZMMReg *d, ZMMReg *s)
+void glue(helper_hsubpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
-ZMMReg r;
+Reg *v = d;
+float64 r0, r1;
 
-r.ZMM_D(0) = float64_sub(d->ZMM_D(0), d->ZMM_D(1), >sse_status);
-r.ZMM_D(1) = float64_sub(s->ZMM_D(0), s->ZMM_D(1), >sse_status);
-MOVE(*d, r);
+r0 = float64_sub(v->ZMM_D(0), v->ZMM_D(1), >sse_status);
+r1 = float64_sub(s->ZMM_D(0), s->ZMM_D(1), >sse_status);
+d->ZMM_D(0) = r0;
+d->ZMM_D(1) = r1;
+#if SHIFT == 2
+r0 = float64_sub(v->ZMM_D(2), v->ZMM_D(3), >sse_status);
+r1 = float64_sub(s->ZMM_D(2), s->ZMM_D(3), >sse_status);
+d->ZMM_D(2) = r0;
+d->ZMM_D(3) = r1;
+#endif
 }
 
 void glue(helper_addsubps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
-- 
2.36.0




[PATCH v2 30/42] i386: Implement VPERMIL

2022-04-24 Thread Paul Brook
Some potentially surprising details when comparing vpermilpd v.s. vpermilps,
but overall pretty straightforward.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 82 
 target/i386/ops_sse_header.h |  4 ++
 target/i386/tcg/translate.c  |  4 ++
 3 files changed, 90 insertions(+)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 4115c9a257..9b92b9790a 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -3113,6 +3113,88 @@ void glue(helper_vbroadcastq, SUFFIX)(CPUX86State *env, 
Reg *d, Reg *s)
 #endif
 }
 
+void glue(helper_vpermilpd, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
+{
+uint64_t r0, r1;
+
+r0 = v->Q((s->Q(0) >> 1) & 1);
+r1 = v->Q((s->Q(1) >> 1) & 1);
+d->Q(0) = r0;
+d->Q(1) = r1;
+#if SHIFT == 2
+r0 = v->Q(((s->Q(2) >> 1) & 1) + 2);
+r1 = v->Q(((s->Q(3) >> 1) & 1) + 2);
+d->Q(2) = r0;
+d->Q(3) = r1;
+#endif
+}
+
+void glue(helper_vpermilps, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
+{
+uint32_t r0, r1, r2, r3;
+
+r0 = v->L(s->L(0) & 3);
+r1 = v->L(s->L(1) & 3);
+r2 = v->L(s->L(2) & 3);
+r3 = v->L(s->L(3) & 3);
+d->L(0) = r0;
+d->L(1) = r1;
+d->L(2) = r2;
+d->L(3) = r3;
+#if SHIFT == 2
+r0 = v->L((s->L(4) & 3) + 4);
+r1 = v->L((s->L(5) & 3) + 4);
+r2 = v->L((s->L(6) & 3) + 4);
+r3 = v->L((s->L(7) & 3) + 4);
+d->L(4) = r0;
+d->L(5) = r1;
+d->L(6) = r2;
+d->L(7) = r3;
+#endif
+}
+
+void glue(helper_vpermilpd_imm, SUFFIX)(CPUX86State *env,
+Reg *d, Reg *s, uint32_t order)
+{
+uint64_t r0, r1;
+
+r0 = s->Q((order >> 0) & 1);
+r1 = s->Q((order >> 1) & 1);
+d->Q(0) = r0;
+d->Q(1) = r1;
+#if SHIFT == 2
+r0 = s->Q(((order >> 2) & 1) + 2);
+r1 = s->Q(((order >> 3) & 1) + 2);
+d->Q(2) = r0;
+d->Q(3) = r1;
+#endif
+}
+
+void glue(helper_vpermilps_imm, SUFFIX)(CPUX86State *env,
+Reg *d, Reg *s, uint32_t order)
+{
+uint32_t r0, r1, r2, r3;
+
+r0 = s->L((order >> 0) & 3);
+r1 = s->L((order >> 2) & 3);
+r2 = s->L((order >> 4) & 3);
+r3 = s->L((order >> 6) & 3);
+d->L(0) = r0;
+d->L(1) = r1;
+d->L(2) = r2;
+d->L(3) = r3;
+#if SHIFT == 2
+r0 = s->L(((order >> 0) & 3) + 4);
+r1 = s->L(((order >> 2) & 3) + 4);
+r2 = s->L(((order >> 4) & 3) + 4);
+r3 = s->L(((order >> 6) & 3) + 4);
+d->L(4) = r0;
+d->L(5) = r1;
+d->L(6) = r2;
+d->L(7) = r3;
+#endif
+}
+
 #if SHIFT == 2
 void glue(helper_vbroadcastdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index 51e02cd4fa..c52169a030 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -417,6 +417,10 @@ DEF_HELPER_3(glue(vbroadcastb, SUFFIX), void, env, Reg, 
Reg)
 DEF_HELPER_3(glue(vbroadcastw, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(vbroadcastl, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(vbroadcastq, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(vpermilpd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpermilps, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpermilpd_imm, SUFFIX), void, env, Reg, Reg, i32)
+DEF_HELPER_4(glue(vpermilps_imm, SUFFIX), void, env, Reg, Reg, i32)
 #if SHIFT == 2
 DEF_HELPER_3(glue(vbroadcastdq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_1(vzeroall, void, env)
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 59ab1dc562..358c3ecb0b 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3251,6 +3251,8 @@ static const struct SSEOpHelper_table6 sse_op_table6[256] 
= {
 [0x09] = BINARY_OP_MMX(psignw, SSSE3),
 [0x0a] = BINARY_OP_MMX(psignd, SSSE3),
 [0x0b] = BINARY_OP_MMX(pmulhrsw, SSSE3),
+[0x0c] = BINARY_OP(vpermilps, AVX, 0),
+[0x0d] = BINARY_OP(vpermilpd, AVX, 0),
 [0x10] = BLENDV_OP(pblendvb, SSE41, SSE_OPF_MMX),
 [0x14] = BLENDV_OP(blendvps, SSE41, 0),
 [0x15] = BLENDV_OP(blendvpd, SSE41, 0),
@@ -3311,6 +3313,8 @@ static const struct SSEOpHelper_table6 sse_op_table6[256] 
= {
 
 /* prefix [66] 0f 3a */
 static const struct SSEOpHelper_table7 sse_op_table7[256] = {
+[0x04] = UNARY_OP(vpermilps_imm, AVX, 0),
+[0x05] = UNARY_OP(vpermilpd_imm, AVX, 0),
 [0x08] = UNARY_OP(roundps, SSE41, 0),
 [0x09] = UNARY_OP(roundpd, SSE41, 0),
 #define gen_helper_roundss_ymm NULL
-- 
2.36.0




[PATCH v2 33/42] i386: Implement VMASKMOV

2022-04-24 Thread Paul Brook
Decoding these is a bit messy, but at least the integer and float variants
have the same semantics once decoded.

We don't try and be clever with the load forms, instead load the whole
vector then mask out the elements we want.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 48 
 target/i386/ops_sse_header.h |  4 +++
 target/i386/tcg/translate.c  | 34 +
 3 files changed, 86 insertions(+)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index edf14a25d7..ffcba3d02c 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -3240,6 +3240,54 @@ void glue(helper_vtestpd, SUFFIX)(CPUX86State *env, Reg 
*d, Reg *s)
 CC_SRC = ((zf >> 63) ? 0 : CC_Z) | ((cf >> 63) ? 0 : CC_C);
 }
 
+void glue(helper_vpmaskmovd_st, SUFFIX)(CPUX86State *env,
+Reg *s, Reg *v, target_ulong a0)
+{
+int i;
+
+for (i = 0; i < (2 << SHIFT); i++) {
+if (v->L(i) >> 31) {
+cpu_stl_data_ra(env, a0 + i * 4, s->L(i), GETPC());
+}
+}
+}
+
+void glue(helper_vpmaskmovq_st, SUFFIX)(CPUX86State *env,
+Reg *s, Reg *v, target_ulong a0)
+{
+int i;
+
+for (i = 0; i < (1 << SHIFT); i++) {
+if (v->Q(i) >> 63) {
+cpu_stq_data_ra(env, a0 + i * 8, s->Q(i), GETPC());
+}
+}
+}
+
+void glue(helper_vpmaskmovd, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
+{
+d->L(0) = (v->L(0) >> 31) ? s->L(0) : 0;
+d->L(1) = (v->L(1) >> 31) ? s->L(1) : 0;
+d->L(2) = (v->L(2) >> 31) ? s->L(2) : 0;
+d->L(3) = (v->L(3) >> 31) ? s->L(3) : 0;
+#if SHIFT == 2
+d->L(4) = (v->L(4) >> 31) ? s->L(4) : 0;
+d->L(5) = (v->L(5) >> 31) ? s->L(5) : 0;
+d->L(6) = (v->L(6) >> 31) ? s->L(6) : 0;
+d->L(7) = (v->L(7) >> 31) ? s->L(7) : 0;
+#endif
+}
+
+void glue(helper_vpmaskmovq, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s)
+{
+d->Q(0) = (v->Q(0) >> 63) ? s->Q(0) : 0;
+d->Q(1) = (v->Q(1) >> 63) ? s->Q(1) : 0;
+#if SHIFT == 2
+d->Q(2) = (v->Q(2) >> 63) ? s->Q(2) : 0;
+d->Q(3) = (v->Q(3) >> 63) ? s->Q(3) : 0;
+#endif
+}
+
 #if SHIFT == 2
 void glue(helper_vbroadcastdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
 {
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index 8b93b8e6d6..a7a6bf6b10 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -429,6 +429,10 @@ DEF_HELPER_4(glue(vpsravq, SUFFIX), void, env, Reg, Reg, 
Reg)
 DEF_HELPER_4(glue(vpsllvq, SUFFIX), void, env, Reg, Reg, Reg)
 DEF_HELPER_3(glue(vtestps, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_3(glue(vtestpd, SUFFIX), void, env, Reg, Reg)
+DEF_HELPER_4(glue(vpmaskmovd_st, SUFFIX), void, env, Reg, Reg, tl)
+DEF_HELPER_4(glue(vpmaskmovq_st, SUFFIX), void, env, Reg, Reg, tl)
+DEF_HELPER_4(glue(vpmaskmovd, SUFFIX), void, env, Reg, Reg, Reg)
+DEF_HELPER_4(glue(vpmaskmovq, SUFFIX), void, env, Reg, Reg, Reg)
 #if SHIFT == 2
 DEF_HELPER_3(glue(vbroadcastdq, SUFFIX), void, env, Reg, Reg)
 DEF_HELPER_1(vzeroall, void, env)
diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 2fbb7bfcad..e00195d301 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3277,6 +3277,10 @@ static const struct SSEOpHelper_table6 
sse_op_table6[256] = {
 [0x29] = BINARY_OP(pcmpeqq, SSE41, SSE_OPF_MMX),
 [0x2a] = SPECIAL_OP(SSE41), /* movntqda */
 [0x2b] = BINARY_OP(packusdw, SSE41, SSE_OPF_MMX),
+[0x2c] = BINARY_OP(vpmaskmovd, AVX, 0), /* vmaskmovps */
+[0x2d] = BINARY_OP(vpmaskmovq, AVX, 0), /* vmaskmovpd */
+[0x2e] = SPECIAL_OP(AVX), /* vmaskmovps */
+[0x2f] = SPECIAL_OP(AVX), /* vmaskmovpd */
 [0x30] = UNARY_OP(pmovzxbw, SSE41, SSE_OPF_MMX),
 [0x31] = UNARY_OP(pmovzxbd, SSE41, SSE_OPF_MMX),
 [0x32] = UNARY_OP(pmovzxbq, SSE41, SSE_OPF_MMX),
@@ -3308,6 +3312,9 @@ static const struct SSEOpHelper_table6 sse_op_table6[256] 
= {
 [0x78] = UNARY_OP(vbroadcastb, AVX, SSE_OPF_SCALAR | SSE_OPF_MMX),
 /* vpbroadcastw */
 [0x79] = UNARY_OP(vbroadcastw, AVX, SSE_OPF_SCALAR | SSE_OPF_MMX),
+/* vpmaskmovd, vpmaskmovq */
+[0x8c] = BINARY_OP(vpmaskmovd, AVX, SSE_OPF_AVX2),
+[0x8e] = SPECIAL_OP(AVX), /* vpmaskmovd, vpmaskmovq */
 #define gen_helper_aesimc_ymm NULL
 [0xdb] = UNARY_OP(aesimc, AES, 0),
 [0xdc] = BINARY_OP(aesenc, AES, 0),
@@ -3369,6 +3376,11 @@ static const SSEFunc_0_eppp sse_op_table8[3][2] = {
 SSE_OP(vpsravq),
 SSE_OP(vpsllvq),
 };
+
+static const SSEFunc_0_eppt sse_op_table9[2][2] = {
+SSE_OP(vpmaskmovd_st),
+SSE_OP(vpmaskmovq_st),
+};
 #undef SSE_OP
 
 /* VEX prefix not allowed */
@@ -4394,6 +4406,22 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 gen_clear_ymmh(s, reg);
 }
 return;
+case 0x2e: /* maskmovpd */
+b1 = 0;
+   

[PATCH v2 37/42] i386: Implement VBLENDV

2022-04-24 Thread Paul Brook
The AVX variants of the BLENDV instructions use a different opcode prefix
to support the additional operands. We already modified the helper functions
in anticipation of this.

Signed-off-by: Paul Brook 
---
 target/i386/tcg/translate.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 4072fa28d3..95ecdea8fe 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3384,6 +3384,9 @@ static const struct SSEOpHelper_table7 sse_op_table7[256] 
= {
 [0x42] = BINARY_OP(mpsadbw, SSE41, SSE_OPF_MMX),
 [0x44] = BINARY_OP(pclmulqdq, PCLMULQDQ, 0),
 [0x46] = BINARY_OP(vpermdq, AVX, SSE_OPF_AVX2), /* vperm2i128 */
+[0x4a] = BLENDV_OP(blendvps, AVX, 0),
+[0x4b] = BLENDV_OP(blendvpd, AVX, 0),
+[0x4c] = BLENDV_OP(pblendvb, AVX, SSE_OPF_MMX),
 #define gen_helper_pcmpestrm_ymm NULL
 [0x60] = CMP_OP(pcmpestrm, SSE42),
 #define gen_helper_pcmpestri_ymm NULL
@@ -5268,6 +5271,10 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 }
 
 /* SSE */
+if (op7.flags & SSE_OPF_BLENDV && !(s->prefix & PREFIX_VEX)) {
+/* Only VEX encodings are valid for these blendv opcodes */
+goto illegal_op;
+}
 op1_offset = ZMM_OFFSET(reg);
 if (mod == 3) {
 op2_offset = ZMM_OFFSET(rm | REX_B(s));
@@ -5316,8 +5323,15 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 op7.fn[b1].op1(cpu_env, s->ptr0, s->ptr1, tcg_const_i32(val));
 } else {
 tcg_gen_addi_ptr(s->ptr2, cpu_env, v_offset);
-op7.fn[b1].op2(cpu_env, s->ptr0, s->ptr2, s->ptr1,
-   tcg_const_i32(val));
+if (op7.flags & SSE_OPF_BLENDV) {
+TCGv_ptr mask = tcg_temp_new_ptr();
+tcg_gen_addi_ptr(mask, cpu_env, ZMM_OFFSET(val >> 4));
+op7.fn[b1].op3(cpu_env, s->ptr0, s->ptr2, s->ptr1, mask);
+tcg_temp_free_ptr(mask);
+} else {
+op7.fn[b1].op2(cpu_env, s->ptr0, s->ptr2, s->ptr1,
+   tcg_const_i32(val));
+}
 }
 if ((op7.flags & SSE_OPF_CMP) == 0 && s->vex_l == 0) {
 gen_clear_ymmh(s, reg);
-- 
2.36.0




[PATCH v2 10/42] i386: Rewrite vector shift helper

2022-04-24 Thread Paul Brook
Rewrite the vector shift helpers in preparation for AVX support (3 operand
form and 256 bit vectors).

For now keep the existing two operand interface.

No functional changes to existing helpers.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h | 250 ++
 1 file changed, 133 insertions(+), 117 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 23daab6b50..9297c96d04 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -63,199 +63,215 @@
 #define MOVE(d, r) memcpy(&(d).B(0), &(r).B(0), SIZE)
 #endif
 
-void glue(helper_psrlw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+#if SHIFT == 0
+#define SHIFT_HELPER_BODY(n, elem, F) do {  \
+d->elem(0) = F(s->elem(0), shift);  \
+if ((n) > 1) {  \
+d->elem(1) = F(s->elem(1), shift);  \
+}   \
+if ((n) > 2) {  \
+d->elem(2) = F(s->elem(2), shift);  \
+d->elem(3) = F(s->elem(3), shift);  \
+}   \
+if ((n) > 4) {  \
+d->elem(4) = F(s->elem(4), shift);  \
+d->elem(5) = F(s->elem(5), shift);  \
+d->elem(6) = F(s->elem(6), shift);  \
+d->elem(7) = F(s->elem(7), shift);  \
+}   \
+if ((n) > 8) {  \
+d->elem(8) = F(s->elem(8), shift);  \
+d->elem(9) = F(s->elem(9), shift);  \
+d->elem(10) = F(s->elem(10), shift);\
+d->elem(11) = F(s->elem(11), shift);\
+d->elem(12) = F(s->elem(12), shift);\
+d->elem(13) = F(s->elem(13), shift);\
+d->elem(14) = F(s->elem(14), shift);\
+d->elem(15) = F(s->elem(15), shift);\
+}   \
+} while (0)
+
+#define FPSRL(x, c) ((x) >> shift)
+#define FPSRAW(x, c) ((int16_t)(x) >> shift)
+#define FPSRAL(x, c) ((int32_t)(x) >> shift)
+#define FPSLL(x, c) ((x) << shift)
+#endif
+
+void glue(helper_psrlw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
 {
+Reg *s = d;
 int shift;
-
-if (s->Q(0) > 15) {
+if (c->Q(0) > 15) {
 d->Q(0) = 0;
-#if SHIFT == 1
-d->Q(1) = 0;
-#endif
+XMM_ONLY(d->Q(1) = 0;)
+YMM_ONLY(
+d->Q(2) = 0;
+d->Q(3) = 0;
+)
 } else {
-shift = s->B(0);
-d->W(0) >>= shift;
-d->W(1) >>= shift;
-d->W(2) >>= shift;
-d->W(3) >>= shift;
-#if SHIFT == 1
-d->W(4) >>= shift;
-d->W(5) >>= shift;
-d->W(6) >>= shift;
-d->W(7) >>= shift;
-#endif
+shift = c->B(0);
+SHIFT_HELPER_BODY(4 << SHIFT, W, FPSRL);
 }
 }
 
-void glue(helper_psraw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_psllw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
 {
+Reg *s = d;
 int shift;
-
-if (s->Q(0) > 15) {
-shift = 15;
+if (c->Q(0) > 15) {
+d->Q(0) = 0;
+XMM_ONLY(d->Q(1) = 0;)
+YMM_ONLY(
+d->Q(2) = 0;
+d->Q(3) = 0;
+)
 } else {
-shift = s->B(0);
+shift = c->B(0);
+SHIFT_HELPER_BODY(4 << SHIFT, W, FPSLL);
 }
-d->W(0) = (int16_t)d->W(0) >> shift;
-d->W(1) = (int16_t)d->W(1) >> shift;
-d->W(2) = (int16_t)d->W(2) >> shift;
-d->W(3) = (int16_t)d->W(3) >> shift;
-#if SHIFT == 1
-d->W(4) = (int16_t)d->W(4) >> shift;
-d->W(5) = (int16_t)d->W(5) >> shift;
-d->W(6) = (int16_t)d->W(6) >> shift;
-d->W(7) = (int16_t)d->W(7) >> shift;
-#endif
 }
 
-void glue(helper_psllw, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_psraw, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
 {
+Reg *s = d;
 int shift;
-
-if (s->Q(0) > 15) {
-d->Q(0) = 0;
-#if SHIFT == 1
-d->Q(1) = 0;
-#endif
+if (c->Q(0) > 15) {
+shift = 15;
 } else {
-shift = s->B(0);
-d->W(0) <<= shift;
-d->W(1) <<= shift;
-d->W(2) <<= shift;
-d->W(3) <<= shift;
-#if SHIFT == 1
-d->W(4) <<= shift;
-d->W(5) <<= shift;
-d->W(6) <<= shift;
-d->W(7) <<= shift;
-#endif
+shift = c->B(0);
 }
+SHIFT_HELPER_BODY(4 << SHIFT, W, FPSRAW);
 }
 
-void glue(helper_psrld, SUFFIX)(CPUX86State *env, Reg *d, Reg *s)
+void glue(helper_psrld, SUFFIX)(CPUX86State *env, Reg *d, Reg *c)
 {
+Reg *s = d;
 int shift;
-
-if (s->Q(0) > 31) {
+if (c->Q(0) > 31) {
 d->Q(0) = 0;
-#if SHIFT == 1
-d->Q(1) = 0;
-#endif
+XMM_ONLY(d->Q(1) = 0;)
+YMM_ONLY(
+d->Q(2) = 0;
+d->Q(3) = 0;
+)
 } else {
-shift = s->B(0);
-d->L(0) >>= shift;
-d->L(1) >>= shift;
-#if SHIFT == 1
-d->L(2) >>= shift;

[PATCH v2 05/42] i386: Rework sse_op_table6/7

2022-04-24 Thread Paul Brook
Add a flags field to each row in sse_op_table6 and sse_op_table7.

Initially this is only used as a replacement for the magic
SSE41_SPECIAL pointer.  The other flags will become relevant
as the rest of the avx implementation is built out.

Signed-off-by: Paul Brook 
---
 target/i386/tcg/translate.c | 232 
 1 file changed, 132 insertions(+), 100 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 7fec582358..5335b86c01 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2977,7 +2977,6 @@ static const struct SSEOpHelper_table1 sse_op_table1[256] 
= {
 #undef SSE_SPECIAL
 
 #define MMX_OP2(x) { gen_helper_ ## x ## _mmx, gen_helper_ ## x ## _xmm }
-#define SSE_SPECIAL_FN ((void *)1)
 
 static const SSEFunc_0_epp sse_op_table2[3 * 8][2] = {
 [0 + 2] = MMX_OP2(psrlw),
@@ -3061,113 +3060,134 @@ static const SSEFunc_0_epp sse_op_table5[256] = {
 [0xbf] = gen_helper_pavgb_mmx /* pavgusb */
 };
 
-struct SSEOpHelper_epp {
+struct SSEOpHelper_table6 {
 SSEFunc_0_epp op[2];
 uint32_t ext_mask;
+int flags;
 };
 
-struct SSEOpHelper_eppi {
+struct SSEOpHelper_table7 {
 SSEFunc_0_eppi op[2];
 uint32_t ext_mask;
+int flags;
 };
 
-#define SSSE3_OP(x) { MMX_OP2(x), CPUID_EXT_SSSE3 }
-#define SSE41_OP(x) { { NULL, gen_helper_ ## x ## _xmm }, CPUID_EXT_SSE41 }
-#define SSE42_OP(x) { { NULL, gen_helper_ ## x ## _xmm }, CPUID_EXT_SSE42 }
-#define SSE41_SPECIAL { { NULL, SSE_SPECIAL_FN }, CPUID_EXT_SSE41 }
-#define PCLMULQDQ_OP(x) { { NULL, gen_helper_ ## x ## _xmm }, \
-CPUID_EXT_PCLMULQDQ }
-#define AESNI_OP(x) { { NULL, gen_helper_ ## x ## _xmm }, CPUID_EXT_AES }
-
-static const struct SSEOpHelper_epp sse_op_table6[256] = {
-[0x00] = SSSE3_OP(pshufb),
-[0x01] = SSSE3_OP(phaddw),
-[0x02] = SSSE3_OP(phaddd),
-[0x03] = SSSE3_OP(phaddsw),
-[0x04] = SSSE3_OP(pmaddubsw),
-[0x05] = SSSE3_OP(phsubw),
-[0x06] = SSSE3_OP(phsubd),
-[0x07] = SSSE3_OP(phsubsw),
-[0x08] = SSSE3_OP(psignb),
-[0x09] = SSSE3_OP(psignw),
-[0x0a] = SSSE3_OP(psignd),
-[0x0b] = SSSE3_OP(pmulhrsw),
-[0x10] = SSE41_OP(pblendvb),
-[0x14] = SSE41_OP(blendvps),
-[0x15] = SSE41_OP(blendvpd),
-[0x17] = SSE41_OP(ptest),
-[0x1c] = SSSE3_OP(pabsb),
-[0x1d] = SSSE3_OP(pabsw),
-[0x1e] = SSSE3_OP(pabsd),
-[0x20] = SSE41_OP(pmovsxbw),
-[0x21] = SSE41_OP(pmovsxbd),
-[0x22] = SSE41_OP(pmovsxbq),
-[0x23] = SSE41_OP(pmovsxwd),
-[0x24] = SSE41_OP(pmovsxwq),
-[0x25] = SSE41_OP(pmovsxdq),
-[0x28] = SSE41_OP(pmuldq),
-[0x29] = SSE41_OP(pcmpeqq),
-[0x2a] = SSE41_SPECIAL, /* movntqda */
-[0x2b] = SSE41_OP(packusdw),
-[0x30] = SSE41_OP(pmovzxbw),
-[0x31] = SSE41_OP(pmovzxbd),
-[0x32] = SSE41_OP(pmovzxbq),
-[0x33] = SSE41_OP(pmovzxwd),
-[0x34] = SSE41_OP(pmovzxwq),
-[0x35] = SSE41_OP(pmovzxdq),
-[0x37] = SSE42_OP(pcmpgtq),
-[0x38] = SSE41_OP(pminsb),
-[0x39] = SSE41_OP(pminsd),
-[0x3a] = SSE41_OP(pminuw),
-[0x3b] = SSE41_OP(pminud),
-[0x3c] = SSE41_OP(pmaxsb),
-[0x3d] = SSE41_OP(pmaxsd),
-[0x3e] = SSE41_OP(pmaxuw),
-[0x3f] = SSE41_OP(pmaxud),
-[0x40] = SSE41_OP(pmulld),
-[0x41] = SSE41_OP(phminposuw),
-[0xdb] = AESNI_OP(aesimc),
-[0xdc] = AESNI_OP(aesenc),
-[0xdd] = AESNI_OP(aesenclast),
-[0xde] = AESNI_OP(aesdec),
-[0xdf] = AESNI_OP(aesdeclast),
+#define gen_helper_special_xmm NULL
+
+#define OP(name, op, flags, ext, mmx_name) \
+{{mmx_name, gen_helper_ ## name ## _xmm}, CPUID_EXT_ ## ext, flags}
+#define BINARY_OP_MMX(name, ext) \
+OP(name, op2, SSE_OPF_MMX, ext, gen_helper_ ## name ## _mmx)
+#define BINARY_OP(name, ext, flags) \
+OP(name, op2, flags, ext, NULL)
+#define UNARY_OP_MMX(name, ext) \
+OP(name, op1, SSE_OPF_V0 | SSE_OPF_MMX, ext, gen_helper_ ## name ## _mmx)
+#define UNARY_OP(name, ext, flags) \
+OP(name, op1, SSE_OPF_V0 | flags, ext, NULL)
+#define BLENDV_OP(name, ext, flags) OP(name, op3, SSE_OPF_BLENDV, ext, NULL)
+#define CMP_OP(name, ext) OP(name, op1, SSE_OPF_CMP | SSE_OPF_V0, ext, NULL)
+#define SPECIAL_OP(ext) OP(special, op1, SSE_OPF_SPECIAL, ext, NULL)
+
+/* prefix [66] 0f 38 */
+static const struct SSEOpHelper_table6 sse_op_table6[256] = {
+[0x00] = BINARY_OP_MMX(pshufb, SSSE3),
+[0x01] = BINARY_OP_MMX(phaddw, SSSE3),
+[0x02] = BINARY_OP_MMX(phaddd, SSSE3),
+[0x03] = BINARY_OP_MMX(phaddsw, SSSE3),
+[0x04] = BINARY_OP_MMX(pmaddubsw, SSSE3),
+[0x05] = BINARY_OP_MMX(phsubw, SSSE3),
+[0x06] = BINARY_OP_MMX(phsubd, SSSE3),
+[0x07] = BINARY_OP_MMX(phsubsw, SSSE3),
+[0x08] = BINARY_OP_MMX(psignb, SSSE3),
+[0x09] = BINARY_OP_MMX(psignw, SSSE3),
+[0x0a] = BINARY_OP_MMX(psignd, SSSE3),
+[0x0b] = BINARY_OP_MMX(pmulhrsw, SSSE3),
+[0x10] = BLENDV_OP(pblendvb, SSE41, SSE_OPF_MMX),
+[0x14] = BLENDV_OP(blendvps, SSE41, 0),
+[0x15] = BLENDV_OP(blendvpd, SSE41, 0),

[PATCH v2 08/42] i386: Add ZMM_OFFSET macro

2022-04-24 Thread Paul Brook
Add a convenience macro to get the address of an xmm_regs element within
CPUX86State.

This was originally going to be the basis of an implementation that broke
operations into 128 bit chunks. I scrapped that idea, so this is now a purely
cosmetic change. But I think a worthwhile one - it reduces the number of
function calls that need to be split over multiple lines.

No functional changes.

Signed-off-by: Paul Brook 
---
 target/i386/tcg/translate.c | 60 +
 1 file changed, 27 insertions(+), 33 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 2f5cc24e0c..e9e6062b7f 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2777,6 +2777,8 @@ static inline void gen_op_movq_env_0(DisasContext *s, int 
d_offset)
 tcg_gen_st_i64(s->tmp1_i64, cpu_env, d_offset);
 }
 
+#define ZMM_OFFSET(reg) offsetof(CPUX86State, xmm_regs[reg])
+
 typedef void (*SSEFunc_i_ep)(TCGv_i32 val, TCGv_ptr env, TCGv_ptr reg);
 typedef void (*SSEFunc_l_ep)(TCGv_i64 val, TCGv_ptr env, TCGv_ptr reg);
 typedef void (*SSEFunc_0_epi)(TCGv_ptr env, TCGv_ptr reg, TCGv_i32 val);
@@ -3329,14 +3331,14 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 if (mod == 3)
 goto illegal_op;
 gen_lea_modrm(env, s, modrm);
-gen_sto_env_A0(s, offsetof(CPUX86State, xmm_regs[reg]));
+gen_sto_env_A0(s, ZMM_OFFSET(reg));
 break;
 case 0x3f0: /* lddqu */
 CHECK_AVX_V0(s);
 if (mod == 3)
 goto illegal_op;
 gen_lea_modrm(env, s, modrm);
-gen_ldo_env_A0(s, offsetof(CPUX86State, xmm_regs[reg]));
+gen_ldo_env_A0(s, ZMM_OFFSET(reg));
 break;
 case 0x22b: /* movntss */
 case 0x32b: /* movntsd */
@@ -3375,15 +3377,13 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 #ifdef TARGET_X86_64
 if (s->dflag == MO_64) {
 gen_ldst_modrm(env, s, modrm, MO_64, OR_TMP0, 0);
-tcg_gen_addi_ptr(s->ptr0, cpu_env,
- offsetof(CPUX86State,xmm_regs[reg]));
+tcg_gen_addi_ptr(s->ptr0, cpu_env, ZMM_OFFSET(reg));
 gen_helper_movq_mm_T0_xmm(s->ptr0, s->T0);
 } else
 #endif
 {
 gen_ldst_modrm(env, s, modrm, MO_32, OR_TMP0, 0);
-tcg_gen_addi_ptr(s->ptr0, cpu_env,
- offsetof(CPUX86State,xmm_regs[reg]));
+tcg_gen_addi_ptr(s->ptr0, cpu_env, ZMM_OFFSET(reg));
 tcg_gen_trunc_tl_i32(s->tmp2_i32, s->T0);
 gen_helper_movl_mm_T0_xmm(s->ptr0, s->tmp2_i32);
 }
@@ -3410,11 +3410,10 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 CHECK_AVX_V0(s);
 if (mod != 3) {
 gen_lea_modrm(env, s, modrm);
-gen_ldo_env_A0(s, offsetof(CPUX86State, xmm_regs[reg]));
+gen_ldo_env_A0(s, ZMM_OFFSET(reg));
 } else {
 rm = (modrm & 7) | REX_B(s);
-gen_op_movo(s, offsetof(CPUX86State, xmm_regs[reg]),
-offsetof(CPUX86State,xmm_regs[rm]));
+gen_op_movo(s, ZMM_OFFSET(reg), ZMM_OFFSET(rm));
 }
 break;
 case 0x210: /* movss xmm, ea */
@@ -3474,7 +3473,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 CHECK_AVX_V0(s);
 if (mod != 3) {
 gen_lea_modrm(env, s, modrm);
-gen_ldo_env_A0(s, offsetof(CPUX86State, xmm_regs[reg]));
+gen_ldo_env_A0(s, ZMM_OFFSET(reg));
 } else {
 rm = (modrm & 7) | REX_B(s);
 gen_op_movl(s, offsetof(CPUX86State, xmm_regs[reg].ZMM_L(0)),
@@ -3519,7 +3518,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 CHECK_AVX_V0(s);
 if (mod != 3) {
 gen_lea_modrm(env, s, modrm);
-gen_ldo_env_A0(s, offsetof(CPUX86State, xmm_regs[reg]));
+gen_ldo_env_A0(s, ZMM_OFFSET(reg));
 } else {
 rm = (modrm & 7) | REX_B(s);
 gen_op_movl(s, offsetof(CPUX86State, xmm_regs[reg].ZMM_L(1)),
@@ -3542,8 +3541,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 goto illegal_op;
 field_length = x86_ldub_code(env, s) & 0x3F;
 bit_index = x86_ldub_code(env, s) & 0x3F;
-tcg_gen_addi_ptr(s->ptr0, cpu_env,
-offsetof(CPUX86State,xmm_regs[reg]));
+tcg_gen_addi_ptr(s->ptr0, cpu_env, ZMM_OFFSET(reg));
 if (b1 == 1)
 gen_helper_extrq_i(cpu_env, s->ptr0,
tcg_const_i32(bit_index),
@@ -3617,11 +3615,10 @@ static void 

[PATCH v2 07/42] Enforce VEX encoding restrictions

2022-04-24 Thread Paul Brook
Add CHECK_AVX* macros, and use them to validate VEX encoded AVX instructions

All AVX instructions require both CPU and OS support, this is encapsulated
by HF_AVX_EN.

Some also require specific values in the VEX.L and VEX.V fields.
Some (mostly integer operations) also require AVX2

Signed-off-by: Paul Brook 
---
 target/i386/tcg/translate.c | 159 +---
 1 file changed, 149 insertions(+), 10 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 66ba690b7d..2f5cc24e0c 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3185,10 +3185,54 @@ static const struct SSEOpHelper_table7 
sse_op_table7[256] = {
 goto illegal_op; \
 } while (0)
 
+/*
+ * VEX encodings require AVX
+ * Allow legacy SSE encodings even if AVX not enabled
+ */
+#define CHECK_AVX(s) do { \
+if ((s->prefix & PREFIX_VEX) \
+&& !(env->hflags & HF_AVX_EN_MASK)) \
+goto illegal_op; \
+} while (0)
+
+/* If a VEX prefix is used then it must have V=b */
+#define CHECK_AVX_V0(s) do { \
+CHECK_AVX(s); \
+if ((s->prefix & PREFIX_VEX) && (s->vex_v != 0)) \
+goto illegal_op; \
+} while (0)
+
+/* If a VEX prefix is used then it must have L=0 */
+#define CHECK_AVX_128(s) do { \
+CHECK_AVX(s); \
+if ((s->prefix & PREFIX_VEX) && (s->vex_l != 0)) \
+goto illegal_op; \
+} while (0)
+
+/* If a VEX prefix is used then it must have V=b and L=0 */
+#define CHECK_AVX_V0_128(s) do { \
+CHECK_AVX(s); \
+if ((s->prefix & PREFIX_VEX) && (s->vex_v != 0 || s->vex_l != 0)) \
+goto illegal_op; \
+} while (0)
+
+/* 256-bit (ymm) variants require AVX2 */
+#define CHECK_AVX2_256(s) do { \
+if (s->vex_l && !(s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_AVX2)) \
+goto illegal_op; \
+} while (0)
+
+/* Requires AVX2 and VEX encoding */
+#define CHECK_AVX2(s) do { \
+if ((s->prefix & PREFIX_VEX) == 0 \
+|| !(s->cpuid_7_0_ebx_features & CPUID_7_0_EBX_AVX2)) \
+goto illegal_op; \
+} while (0)
+
 static void gen_sse(CPUX86State *env, DisasContext *s, int b,
 target_ulong pc_start)
 {
-int b1, op1_offset, op2_offset, is_xmm, val;
+int b1, op1_offset, op2_offset, is_xmm, val, scalar_op;
 int modrm, mod, rm, reg;
 struct SSEOpHelper_table1 sse_op;
 struct SSEOpHelper_table6 op6;
@@ -3228,15 +3272,18 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 gen_exception(s, EXCP07_PREX, pc_start - s->cs_base);
 return;
 }
-if (s->flags & HF_EM_MASK) {
-illegal_op:
-gen_illegal_opcode(s);
-return;
-}
-if (is_xmm
-&& !(s->flags & HF_OSFXSR_MASK)
-&& (b != 0x38 && b != 0x3a)) {
-goto unknown_op;
+/* VEX encoded instuctions ignore EM bit. See also CHECK_AVX */
+if (!(s->prefix & PREFIX_VEX)) {
+if (s->flags & HF_EM_MASK) {
+illegal_op:
+gen_illegal_opcode(s);
+return;
+}
+if (is_xmm
+&& !(s->flags & HF_OSFXSR_MASK)
+&& (b != 0x38 && b != 0x3a)) {
+goto unknown_op;
+}
 }
 if (b == 0x0e) {
 if (!(s->cpuid_ext2_features & CPUID_EXT2_3DNOW)) {
@@ -3278,12 +3325,14 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 case 0x1e7: /* movntdq */
 case 0x02b: /* movntps */
 case 0x12b: /* movntps */
+CHECK_AVX_V0(s);
 if (mod == 3)
 goto illegal_op;
 gen_lea_modrm(env, s, modrm);
 gen_sto_env_A0(s, offsetof(CPUX86State, xmm_regs[reg]));
 break;
 case 0x3f0: /* lddqu */
+CHECK_AVX_V0(s);
 if (mod == 3)
 goto illegal_op;
 gen_lea_modrm(env, s, modrm);
@@ -3291,6 +3340,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 break;
 case 0x22b: /* movntss */
 case 0x32b: /* movntsd */
+CHECK_AVX_V0_128(s);
 if (mod == 3)
 goto illegal_op;
 gen_lea_modrm(env, s, modrm);
@@ -3321,6 +3371,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 }
 break;
 case 0x16e: /* movd xmm, ea */
+CHECK_AVX_V0_128(s);
 #ifdef TARGET_X86_64
 if (s->dflag == MO_64) {
 gen_ldst_modrm(env, s, modrm, MO_64, OR_TMP0, 0);
@@ -3356,6 +3407,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 case 0x128: /* movapd */
 case 0x16f: /* movdqa xmm, ea */
 case 0x26f: /* movdqu xmm, ea */
+CHECK_AVX_V0(s);
 if (mod != 3) {
 gen_lea_modrm(env, s, modrm);
 gen_ldo_env_A0(s, offsetof(CPUX86State, xmm_regs[reg]));
@@ -3367,6 +3419,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 

[PATCH v2 03/42] Add AVX_EN hflag

2022-04-24 Thread Paul Brook
Add a new hflag bit to determine whether AVX instructions are allowed

Signed-off-by: Paul Brook 
---
 target/i386/cpu.h|  3 +++
 target/i386/helper.c | 12 
 target/i386/tcg/fpu_helper.c |  1 +
 3 files changed, 16 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 9661f9fbd1..65200a1917 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -169,6 +169,7 @@ typedef enum X86Seg {
 #define HF_MPX_EN_SHIFT 25 /* MPX Enabled (CR4+XCR0+BNDCFGx) */
 #define HF_MPX_IU_SHIFT 26 /* BND registers in-use */
 #define HF_UMIP_SHIFT   27 /* CR4.UMIP */
+#define HF_AVX_EN_SHIFT 28 /* AVX Enabled (CR4+XCR0) */
 
 #define HF_CPL_MASK  (3 << HF_CPL_SHIFT)
 #define HF_INHIBIT_IRQ_MASK  (1 << HF_INHIBIT_IRQ_SHIFT)
@@ -195,6 +196,7 @@ typedef enum X86Seg {
 #define HF_MPX_EN_MASK   (1 << HF_MPX_EN_SHIFT)
 #define HF_MPX_IU_MASK   (1 << HF_MPX_IU_SHIFT)
 #define HF_UMIP_MASK (1 << HF_UMIP_SHIFT)
+#define HF_AVX_EN_MASK   (1 << HF_AVX_EN_SHIFT)
 
 /* hflags2 */
 
@@ -2035,6 +2037,7 @@ void host_cpuid(uint32_t function, uint32_t count,
 
 /* helper.c */
 void x86_cpu_set_a20(X86CPU *cpu, int a20_state);
+void cpu_sync_avx_hflag(CPUX86State *env);
 
 #ifndef CONFIG_USER_ONLY
 static inline int x86_asidx_from_attrs(CPUState *cs, MemTxAttrs attrs)
diff --git a/target/i386/helper.c b/target/i386/helper.c
index fa409e9c44..30083c9cff 100644
--- a/target/i386/helper.c
+++ b/target/i386/helper.c
@@ -29,6 +29,17 @@
 #endif
 #include "qemu/log.h"
 
+void cpu_sync_avx_hflag(CPUX86State *env)
+{
+if ((env->cr[4] & CR4_OSXSAVE_MASK)
+&& (env->xcr0 & (XSTATE_SSE_MASK | XSTATE_YMM_MASK))
+== (XSTATE_SSE_MASK | XSTATE_YMM_MASK)) {
+env->hflags |= HF_AVX_EN_MASK;
+} else{
+env->hflags &= ~HF_AVX_EN_MASK;
+}
+}
+
 void cpu_sync_bndcs_hflags(CPUX86State *env)
 {
 uint32_t hflags = env->hflags;
@@ -209,6 +220,7 @@ void cpu_x86_update_cr4(CPUX86State *env, uint32_t new_cr4)
 env->hflags = hflags;
 
 cpu_sync_bndcs_hflags(env);
+cpu_sync_avx_hflag(env);
 }
 
 #if !defined(CONFIG_USER_ONLY)
diff --git a/target/i386/tcg/fpu_helper.c b/target/i386/tcg/fpu_helper.c
index ebf5e73df9..b391b69635 100644
--- a/target/i386/tcg/fpu_helper.c
+++ b/target/i386/tcg/fpu_helper.c
@@ -2943,6 +2943,7 @@ void helper_xsetbv(CPUX86State *env, uint32_t ecx, 
uint64_t mask)
 
 env->xcr0 = mask;
 cpu_sync_bndcs_hflags(env);
+cpu_sync_avx_hflag(env);
 return;
 
  do_gpf:
-- 
2.36.0




[PATCH v2 06/42] i386: Add CHECK_NO_VEX

2022-04-24 Thread Paul Brook
Reject invalid VEX encodings on MMX instructions.

Signed-off-by: Paul Brook 
---
 target/i386/tcg/translate.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index 5335b86c01..66ba690b7d 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -3179,6 +3179,12 @@ static const struct SSEOpHelper_table7 
sse_op_table7[256] = {
 #undef BLENDV_OP
 #undef SPECIAL_OP
 
+/* VEX prefix not allowed */
+#define CHECK_NO_VEX(s) do { \
+if (s->prefix & PREFIX_VEX) \
+goto illegal_op; \
+} while (0)
+
 static void gen_sse(CPUX86State *env, DisasContext *s, int b,
 target_ulong pc_start)
 {
@@ -3262,6 +3268,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 b |= (b1 << 8);
 switch(b) {
 case 0x0e7: /* movntq */
+CHECK_NO_VEX(s);
 if (mod == 3) {
 goto illegal_op;
 }
@@ -3297,6 +3304,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 }
 break;
 case 0x6e: /* movd mm, ea */
+CHECK_NO_VEX(s);
 #ifdef TARGET_X86_64
 if (s->dflag == MO_64) {
 gen_ldst_modrm(env, s, modrm, MO_64, OR_TMP0, 0);
@@ -3330,6 +3338,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 }
 break;
 case 0x6f: /* movq mm, ea */
+CHECK_NO_VEX(s);
 if (mod != 3) {
 gen_lea_modrm(env, s, modrm);
 gen_ldq_env_A0(s, offsetof(CPUX86State, fpregs[reg].mmx));
@@ -3464,6 +3473,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 break;
 case 0x178:
 case 0x378:
+CHECK_NO_VEX(s);
 {
 int bit_index, field_length;
 
@@ -3484,6 +3494,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 }
 break;
 case 0x7e: /* movd ea, mm */
+CHECK_NO_VEX(s);
 #ifdef TARGET_X86_64
 if (s->dflag == MO_64) {
 tcg_gen_ld_i64(s->T0, cpu_env,
@@ -3524,6 +3535,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 gen_op_movq_env_0(s, offsetof(CPUX86State, 
xmm_regs[reg].ZMM_Q(1)));
 break;
 case 0x7f: /* movq ea, mm */
+CHECK_NO_VEX(s);
 if (mod != 3) {
 gen_lea_modrm(env, s, modrm);
 gen_stq_env_A0(s, offsetof(CPUX86State, fpregs[reg].mmx));
@@ -3607,6 +3619,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 offsetof(CPUX86State, xmm_t0.ZMM_L(1)));
 op1_offset = offsetof(CPUX86State,xmm_t0);
 } else {
+CHECK_NO_VEX(s);
 tcg_gen_movi_tl(s->T0, val);
 tcg_gen_st32_tl(s->T0, cpu_env,
 offsetof(CPUX86State, mmx_t0.MMX_L(0)));
@@ -3648,6 +3661,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 break;
 case 0x02a: /* cvtpi2ps */
 case 0x12a: /* cvtpi2pd */
+CHECK_NO_VEX(s);
 gen_helper_enter_mmx(cpu_env);
 if (mod != 3) {
 gen_lea_modrm(env, s, modrm);
@@ -3693,6 +3707,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 case 0x12c: /* cvttpd2pi */
 case 0x02d: /* cvtps2pi */
 case 0x12d: /* cvtpd2pi */
+CHECK_NO_VEX(s);
 gen_helper_enter_mmx(cpu_env);
 if (mod != 3) {
 gen_lea_modrm(env, s, modrm);
@@ -3766,6 +3781,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 tcg_gen_st16_tl(s->T0, cpu_env,
 
offsetof(CPUX86State,xmm_regs[reg].ZMM_W(val)));
 } else {
+CHECK_NO_VEX(s);
 val &= 3;
 tcg_gen_st16_tl(s->T0, cpu_env,
 
offsetof(CPUX86State,fpregs[reg].mmx.MMX_W(val)));
@@ -3805,6 +3821,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 }
 break;
 case 0x2d6: /* movq2dq */
+CHECK_NO_VEX(s);
 gen_helper_enter_mmx(cpu_env);
 rm = (modrm & 7);
 gen_op_movq(s, offsetof(CPUX86State, xmm_regs[reg].ZMM_Q(0)),
@@ -3812,6 +3829,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
 gen_op_movq_env_0(s, offsetof(CPUX86State, 
xmm_regs[reg].ZMM_Q(1)));
 break;
 case 0x3d6: /* movdq2q */
+CHECK_NO_VEX(s);
 gen_helper_enter_mmx(cpu_env);
 rm = (modrm & 7) | REX_B(s);
 gen_op_movq(s, offsetof(CPUX86State, fpregs[reg & 7].mmx),
@@ -3827,6 +3845,7 @@ static void gen_sse(CPUX86State *env, DisasContext *s, 
int b,
   

[PATCH v2 04/42] i386: Rework sse_op_table1

2022-04-24 Thread Paul Brook
Add a flags field to each row in sse_op_table1.

Initially this is only used as a replacement for the magic
SSE_SPECIAL and SSE_DUMMY pointers, the other flags will become relevant
as the rest of the AVX implementation is built out.

Signed-off-by: Paul Brook 
---
 target/i386/tcg/translate.c | 316 +---
 1 file changed, 186 insertions(+), 130 deletions(-)

diff --git a/target/i386/tcg/translate.c b/target/i386/tcg/translate.c
index b7972f0ff5..7fec582358 100644
--- a/target/i386/tcg/translate.c
+++ b/target/i386/tcg/translate.c
@@ -2788,146 +2788,196 @@ typedef void (*SSEFunc_0_ppi)(TCGv_ptr reg_a, 
TCGv_ptr reg_b, TCGv_i32 val);
 typedef void (*SSEFunc_0_eppt)(TCGv_ptr env, TCGv_ptr reg_a, TCGv_ptr reg_b,
TCGv val);
 
-#define SSE_SPECIAL ((void *)1)
-#define SSE_DUMMY ((void *)2)
+#define SSE_OPF_V0(1 << 0) /* vex.v must be b (only 2 operands) */
+#define SSE_OPF_CMP   (1 << 1) /* does not write for first operand */
+#define SSE_OPF_BLENDV(1 << 2) /* blendv* instruction */
+#define SSE_OPF_SPECIAL   (1 << 3) /* magic */
+#define SSE_OPF_3DNOW (1 << 4) /* 3DNow! instruction */
+#define SSE_OPF_MMX   (1 << 5) /* MMX/integer/AVX2 instruction */
+#define SSE_OPF_SCALAR(1 << 6) /* Has SSE scalar variants */
+#define SSE_OPF_AVX2  (1 << 7) /* AVX2 instruction */
+#define SSE_OPF_SHUF  (1 << 9) /* pshufx/shufpx */
+
+#define OP(op, flags, a, b, c, d)   \
+{flags, {a, b, c, d} }
+
+#define MMX_OP(x) OP(op2, SSE_OPF_MMX, \
+gen_helper_ ## x ## _mmx, gen_helper_ ## x ## _xmm, NULL, NULL)
+
+#define SSE_FOP(name) OP(op2, SSE_OPF_SCALAR, \
+gen_helper_##name##ps, gen_helper_##name##pd, \
+gen_helper_##name##ss, gen_helper_##name##sd)
+#define SSE_OP(sname, dname, op, flags) OP(op, flags, \
+gen_helper_##sname##_xmm, gen_helper_##dname##_xmm, NULL, NULL)
+
+struct SSEOpHelper_table1 {
+int flags;
+SSEFunc_0_epp op[4];
+};
 
-#define MMX_OP2(x) { gen_helper_ ## x ## _mmx, gen_helper_ ## x ## _xmm }
-#define SSE_FOP(x) { gen_helper_ ## x ## ps, gen_helper_ ## x ## pd, \
- gen_helper_ ## x ## ss, gen_helper_ ## x ## sd, }
+#define SSE_3DNOW { SSE_OPF_3DNOW }
+#define SSE_SPECIAL { SSE_OPF_SPECIAL }
 
-static const SSEFunc_0_epp sse_op_table1[256][4] = {
+static const struct SSEOpHelper_table1 sse_op_table1[256] = {
 /* 3DNow! extensions */
-[0x0e] = { SSE_DUMMY }, /* femms */
-[0x0f] = { SSE_DUMMY }, /* pf... */
+[0x0e] = SSE_SPECIAL, /* femms */
+[0x0f] = SSE_3DNOW, /* pf... (sse_op_table5) */
 /* pure SSE operations */
-[0x10] = { SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL }, /* 
movups, movupd, movss, movsd */
-[0x11] = { SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL }, /* 
movups, movupd, movss, movsd */
-[0x12] = { SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL }, /* 
movlps, movlpd, movsldup, movddup */
-[0x13] = { SSE_SPECIAL, SSE_SPECIAL },  /* movlps, movlpd */
-[0x14] = { gen_helper_punpckldq_xmm, gen_helper_punpcklqdq_xmm },
-[0x15] = { gen_helper_punpckhdq_xmm, gen_helper_punpckhqdq_xmm },
-[0x16] = { SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL },  /* movhps, movhpd, 
movshdup */
-[0x17] = { SSE_SPECIAL, SSE_SPECIAL },  /* movhps, movhpd */
-
-[0x28] = { SSE_SPECIAL, SSE_SPECIAL },  /* movaps, movapd */
-[0x29] = { SSE_SPECIAL, SSE_SPECIAL },  /* movaps, movapd */
-[0x2a] = { SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL }, /* 
cvtpi2ps, cvtpi2pd, cvtsi2ss, cvtsi2sd */
-[0x2b] = { SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL }, /* 
movntps, movntpd, movntss, movntsd */
-[0x2c] = { SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL }, /* 
cvttps2pi, cvttpd2pi, cvttsd2si, cvttss2si */
-[0x2d] = { SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL, SSE_SPECIAL }, /* 
cvtps2pi, cvtpd2pi, cvtsd2si, cvtss2si */
-[0x2e] = { gen_helper_ucomiss, gen_helper_ucomisd },
-[0x2f] = { gen_helper_comiss, gen_helper_comisd },
-[0x50] = { SSE_SPECIAL, SSE_SPECIAL }, /* movmskps, movmskpd */
-[0x51] = SSE_FOP(sqrt),
-[0x52] = { gen_helper_rsqrtps, NULL, gen_helper_rsqrtss, NULL },
-[0x53] = { gen_helper_rcpps, NULL, gen_helper_rcpss, NULL },
-[0x54] = { gen_helper_pand_xmm, gen_helper_pand_xmm }, /* andps, andpd */
-[0x55] = { gen_helper_pandn_xmm, gen_helper_pandn_xmm }, /* andnps, andnpd 
*/
-[0x56] = { gen_helper_por_xmm, gen_helper_por_xmm }, /* orps, orpd */
-[0x57] = { gen_helper_pxor_xmm, gen_helper_pxor_xmm }, /* xorps, xorpd */
+[0x10] = SSE_SPECIAL, /* movups, movupd, movss, movsd */
+[0x11] = SSE_SPECIAL, /* movups, movupd, movss, movsd */
+[0x12] = SSE_SPECIAL, /* movlps, movlpd, movsldup, movddup */
+[0x13] = SSE_SPECIAL, /* movlps, movlpd */
+[0x14] = SSE_OP(punpckldq, punpcklqdq, op2, 0), /* unpcklps, unpcklpd */
+[0x15] = SSE_OP(punpckhdq, punpckhqdq, op2, 0), /* unpckhps, unpckhpd */
+

[PATCH v2 02/42] i386: DPPS rounding fix

2022-04-24 Thread Paul Brook
The DPPS (Dot Product) instruction is defined to first sum pairs of
intermediate results, then sum those values to get the final result.
i.e. (A+B)+(C+D)

We incrementally sum the results, i.e. ((A+B)+C)+D, which can result
in incorrect rounding.

For consistency, also remove the redundant (but harmless) add operation
from DPPD

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h | 47 +++
 1 file changed, 25 insertions(+), 22 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index 535440f882..a5a48a20f6 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -1934,32 +1934,36 @@ SSE_HELPER_I(helper_pblendw, W, 8, FBLENDP)
 
 void glue(helper_dpps, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, uint32_t mask)
 {
-float32 iresult = float32_zero;
+float32 prod, iresult, iresult2;
 
+/*
+ * We must evaluate (A+B)+(C+D), not ((A+B)+C)+D
+ * to correctly round the intermediate results
+ */
 if (mask & (1 << 4)) {
-iresult = float32_add(iresult,
-  float32_mul(d->ZMM_S(0), s->ZMM_S(0),
-  >sse_status),
-  >sse_status);
+iresult = float32_mul(d->ZMM_S(0), s->ZMM_S(0), >sse_status);
+} else {
+iresult = float32_zero;
 }
 if (mask & (1 << 5)) {
-iresult = float32_add(iresult,
-  float32_mul(d->ZMM_S(1), s->ZMM_S(1),
-  >sse_status),
-  >sse_status);
+prod = float32_mul(d->ZMM_S(1), s->ZMM_S(1), >sse_status);
+} else {
+prod = float32_zero;
 }
+iresult = float32_add(iresult, prod, >sse_status);
 if (mask & (1 << 6)) {
-iresult = float32_add(iresult,
-  float32_mul(d->ZMM_S(2), s->ZMM_S(2),
-  >sse_status),
-  >sse_status);
+iresult2 = float32_mul(d->ZMM_S(2), s->ZMM_S(2), >sse_status);
+} else {
+iresult2 = float32_zero;
 }
 if (mask & (1 << 7)) {
-iresult = float32_add(iresult,
-  float32_mul(d->ZMM_S(3), s->ZMM_S(3),
-  >sse_status),
-  >sse_status);
+prod = float32_mul(d->ZMM_S(3), s->ZMM_S(3), >sse_status);
+} else {
+prod = float32_zero;
 }
+iresult2 = float32_add(iresult2, prod, >sse_status);
+iresult = float32_add(iresult, iresult2, >sse_status);
+
 d->ZMM_S(0) = (mask & (1 << 0)) ? iresult : float32_zero;
 d->ZMM_S(1) = (mask & (1 << 1)) ? iresult : float32_zero;
 d->ZMM_S(2) = (mask & (1 << 2)) ? iresult : float32_zero;
@@ -1968,13 +1972,12 @@ void glue(helper_dpps, SUFFIX)(CPUX86State *env, Reg 
*d, Reg *s, uint32_t mask)
 
 void glue(helper_dppd, SUFFIX)(CPUX86State *env, Reg *d, Reg *s, uint32_t mask)
 {
-float64 iresult = float64_zero;
+float64 iresult;
 
 if (mask & (1 << 4)) {
-iresult = float64_add(iresult,
-  float64_mul(d->ZMM_D(0), s->ZMM_D(0),
-  >sse_status),
-  >sse_status);
+iresult = float64_mul(d->ZMM_D(0), s->ZMM_D(0), >sse_status);
+} else {
+iresult = float64_zero;
 }
 if (mask & (1 << 5)) {
 iresult = float64_add(iresult,
-- 
2.36.0




[PATCH v2 01/42] i386: pcmpestr 64-bit sign extension bug

2022-04-24 Thread Paul Brook
The abs1 function in ops_sse.h only works correctly when the result fits
in a signed int. This is fine most of the time because we're only dealing
with byte sized values.

However pcmp_elen helper function uses abs1 to calculate the absolute value
of a cpu register. This incorrectly truncates to 32 bits, and will give
the wrong answer for the most negative value.

Fix by open coding the saturation check before taking the absolute value.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index e4d74b814a..535440f882 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -2011,25 +2011,23 @@ SSE_HELPER_Q(helper_pcmpgtq, FCMPGTQ)
 
 static inline int pcmp_elen(CPUX86State *env, int reg, uint32_t ctrl)
 {
-int val;
+target_long val, limit;
 
 /* Presence of REX.W is indicated by a bit higher than 7 set */
 if (ctrl >> 8) {
-val = abs1((int64_t)env->regs[reg]);
+val = (target_long)env->regs[reg];
 } else {
-val = abs1((int32_t)env->regs[reg]);
+val = (int32_t)env->regs[reg];
 }
-
 if (ctrl & 1) {
-if (val > 8) {
-return 8;
-}
+limit = 8;
 } else {
-if (val > 16) {
-return 16;
-}
+limit = 16;
 }
-return val;
+if ((val > limit) || (val < -limit)) {
+return limit;
+}
+return abs1(val);
 }
 
 static inline int pcmp_ilen(Reg *r, uint8_t ctrl)
-- 
2.36.0




[PATCH v2 09/42] i386: Helper macro for 256 bit AVX helpers

2022-04-24 Thread Paul Brook
Once all the code is in place, 256 bit vector helpers will be generated by
including ops_sse.h a third time with SHIFT=2.

The first bit of support for this is to define a YMM_ONLY macro for code that
only applies to 256 bit vectors.  XMM_ONLY code will be executed for both
128 and 256 bit vectors.

Signed-off-by: Paul Brook 
---
 target/i386/ops_sse.h| 8 
 target/i386/ops_sse_header.h | 4 
 2 files changed, 12 insertions(+)

diff --git a/target/i386/ops_sse.h b/target/i386/ops_sse.h
index a5a48a20f6..23daab6b50 100644
--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -24,6 +24,7 @@
 #define Reg MMXReg
 #define SIZE 8
 #define XMM_ONLY(...)
+#define YMM_ONLY(...)
 #define B(n) MMX_B(n)
 #define W(n) MMX_W(n)
 #define L(n) MMX_L(n)
@@ -37,7 +38,13 @@
 #define W(n) ZMM_W(n)
 #define L(n) ZMM_L(n)
 #define Q(n) ZMM_Q(n)
+#if SHIFT == 1
 #define SUFFIX _xmm
+#define YMM_ONLY(...)
+#else
+#define SUFFIX _ymm
+#define YMM_ONLY(...) __VA_ARGS__
+#endif
 #endif
 
 /*
@@ -2337,6 +2344,7 @@ void glue(helper_aeskeygenassist, SUFFIX)(CPUX86State 
*env, Reg *d, Reg *s,
 
 #undef SHIFT
 #undef XMM_ONLY
+#undef YMM_ONLY
 #undef Reg
 #undef B
 #undef W
diff --git a/target/i386/ops_sse_header.h b/target/i386/ops_sse_header.h
index cef28f2aae..7e7f2cee2a 100644
--- a/target/i386/ops_sse_header.h
+++ b/target/i386/ops_sse_header.h
@@ -21,7 +21,11 @@
 #define SUFFIX _mmx
 #else
 #define Reg ZMMReg
+#if SHIFT == 1
 #define SUFFIX _xmm
+#else
+#define SUFFIX _ymm
+#endif
 #endif
 
 #define dh_alias_Reg ptr
-- 
2.36.0




Re: [PATCH v22 0/8] support dirty restraint on vCPU

2022-04-24 Thread Peter Xu
Hi, Yong,

On Mon, Apr 25, 2022 at 12:52:45AM +0800, Hyman wrote:
> Ping.
>  Hi, David and Peter, how do you think this patchset?
>  Is it suitable for queueing ? or is there still something need to be done ?

It keeps looking good to me in general, let's see whether the maintainers
have any comments.  Thanks,

-- 
Peter Xu




Re: [PATCH] hw/crypto: add Allwinner sun4i-ss crypto device

2022-04-24 Thread LABBE Corentin
Le Thu, Apr 21, 2022 at 01:38:00PM +0100, Peter Maydell a écrit :
> On Sun, 10 Apr 2022 at 20:12, Corentin Labbe  wrote:
> >
> > From: Corentin LABBE 
> >
> > The Allwinner A10 has a cryptographic offloader device which
> > could be easily emulated.
> > The emulated device is tested with Linux only, as none of the BSDs
> > support it.
> >
> > Signed-off-by: Corentin LABBE 
> 
> Hi; thanks for this patch, and sorry it's taken me a while to get
> to reviewing it.
> 
> (Daniel, I cc'd you since this device model is making use of crypto
> related APIs.)
> 
> Firstly, a note on patch structure. This is quite a large patch,
> and I think it would be useful to split it at least into two parts:
>  (1) add the new device model
>  (2) change the allwinner SoC to create that new device

Hello

I will do it for next iteration

> 
> > diff --git a/docs/system/arm/cubieboard.rst b/docs/system/arm/cubieboard.rst
> > index 344ff8cef9..7836643ba4 100644
> > --- a/docs/system/arm/cubieboard.rst
> > +++ b/docs/system/arm/cubieboard.rst
> > @@ -14,3 +14,4 @@ Emulated devices:
> >  - SDHCI
> >  - USB controller
> >  - SATA controller
> > +- crypto
> > diff --git a/docs/system/devices/allwinner-sun4i-ss.rst 
> > b/docs/system/devices/allwinner-sun4i-ss.rst
> > new file mode 100644
> > index 00..6e7d2142b5
> > --- /dev/null
> > +++ b/docs/system/devices/allwinner-sun4i-ss.rst
> > @@ -0,0 +1,31 @@
> > +Allwinner sun4i-ss
> > +==
> 
> If you create a new rst file in docs, you need to put it into the
> manual by adding it to some table of contents. Otherwise sphinx
> will complain when you build the documentation, and users won't be
> able to find it. (If you pass 'configure' the --enable-docs option
> that will check that you have everything installed to be able to
> build the docs.)
> 
> There are two options here: you can have this document, and
> add it to the toctree in docs/system/device-emulation.rst, and
> make the "crypto" bullet point in cubieboard.rst be a hyperlink to
> the device-emulation.rst file. Or you can compress the information
> down and put it into orangepi.rst.
> 
> > +The ``sun4i-ss`` emulates the Allwinner cryptographic offloader
> > +present on early Allwinner SoCs (A10, A10s, A13, A20, A33)
> > +In qemu only A10 via the cubieboard machine is supported.
> > +
> > +The emulated hardware is capable of handling the following algorithms:
> > +- SHA1 and MD5 hash algorithms
> > +- AES/DES/DES3 in CBC/ECB
> > +- PRNG
> > +
> > +The emulated hardware does not handle yet:
> > +- CTS for AES
> > +- CTR for AES/DES/DES3
> > +- IRQ and DMA mode
> > +Anyway the Linux driver also does not handle them yet.
> > +
> > +The emulation needs a real crypto backend, for the moment only 
> > gnutls/nettle is supported.
> > So the device emulation needs qemu to be compiled with the optional gnutls.
> 
> > diff --git a/hw/Kconfig b/hw/Kconfig
> > index ad20cce0a9..43bd7fc14d 100644
> > --- a/hw/Kconfig
> > +++ b/hw/Kconfig
> > @@ -6,6 +6,7 @@ source audio/Kconfig
> >  source block/Kconfig
> >  source char/Kconfig
> >  source core/Kconfig
> > +source crypto/Kconfig
> >  source display/Kconfig
> >  source dma/Kconfig
> >  source gpio/Kconfig
> 
> I don't think we really need a new subdirectory of hw/
> for a single device. If you can find two other devices that
> already exist in QEMU that would also belong in hw/crypto/
> then we can create it. Otherwise just put this device in
> hw/misc.

I plan to add at least one other hw/crypto device (allwinner H3 sun8i-ce).
I have another one already ready (rockchip rk3288) but I delay it since there 
are no related SoC in qemu yet.

> 
> > diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
> > index 97f3b38019..fd8232b1d4 100644
> > --- a/hw/arm/Kconfig
> > +++ b/hw/arm/Kconfig
> > @@ -317,6 +317,7 @@ config ALLWINNER_A10
> >  select AHCI
> >  select ALLWINNER_A10_PIT
> >  select ALLWINNER_A10_PIC
> > +select ALLWINNER_CRYPTO_SUN4I_SS
> >  select ALLWINNER_EMAC
> >  select SERIAL
> >  select UNIMP
> > diff --git a/hw/arm/allwinner-a10.c b/hw/arm/allwinner-a10.c
> > index 05e84728cb..e9104ee028 100644
> > --- a/hw/arm/allwinner-a10.c
> > +++ b/hw/arm/allwinner-a10.c
> > @@ -23,6 +23,7 @@
> >  #include "hw/misc/unimp.h"
> >  #include "sysemu/sysemu.h"
> >  #include "hw/boards.h"
> > +#include "hw/crypto/allwinner-sun4i-ss.h"
> >  #include "hw/usb/hcd-ohci.h"
> >
> >  #define AW_A10_MMC0_BASE0x01c0f000
> > @@ -32,6 +33,7 @@
> >  #define AW_A10_EMAC_BASE0x01c0b000
> >  #define AW_A10_EHCI_BASE0x01c14000
> >  #define AW_A10_OHCI_BASE0x01c14400
> > +#define AW_A10_CRYPTO_BASE  0x01c15000
> >  #define AW_A10_SATA_BASE0x01c18000
> >  #define AW_A10_RTC_BASE 0x01c20d00
> >
> > @@ -48,6 +50,10 @@ static void aw_a10_init(Object *obj)
> >
> >  object_initialize_child(obj, "emac", >emac, TYPE_AW_EMAC);
> >
> > +#if defined CONFIG_NETTLE
> > +object_initialize_child(obj, "crypto", >crypto, 

Re: [PATCH v2 2/5] 9pfs: fix qemu_mknodat(S_IFSOCK) on macOS

2022-04-24 Thread Christian Schoenebeck
On Samstag, 23. April 2022 06:33:50 CEST Akihiko Odaki wrote:
> On 2022/04/22 23:06, Christian Schoenebeck wrote:
> > On Freitag, 22. April 2022 04:43:40 CEST Akihiko Odaki wrote:
> >> On 2022/04/22 0:07, Christian Schoenebeck wrote:
> >>> mknod() on macOS does not support creating sockets, so divert to
> >>> call sequence socket(), bind() and chmod() respectively if S_IFSOCK
> >>> was passed with mode argument.
> >>> 
> >>> Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/
> >>> Signed-off-by: Christian Schoenebeck 
> >>> Reviewed-by: Will Cohen 
> >>> ---
> >>> 
> >>>hw/9pfs/9p-util-darwin.c | 27 ++-
> >>>1 file changed, 26 insertions(+), 1 deletion(-)
> >>> 
> >>> diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c
> >>> index e24d09763a..39308f2a45 100644
> >>> --- a/hw/9pfs/9p-util-darwin.c
> >>> +++ b/hw/9pfs/9p-util-darwin.c
> >>> @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char
> >>> *filename, const char *name,>
> >>> 
> >>> */
> >>>
> >>>#if defined CONFIG_PTHREAD_FCHDIR_NP
> >>> 
> >>> +static int create_socket_file_at_cwd(const char *filename, mode_t mode)
> >>> {
> >>> +int fd, err;
> >>> +struct sockaddr_un addr = {
> >>> +.sun_family = AF_UNIX
> >>> +};
> >>> +
> >>> +fd = socket(PF_UNIX, SOCK_DGRAM, 0);
> >>> +if (fd == -1) {
> >>> +return fd;
> >>> +}
> >>> +snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename);
> >> 
> >> It would result in an incorrect path if the path does not fit in
> >> addr.sun_path. It should report an explicit error instead.
> > 
> > Looking at its header file, 'sun_path' is indeed defined on macOS with an
> > oddly small size of only 104 bytes. So yes, I should explicitly handle
> > that
> > error case.
> > 
> > I'll post a v3.
> > 
> >>> +err = bind(fd, (struct sockaddr *) , sizeof(addr));
> >>> +if (err == -1) {
> >>> +goto out;
> >> 
> >> You may close(fd) as soon as bind() returns (before checking the
> >> returned value) and eliminate goto.
> > 
> > Yeah, I thought about that alternative, but found it a bit ugly, and
> > probably also counter-productive in case this function might get extended
> > with more error pathes in future. Not that I would insist on the current
> > solution though.
> 
> I'm happy with the explanation. Thanks.
> 
> >>> +}
> >>> +err = chmod(addr.sun_path, mode);
> >> 
> >> I'm not sure if it is fine to have a time window between bind() and
> >> chmod(). Do you have some rationale?
> > 
> > Good question. QEMU's 9p server is multi-threaded; all 9p requests come in
> > serialized and the 9p server controller portion (9p.c) is only running on
> > QEMU main thread, but the actual filesystem driver calls are then
> > dispatched to QEMU worker threads and therefore running concurrently at
> > this point:
> > 
> > https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines
> > 
> > Similar situation on Linux 9p client side: it handles access to a mounted
> > 9p filesystem concurrently, requests are then serialized by 9p driver on
> > Linux and sent over wire to 9p server (host).
> > 
> > So yes, there might be implications by that short time windows. But could
> > that be exploited on macOS hosts in practice?
> > 
> > The socket file would have mode srwxr-xr-x for a short moment.
> > 
> > For security_model=mapped* this should not be a problem.
> > 
> > For security_model=none|passhrough, in theory, maybe? But how likely is
> > that? If you are using a Linux client for instance, trying to brute-force
> > opening the socket file, the client would send several 9p commands
> > (Twalk, Tgetattr, Topen, probably more). The time window of the two
> > commands above should be much smaller than that and I would expect one of
> > the 9p commands to error out in between.
> > 
> > What would be a viable approach to avoid this issue on macOS?
> 
> It is unlikely that a naive brute-force approach will succeed to
> exploit. The more concerning scenario is that the attacker uses the
> knowledge of the underlying implementation of macOS to cause resource
> contention to widen the window. Whether an exploitation is viable
> depends on how much time you spend digging XNU.
> 
> However, I'm also not sure if it really *has* a race condition. Looking
> at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and
> s->ops->lstat(). It also results in an entity called "path name based
> fid" in the code, which inherently cannot identify a file when it is
> renamed or recreated.
> 
> If there is some rationale it is safe, it may also be applied to the
> sequence of bind() and chmod(). Can anyone explain the sequence of
> s->ops->mknod() and s->ops->lstat() or path name based fid in general?

You are talking about 9p server's controller level: I don't see something that 
would prevent a concurrent open() during this bind() ... chmod() time window 
unfortunately.

Argument 'fidp' passed to 

[PATCH v2 11/11] q800: add default vendor and product information for scsi-cd devices

2022-04-24 Thread Mark Cave-Ayland
The MacOS CDROM driver uses a SCSI INQUIRY command to check that any SCSI CDROMs
detected match a whitelist of vendors and products before adding them to the
list of available devices.

Add known-good default vendor and product information using the existing
compat_prop mechanism so the user doesn't have to use long command lines to set
the qdev properties manually.

Signed-off-by: Mark Cave-Ayland 
---
 hw/m68k/q800.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/hw/m68k/q800.c b/hw/m68k/q800.c
index abb549f8d8..8b34776c8e 100644
--- a/hw/m68k/q800.c
+++ b/hw/m68k/q800.c
@@ -692,6 +692,9 @@ static GlobalProperty hw_compat_q800[] = {
 { "scsi-hd", "product", "  ST225N" },
 { "scsi-hd", "ver", "1.0 " },
 { "scsi-cd", "quirk_mode_sense_rom_force_dbd", "on"},
+{ "scsi-cd", "vendor", "MATSHITA" },
+{ "scsi-cd", "product", "CD-ROM CR-8005" },
+{ "scsi-cd", "ver", "1.0k" },
 };
 static const size_t hw_compat_q800_len = G_N_ELEMENTS(hw_compat_q800);
 
-- 
2.20.1




Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-24 Thread Andy Lutomirski



On Fri, Apr 22, 2022, at 3:56 AM, Chao Peng wrote:
> On Tue, Apr 05, 2022 at 06:03:21PM +, Sean Christopherson wrote:
>> On Tue, Apr 05, 2022, Quentin Perret wrote:
>> > On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
> Only when the register succeeds, the fd is
> converted into a private fd, before that, the fd is just a normal (shared)
> one. During this conversion, the previous data is preserved so you can put
> some initial data in guest pages (whether the architecture allows this is
> architecture-specific and out of the scope of this patch).

I think this can be made to work, but it will be awkward.  On TDX, for example, 
what exactly are the semantics supposed to be?  An error code if the memory 
isn't all zero?  An error code if it has ever been written?

Fundamentally, I think this is because your proposed lifecycle for these 
memfiles results in a lightweight API but is awkward for the intended use 
cases.  You're proposing, roughly:

1. Create a memfile. 

Now it's in a shared state with an unknown virt technology.  It can be read and 
written.  Let's call this state BRAND_NEW.

2. Bind to a VM.

Now it's an a bound state.  For TDX, for example, let's call the new state 
BOUND_TDX.  In this state, the TDX rules are followed (private memory can't be 
converted, etc).

The problem here is that the BOUND_NEW state allows things that are nonsensical 
in TDX, and the binding step needs to invent some kind of semantics for what 
happens when binding a nonempty memfile.


So I would propose a somewhat different order:

1. Create a memfile.  It's in the UNBOUND state and no operations whatsoever 
are allowed except binding or closing.

2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in the 
initial state appropriate for that VM.

For TDX, this completely bypasses the cases where the data is prepopulated and 
TDX can't handle it cleanly.  For SEV, it bypasses a situation in which data 
might be written to the memory before we find out whether that data will be 
unreclaimable or unmovable.


--

Now I have a question, since I don't think anyone has really answered it: how 
does this all work with SEV- or pKVM-like technologies in which private and 
shared pages share the same address space?  I sounds like you're proposing to 
have a big memfile that contains private and shared pages and to use that same 
memfile as pages are converted back and forth.  IO and even real physical DMA 
could be done on that memfile.  Am I understanding correctly?

If so, I think this makes sense, but I'm wondering if the actual memslot setup 
should be different.  For TDX, private memory lives in a logically separate 
memslot space.  For SEV and pKVM, it doesn't.  I assume the API can reflect 
this straightforwardly.

And the corresponding TDX question: is the intent still that shared pages 
aren't allowed at all in a TDX memfile?  If so, that would be the most direct 
mapping to what the hardware actually does.

--Andy



[PATCH v2 09/11] scsi-disk: allow MODE SELECT block descriptor to set the ROM device block size

2022-04-24 Thread Mark Cave-Ayland
Whilst CDROM drives usually have a 2048 byte sector size, older drives have the
ability to switch between 2048 byte and 512 byte sector sizes by specifying a
block descriptor in the MODE SELECT command.

If a MODE SELECT block descriptor is provided, update the scsi-cd device block
size with the provided value accordingly.

This allows CDROMs to be used with A/UX whose driver only works with a 512 byte
sector size.

Signed-off-by: Mark Cave-Ayland 
---
 hw/scsi/scsi-disk.c  | 7 +++
 hw/scsi/trace-events | 1 +
 2 files changed, 8 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 6991493cf4..41ebbe3045 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1583,6 +1583,13 @@ static void scsi_disk_emulate_mode_select(SCSIDiskReq 
*r, uint8_t *inbuf)
 goto invalid_param;
 }
 
+/* Allow changing the block size of ROM devices */
+if (s->qdev.type == TYPE_ROM && bd_len &&
+p[6] != (s->qdev.blocksize >> 8)) {
+s->qdev.blocksize = p[6] << 8;
+trace_scsi_disk_mode_select_rom_set_blocksize(s->qdev.blocksize);
+}
+
 len -= bd_len;
 p += bd_len;
 
diff --git a/hw/scsi/trace-events b/hw/scsi/trace-events
index 25eae9f307..1a021ddae9 100644
--- a/hw/scsi/trace-events
+++ b/hw/scsi/trace-events
@@ -340,6 +340,7 @@ scsi_disk_dma_command_WRITE(const char *cmd, uint64_t lba, 
int len) "Write %s(se
 scsi_disk_new_request(uint32_t lun, uint32_t tag, const char *line) "Command: 
lun=%d tag=0x%x data=%s"
 scsi_disk_aio_sgio_command(uint32_t tag, uint8_t cmd, uint64_t lba, int len, 
uint32_t timeout) "disk aio sgio: tag=0x%x cmd=0x%x (sector %" PRId64 ", count 
%d) timeout=%u"
 scsi_disk_mode_select_page_truncated(int page, int len, int page_len) "page %d 
expected length %d but received length %d"
+scsi_disk_mode_select_rom_set_blocksize(int blocksize) "set ROM block size to 
%d"
 
 # scsi-generic.c
 scsi_generic_command_complete_noio(void *req, uint32_t tag, int statuc) 
"Command complete %p tag=0x%x status=%d"
-- 
2.20.1




[PATCH v2 08/11] scsi-disk: allow the MODE_PAGE_R_W_ERROR AWRE bit to be changeable for CDROM drives

2022-04-24 Thread Mark Cave-Ayland
A/UX sends a MODE_PAGE_R_W_ERROR command with the AWRE bit set to 0 when 
enumerating
CDROM drives. Since the bit is currently hardcoded to 1 then indicate that the 
AWRE
bit can be changed (even though we don't care about the value) so that
the MODE_PAGE_R_W_ERROR page can be set successfully.

Signed-off-by: Mark Cave-Ayland 
---
 hw/scsi/scsi-disk.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index c657e4f5da..6991493cf4 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1187,6 +1187,10 @@ static int mode_sense_page(SCSIDiskState *s, int page, 
uint8_t **p_outbuf,
 case MODE_PAGE_R_W_ERROR:
 length = 10;
 if (page_control == 1) { /* Changeable Values */
+if (s->qdev.type == TYPE_ROM) {
+/* Automatic Write Reallocation Enabled */
+p[0] = 0x80;
+}
 break;
 }
 p[0] = 0x80; /* Automatic Write Reallocation Enabled */
-- 
2.20.1




Re: [PATCH v22 0/8] support dirty restraint on vCPU

2022-04-24 Thread Hyman

Ping.
 Hi, David and Peter, how do you think this patchset?
 Is it suitable for queueing ? or is there still something need to be 
done ?


Yong

在 2022/4/1 1:49, huang...@chinatelecom.cn 写道:

From: Hyman Huang(黄勇) 

This is v22 of dirtylimit series.
The following is the history of the patchset, since v22 kind of different from
the original version, i made abstracts of changelog:

RFC and v1: 
https://lore.kernel.org/qemu-devel/cover.1637214721.git.huang...@chinatelecom.cn/
v2: 
https://lore.kernel.org/qemu-devel/cover.1637256224.git.huang...@chinatelecom.cn/
v1->v2 changelog:
- rename some function and variables. refactor the original algo of dirtylimit. 
Thanks for
   the comments given by Juan Quintela.
v3: 
https://lore.kernel.org/qemu-devel/cover.1637403404.git.huang...@chinatelecom.cn/
v4: 
https://lore.kernel.org/qemu-devel/cover.1637653303.git.huang...@chinatelecom.cn/
v5: 
https://lore.kernel.org/qemu-devel/cover.1637759139.git.huang...@chinatelecom.cn/
v6: 
https://lore.kernel.org/qemu-devel/cover.1637856472.git.huang...@chinatelecom.cn/
v7: 
https://lore.kernel.org/qemu-devel/cover.1638202004.git.huang...@chinatelecom.cn/
v2->v7 changelog:
- refactor the docs, annotation and fix bugs of the original algo of dirtylimit.
   Thanks for the review given by Markus Armbruster.
v8: 
https://lore.kernel.org/qemu-devel/cover.1638463260.git.huang...@chinatelecom.cn/
v9: 
https://lore.kernel.org/qemu-devel/cover.1638495274.git.huang...@chinatelecom.cn/
v10: 
https://lore.kernel.org/qemu-devel/cover.1639479557.git.huang...@chinatelecom.cn/
v7->v10 changelog:
- introduce a simpler but more efficient algo of dirtylimit inspired by Peter 
Xu.
- keep polishing the annotation suggested by Markus Armbruster.
v11: 
https://lore.kernel.org/qemu-devel/cover.1641315745.git.huang...@chinatelecom.cn/
v12: 
https://lore.kernel.org/qemu-devel/cover.1642774952.git.huang...@chinatelecom.cn/
v13: 
https://lore.kernel.org/qemu-devel/cover.1644506963.git.huang...@chinatelecom.cn/
v10->v13 changelog:
- handle the hotplug/unplug scenario.
- refactor the new algo, split the commit and make the code more clean.
v14: 
https://lore.kernel.org/qemu-devel/cover.1644509582.git.huang...@chinatelecom.cn/
v13->v14 changelog:
- sent by accident.
v15: 
https://lore.kernel.org/qemu-devel/cover.1644976045.git.huang...@chinatelecom.cn/
v16: 
https://lore.kernel.org/qemu-devel/cover.1645067452.git.huang...@chinatelecom.cn/
v17: 
https://lore.kernel.org/qemu-devel/cover.1646243252.git.huang...@chinatelecom.cn/
v14->v17 changelog:
- do some code clean and fix test bug reported by Dr. David Alan Gilbert.
v18: 
https://lore.kernel.org/qemu-devel/cover.1646247968.git.huang...@chinatelecom.cn/
v19: 
https://lore.kernel.org/qemu-devel/cover.1647390160.git.huang...@chinatelecom.cn/
v20: 
https://lore.kernel.org/qemu-devel/cover.1647396907.git.huang...@chinatelecom.cn/
v21: 
https://lore.kernel.org/qemu-devel/cover.1647435820.git.huang...@chinatelecom.cn/
v17->v21 changelog:
- add qtest, fix bug and do code clean.
v21->v22 changelog:
- move the vcpu dirty limit test into migration-test and do some modification 
suggested
   by Peter.

Please review.

Yong.

Abstract


This patchset introduce a mechanism to impose dirty restraint
on vCPU, aiming to keep the vCPU running in a certain dirtyrate
given by user. dirty restraint on vCPU maybe an alternative
method to implement convergence logic for live migration,
which could improve guest memory performance during migration
compared with traditional method in theory.

For the current live migration implementation, the convergence
logic throttles all vCPUs of the VM, which has some side effects.
-'read processes' on vCPU will be unnecessarily penalized
- throttle increase percentage step by step, which seems
   struggling to find the optimal throttle percentage when
   dirtyrate is high.
- hard to predict the remaining time of migration if the
   throttling percentage reaches 99%

to a certain extent, the dirty restraint mechanism can fix these
effects by throttling at vCPU granularity during migration.

the implementation is rather straightforward, we calculate
vCPU dirtyrate via the Dirty Ring mechanism periodically
as the commit 0e21bf246 "implement dirty-ring dirtyrate calculation"
does, for vCPU that be specified to impose dirty restraint,
we throttle it periodically as the auto-converge does, once after
throttling, we compare the quota dirtyrate with current dirtyrate,
if current dirtyrate is not under the quota, increase the throttling
percentage until current dirtyrate is under the quota.

this patchset is the basis of implementing a new auto-converge method
for live migration, we introduce two qmp commands for impose/cancel
the dirty restraint on specified vCPU, so it also can be an independent
api to supply the upper app such as libvirt, which can use it to
implement the convergence logic during live migration, supplemented
with the qmp 'calc-dirty-rate' command or whatever.

we post this 

[PATCH v2 10/11] q800: add default vendor and product information for scsi-hd devices

2022-04-24 Thread Mark Cave-Ayland
The Apple HD SC Setup program uses a SCSI INQUIRY command to check that any SCSI
hard disks detected match a whitelist of vendors and products before allowing
the "Initialise" button to prepare an empty disk.

Add known-good default vendor and product information using the existing
compat_prop mechanism so the user doesn't have to use long command lines to set
the qdev properties manually.

Signed-off-by: Mark Cave-Ayland 
---
 hw/m68k/q800.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/hw/m68k/q800.c b/hw/m68k/q800.c
index f27ed01785..abb549f8d8 100644
--- a/hw/m68k/q800.c
+++ b/hw/m68k/q800.c
@@ -688,6 +688,9 @@ static void q800_init(MachineState *machine)
 
 static GlobalProperty hw_compat_q800[] = {
 { "scsi-hd", "quirk_mode_page_apple_vendor", "on"},
+{ "scsi-hd", "vendor", " SEAGATE" },
+{ "scsi-hd", "product", "  ST225N" },
+{ "scsi-hd", "ver", "1.0 " },
 { "scsi-cd", "quirk_mode_sense_rom_force_dbd", "on"},
 };
 static const size_t hw_compat_q800_len = G_N_ELEMENTS(hw_compat_q800);
-- 
2.20.1




[PATCH v2 04/11] q800: implement compat_props to enable quirk_mode_page_apple_vendor for scsi-hd devices

2022-04-24 Thread Mark Cave-Ayland
By default quirk_mode_page_apple_vendor should be enabled for all scsi-hd 
devices
connected to the q800 machine to enable MacOS to detect and use them.

Signed-off-by: Mark Cave-Ayland 
---
 hw/m68k/q800.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/hw/m68k/q800.c b/hw/m68k/q800.c
index 099a758c6f..42bf7bb4f0 100644
--- a/hw/m68k/q800.c
+++ b/hw/m68k/q800.c
@@ -686,6 +686,11 @@ static void q800_init(MachineState *machine)
 }
 }
 
+static GlobalProperty hw_compat_q800[] = {
+{ "scsi-hd", "quirk_mode_page_apple_vendor", "on"},
+};
+static const size_t hw_compat_q800_len = G_N_ELEMENTS(hw_compat_q800);
+
 static void q800_machine_class_init(ObjectClass *oc, void *data)
 {
 MachineClass *mc = MACHINE_CLASS(oc);
@@ -695,6 +700,7 @@ static void q800_machine_class_init(ObjectClass *oc, void 
*data)
 mc->max_cpus = 1;
 mc->block_default_type = IF_SCSI;
 mc->default_ram_id = "m68k_mac.ram";
+compat_props_add(mc->compat_props, hw_compat_q800, hw_compat_q800_len);
 }
 
 static const TypeInfo q800_machine_typeinfo = {
-- 
2.20.1




[PATCH v2 03/11] scsi-disk: add MODE_PAGE_APPLE_VENDOR quirk for Macintosh

2022-04-24 Thread Mark Cave-Ayland
One of the mechanisms MacOS uses to identify drives compatible with MacOS is to
send a custom MODE SELECT command for page 0x30 to the drive. The response to
this is a hard-coded manufacturer string which must match in order for the
drive to be usable within MacOS.

Add an implementation of the MODE SELECT page 0x30 response guarded by a newly
defined SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR quirk bit so that drives attached
to non-Apple machines function exactly as before.

Signed-off-by: Mark Cave-Ayland 
---
 hw/scsi/scsi-disk.c  | 17 +
 include/hw/scsi/scsi.h   |  3 +++
 include/scsi/constants.h |  1 +
 3 files changed, 21 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index d89cdd4e4a..5de4506b97 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1085,6 +1085,7 @@ static int mode_sense_page(SCSIDiskState *s, int page, 
uint8_t **p_outbuf,
 [MODE_PAGE_R_W_ERROR]  = (1 << TYPE_DISK) | (1 << 
TYPE_ROM),
 [MODE_PAGE_AUDIO_CTL]  = (1 << TYPE_ROM),
 [MODE_PAGE_CAPABILITIES]   = (1 << TYPE_ROM),
+[MODE_PAGE_APPLE_VENDOR]   = (1 << TYPE_ROM),
 };
 
 uint8_t *p = *p_outbuf + 2;
@@ -1229,6 +1230,20 @@ static int mode_sense_page(SCSIDiskState *s, int page, 
uint8_t **p_outbuf,
 p[19] = (16 * 176) & 0xff;
 break;
 
+ case MODE_PAGE_APPLE_VENDOR:
+if (s->quirks & (1 << SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR)) {
+length = 0x24;
+if (page_control == 1) { /* Changeable Values */
+break;
+}
+
+memset(p, 0, length);
+strcpy((char *)p + 8, "APPLE COMPUTER, INC   ");
+break;
+} else {
+return -1;
+}
+
 default:
 return -1;
 }
@@ -3042,6 +3057,8 @@ static Property scsi_hd_properties[] = {
 DEFINE_PROP_UINT16("rotation_rate", SCSIDiskState, rotation_rate, 0),
 DEFINE_PROP_INT32("scsi_version", SCSIDiskState, qdev.default_scsi_version,
   5),
+DEFINE_PROP_BIT("quirk_mode_page_apple_vendor", SCSIDiskState, quirks,
+SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR, 0),
 DEFINE_BLOCK_CHS_PROPERTIES(SCSIDiskState, qdev.conf),
 DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 1ffb367f94..975d462347 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -226,4 +226,7 @@ SCSIDevice *scsi_device_get(SCSIBus *bus, int channel, int 
target, int lun);
 /* scsi-generic.c. */
 extern const SCSIReqOps scsi_generic_req_ops;
 
+/* scsi-disk.c */
+#define SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR 0
+
 #endif
diff --git a/include/scsi/constants.h b/include/scsi/constants.h
index 2a32c08b5e..891aa0f45c 100644
--- a/include/scsi/constants.h
+++ b/include/scsi/constants.h
@@ -234,6 +234,7 @@
 #define MODE_PAGE_FAULT_FAIL  0x1c
 #define MODE_PAGE_TO_PROTECT  0x1d
 #define MODE_PAGE_CAPABILITIES0x2a
+#define MODE_PAGE_APPLE_VENDOR0x30
 #define MODE_PAGE_ALLS0x3f
 /* Not in Mt. Fuji, but in ATAPI 2.6 -- deprecated now in favor
  * of MODE_PAGE_SENSE_POWER */
-- 
2.20.1




[PATCH v2 05/11] scsi-disk: add SCSI_DISK_QUIRK_MODE_SENSE_ROM_FORCE_DBD quirk for Macintosh

2022-04-24 Thread Mark Cave-Ayland
During SCSI bus enumeration A/UX sends a MODE SENSE command to the CDROM and
expects the response to include a block descriptor. As per the latest SCSI
documentation, QEMU currently force-disables the block descriptor for CDROM
devices but the A/UX driver expects the block descriptor to always be
returned.

If the block descriptor is not returned in the response then A/UX becomes
confused, since the block descriptor returned in the MODE SENSE response is
used to generate a subsequent MODE SELECT command which is then invalid.

Add a new SCSI_DISK_QUIRK_MODE_SENSE_ROM_FORCE_DBD to allow this behaviour
to be enabled as required.

Signed-off-by: Mark Cave-Ayland 
---
 hw/scsi/scsi-disk.c| 18 +-
 include/hw/scsi/scsi.h |  1 +
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 5de4506b97..71fdf132c1 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1279,10 +1279,17 @@ static int scsi_disk_emulate_mode_sense(SCSIDiskReq *r, 
uint8_t *outbuf)
 dev_specific_param |= 0x80; /* Readonly.  */
 }
 } else {
-/* MMC prescribes that CD/DVD drives have no block descriptors,
- * and defines no device-specific parameter.  */
-dev_specific_param = 0x00;
-dbd = true;
+if (s->quirks & (1 << SCSI_DISK_QUIRK_MODE_SENSE_ROM_FORCE_DBD)) {
+dev_specific_param = 0x00;
+dbd = false;
+} else {
+/*
+ * MMC prescribes that CD/DVD drives have no block descriptors,
+ * and defines no device-specific parameter.
+ */
+dev_specific_param = 0x00;
+dbd = true;
+}
 }
 
 if (r->req.cmd.buf[0] == MODE_SENSE) {
@@ -1578,7 +1585,6 @@ static void scsi_disk_emulate_mode_select(SCSIDiskReq *r, 
uint8_t *inbuf)
 /* Ensure no change is made if there is an error!  */
 for (pass = 0; pass < 2; pass++) {
 if (mode_select_pages(r, p, len, pass == 1) < 0) {
-assert(pass == 0);
 return;
 }
 }
@@ -3107,6 +3113,8 @@ static Property scsi_cd_properties[] = {
DEFAULT_MAX_IO_SIZE),
 DEFINE_PROP_INT32("scsi_version", SCSIDiskState, qdev.default_scsi_version,
   5),
+DEFINE_PROP_BIT("quirk_mode_sense_rom_force_dbd", SCSIDiskState, quirks,
+SCSI_DISK_QUIRK_MODE_SENSE_ROM_FORCE_DBD, 0),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
index 975d462347..a9e657e03c 100644
--- a/include/hw/scsi/scsi.h
+++ b/include/hw/scsi/scsi.h
@@ -228,5 +228,6 @@ extern const SCSIReqOps scsi_generic_req_ops;
 
 /* scsi-disk.c */
 #define SCSI_DISK_QUIRK_MODE_PAGE_APPLE_VENDOR 0
+#define SCSI_DISK_QUIRK_MODE_SENSE_ROM_FORCE_DBD   1
 
 #endif
-- 
2.20.1




[PATCH v2 06/11] q800: implement compat_props to enable quirk_mode_sense_rom_force_dbd for scsi-cd devices

2022-04-24 Thread Mark Cave-Ayland
By default quirk_mode_sense_rom_force_dbd should be enabled for all scsi-cd 
devices
connected to the q800 machine to correctly report the CDROM block descriptor 
back
to A/UX.

Signed-off-by: Mark Cave-Ayland 
---
 hw/m68k/q800.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/m68k/q800.c b/hw/m68k/q800.c
index 42bf7bb4f0..f27ed01785 100644
--- a/hw/m68k/q800.c
+++ b/hw/m68k/q800.c
@@ -688,6 +688,7 @@ static void q800_init(MachineState *machine)
 
 static GlobalProperty hw_compat_q800[] = {
 { "scsi-hd", "quirk_mode_page_apple_vendor", "on"},
+{ "scsi-cd", "quirk_mode_sense_rom_force_dbd", "on"},
 };
 static const size_t hw_compat_q800_len = G_N_ELEMENTS(hw_compat_q800);
 
-- 
2.20.1




[PATCH v2 07/11] scsi-disk: allow truncated MODE SELECT requests

2022-04-24 Thread Mark Cave-Ayland
When A/UX configures the CDROM device it sends a truncated MODE SELECT request
for page 1 (MODE_PAGE_R_W_ERROR) which is only 6 bytes in length rather than
10. This seems to be due to a bug in Apple's code which calculates the CDB message
length incorrectly.

According to [1] this truncated request is accepted on real hardware whereas in
QEMU it generates an INVALID_PARAM_LEN sense code which causes A/UX to get stuck
in a loop retrying the command in an attempt to succeed.

Alter the mode page request length check so that truncated requests are allowed
as per real hardware, adding a trace event to enable the condition to be 
detected.

[1] 
https://68kmla.org/bb/index.php?threads/scsi2sd-project-anyone-interested.29040/page-7#post-316444

Signed-off-by: Mark Cave-Ayland 
---
 hw/scsi/scsi-disk.c  | 2 +-
 hw/scsi/trace-events | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 71fdf132c1..c657e4f5da 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1525,7 +1525,7 @@ static int mode_select_pages(SCSIDiskReq *r, uint8_t *p, 
int len, bool change)
 goto invalid_param;
 }
 if (page_len > len) {
-goto invalid_param_len;
+trace_scsi_disk_mode_select_page_truncated(page, page_len, len);
 }
 
 if (!change) {
diff --git a/hw/scsi/trace-events b/hw/scsi/trace-events
index e91b55a961..25eae9f307 100644
--- a/hw/scsi/trace-events
+++ b/hw/scsi/trace-events
@@ -339,6 +339,7 @@ scsi_disk_dma_command_READ(uint64_t lba, uint32_t len) 
"Read (sector %" PRId64 "
 scsi_disk_dma_command_WRITE(const char *cmd, uint64_t lba, int len) "Write 
%s(sector %" PRId64 ", count %u)"
 scsi_disk_new_request(uint32_t lun, uint32_t tag, const char *line) "Command: 
lun=%d tag=0x%x data=%s"
 scsi_disk_aio_sgio_command(uint32_t tag, uint8_t cmd, uint64_t lba, int len, 
uint32_t timeout) "disk aio sgio: tag=0x%x cmd=0x%x (sector %" PRId64 ", count 
%d) timeout=%u"
+scsi_disk_mode_select_page_truncated(int page, int len, int page_len) "page %d 
expected length %d but received length %d"
 
 # scsi-generic.c
 scsi_generic_command_complete_noio(void *req, uint32_t tag, int statuc) 
"Command complete %p tag=0x%x status=%d"
-- 
2.20.1




[PATCH v2 01/11] scsi-disk: add FORMAT UNIT command

2022-04-24 Thread Mark Cave-Ayland
When initialising a drive ready to install MacOS, Apple HD SC Setup first 
attempts
to format the drive. Add a simple FORMAT UNIT command which simply returns 
success
to allow the format to succeed.

Signed-off-by: Mark Cave-Ayland 
---
 hw/scsi/scsi-disk.c  | 4 
 hw/scsi/trace-events | 1 +
 2 files changed, 5 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 072686ed58..090679f3b5 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2127,6 +2127,9 @@ static int32_t scsi_disk_emulate_command(SCSIRequest 
*req, uint8_t *buf)
 trace_scsi_disk_emulate_command_WRITE_SAME(
 req->cmd.buf[0] == WRITE_SAME_10 ? 10 : 16, r->req.cmd.xfer);
 break;
+case FORMAT_UNIT:
+trace_scsi_disk_emulate_command_FORMAT_UNIT(r->req.cmd.xfer);
+break;
 default:
 trace_scsi_disk_emulate_command_UNKNOWN(buf[0],
 scsi_command_name(buf[0]));
@@ -2533,6 +2536,7 @@ static const SCSIReqOps *const 
scsi_disk_reqops_dispatch[256] = {
 [VERIFY_10]   = _disk_emulate_reqops,
 [VERIFY_12]   = _disk_emulate_reqops,
 [VERIFY_16]   = _disk_emulate_reqops,
+[FORMAT_UNIT] = _disk_emulate_reqops,
 
 [READ_6]  = _disk_dma_reqops,
 [READ_10] = _disk_dma_reqops,
diff --git a/hw/scsi/trace-events b/hw/scsi/trace-events
index 20fb0dc162..e91b55a961 100644
--- a/hw/scsi/trace-events
+++ b/hw/scsi/trace-events
@@ -334,6 +334,7 @@ scsi_disk_emulate_command_UNMAP(size_t xfer) "Unmap (len 
%zd)"
 scsi_disk_emulate_command_VERIFY(int bytchk) "Verify (bytchk %d)"
 scsi_disk_emulate_command_WRITE_SAME(int cmd, size_t xfer) "WRITE SAME %d (len 
%zd)"
 scsi_disk_emulate_command_UNKNOWN(int cmd, const char *name) "Unknown SCSI 
command (0x%2.2x=%s)"
+scsi_disk_emulate_command_FORMAT_UNIT(size_t xfer) "Format Unit (len %zd)"
 scsi_disk_dma_command_READ(uint64_t lba, uint32_t len) "Read (sector %" PRId64 
", count %u)"
 scsi_disk_dma_command_WRITE(const char *cmd, uint64_t lba, int len) "Write 
%s(sector %" PRId64 ", count %u)"
 scsi_disk_new_request(uint32_t lun, uint32_t tag, const char *line) "Command: 
lun=%d tag=0x%x data=%s"
-- 
2.20.1




[PATCH v2 00/11] scsi: add quirks and features to support m68k Macs

2022-04-24 Thread Mark Cave-Ayland
Here are the next set of patches from my ongoing work to allow the q800
machine to boot MacOS related to SCSI devices.

The first patch implements a dummy FORMAT UNIT command which is used by
the Apple HD SC Setup program when preparing an empty disk to install
MacOS.

Patch 2 adds a new quirks bitmap to SCSIDiskState to allow buggy and/or
legacy features to enabled on an individual device basis. Once the quirks
bitmap has been added, patch 3 uses the quirks feature to implement an
Apple-specific mode page which is required to allow the disk to be recognised
and used by Apple HD SC Setup.

Patch 4 adds compat_props to the q800 machine which enable the new
MODE_PAGE_APPLE_VENDOR quirk for all scsi-hd devices attached to the machine.

Patch 5 adds a new quirk to force SCSI CDROMs to always return the block
descriptor for a MODE SENSE command which is expected by A/UX, whilst patch 6
enables the quirk for all scsi-cd devices on the q800 machine.

Patch 7 adds support for truncated MODE SELECT requests which are sent by
A/UX (and also MacOS in some circumstances) when enumerating a SCSI CDROM device
which are shown to be accepted on real hardware as documented in [1].

Patch 8 allows the MODE_PAGE_R_W_ERROR AWRE bit to be changeable since the A/UX
MODE SELECT request sets this bit to 0 rather than the QEMU default which is 1.

Patch 9 adds support for setting the CDROM block size via a MODE SELECT request
which is supported by older CDROMs to allow the block size to be changed from
the default of 2048 bytes to 512 bytes for compatibility purposes. This is used
by A/UX which otherwise fails with SCSI errors if the block size is not set to
512 bytes when accessing CDROMs.

Finally patches 10 and 11 augment the compat_props to set the default vendor,
product and version information for all scsi-hd and scsi-cd devices attached
to the q800 machine, taken from real drives. This is because MacOS will only
allow a known set of SCSI devices to be recognised during the installation
process.

Signed-off-by: Mark Cave-Ayland 

[1] 
https://68kmla.org/bb/index.php?threads/scsi2sd-project-anyone-interested.29040/page-7#post-316444


v2:
- Change patchset title from "scsi: add support for FORMAT UNIT command and 
quirks"
  to "scsi: add quirks and features to support m68k Macs"
- Fix missing shift in patch 2 as pointed out by Fam
- Rename MODE_PAGE_APPLE to MODE_PAGE_APPLE_VENDOR
- Add SCSI_DISK_QUIRK_MODE_SENSE_ROM_FORCE_DBD quirk
- Add support for truncated MODE SELECT requests
- Allow MODE_PAGE_R_W_ERROR AWRE bit to be changeable for CDROM devices
- Allow the MODE SELECT block descriptor to set the CDROM block size


Mark Cave-Ayland (11):
  scsi-disk: add FORMAT UNIT command
  scsi-disk: add new quirks bitmap to SCSIDiskState
  scsi-disk: add MODE_PAGE_APPLE_VENDOR quirk for Macintosh
  q800: implement compat_props to enable quirk_mode_page_apple_vendor
for scsi-hd devices
  scsi-disk: add SCSI_DISK_QUIRK_MODE_SENSE_ROM_FORCE_DBD quirk for
Macintosh
  q800: implement compat_props to enable quirk_mode_sense_rom_force_dbd
for scsi-cd devices
  scsi-disk: allow truncated MODE SELECT requests
  scsi-disk: allow the MODE_PAGE_R_W_ERROR AWRE bit to be changeable for
CDROM drives
  scsi-disk: allow MODE SELECT block descriptor to set the ROM device
block size
  q800: add default vendor and product information for scsi-hd devices
  q800: add default vendor and product information for scsi-cd devices

 hw/m68k/q800.c   | 13 ++
 hw/scsi/scsi-disk.c  | 53 +++-
 hw/scsi/trace-events |  3 +++
 include/hw/scsi/scsi.h   |  4 +++
 include/scsi/constants.h |  1 +
 5 files changed, 68 insertions(+), 6 deletions(-)

-- 
2.20.1




[PATCH v2 02/11] scsi-disk: add new quirks bitmap to SCSIDiskState

2022-04-24 Thread Mark Cave-Ayland
Since the MacOS SCSI implementation is quite old (and Apple added some firmware
customisations to their drives for m68k Macs) there is need to add a mechanism
to correctly handle Apple-specific quirks.

Add a new quirks bitmap to SCSIDiskState that can be used to enable these
features as required.

Signed-off-by: Mark Cave-Ayland 
---
 hw/scsi/scsi-disk.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 090679f3b5..d89cdd4e4a 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -94,6 +94,7 @@ struct SCSIDiskState {
 uint16_t port_index;
 uint64_t max_unmap_size;
 uint64_t max_io_size;
+uint32_t quirks;
 QEMUBH *bh;
 char *version;
 char *serial;
-- 
2.20.1




Re: [PATCH 0/6] scsi: add support for FORMAT UNIT command and quirks

2022-04-24 Thread Mark Cave-Ayland

On 21/04/2022 07:51, Mark Cave-Ayland wrote:


Here are the next set of patches from my ongoing work to allow the q800
machine to boot MacOS related to SCSI devices.

The first patch implements a dummy FORMAT UNIT command which is used by
the Apple HD SC Setup program when preparing an empty disk to install
MacOS.

Patch 2 adds a new quirks bitmap to SCSIDiskState to allow buggy and/or
legacy features to be enabled on an individual device basis. Once the quirks
bitmap has been added, patch 3 uses the quirks feature to implement an
Apple-specific mode page which is required to allow the disk to be recognised
and used by Apple HD SC Setup.

Patch 4 adds compat_props to the q800 machine which enable the MODE_PAGE_APPLE
quirk for all scsi-hd devices attached to the machine.

Finally patches 5 and 6 augment the compat_props to set the default vendor,
product and version information for all scsi-hd and scsi-cd devices attached
to the q800 machine, taken from real drives. This is because MacOS will only
allow a known set of SCSI devices to be recognised during the installation
process.

Signed-off-by: Mark Cave-Ayland 


Mark Cave-Ayland (6):
   scsi-disk: add FORMAT UNIT command
   scsi-disk: add new quirks bitmap to SCSIDiskState
   scsi-disk: add MODE_PAGE_APPLE quirk for Macintosh
   q800: implement compat_props to enable quirk_mode_page_apple for
 scsi-hd devices
   q800: add default vendor, product and version information for scsi-hd
 devices
   q800: add default vendor, product and version information for scsi-cd
 devices

  hw/m68k/q800.c   | 12 
  hw/scsi/scsi-disk.c  | 24 
  hw/scsi/trace-events |  1 +
  include/hw/scsi/scsi.h   |  3 +++
  include/scsi/constants.h |  1 +
  5 files changed, 41 insertions(+)


I was fortunate enough to find a really good reference to some work done over on 
68kmla.org reverse engineering Apple's HD SC Setup and SCSI device detection. This 
pointed me towards a couple of additional SCSI changes for QEMU that also fix CDROM 
access under A/UX which I shall include in an updated v2.



ATB,

Mark.



Re: [PATCH 3/6] scsi-disk: add MODE_PAGE_APPLE quirk for Macintosh

2022-04-24 Thread Mark Cave-Ayland

On 21/04/2022 23:00, BALATON Zoltan wrote:


On Thu, 21 Apr 2022, Richard Henderson wrote:

On 4/21/22 08:29, Mark Cave-Ayland wrote:

You need (1 << SCSI_DISK_QUIRK_MODE_PAGE_APPLE) instead.


Doh, you're absolutely right. I believe the current recommendation is to use the 
BIT() macro in these cases.


I think it's not a recommendation (as in code style) but it often makes things 
simpler by reducing the number of parenthesis so using it is probably a good idea for 
readability. But if you never need the bit number only the value then you could 
define the quirks constants as that in the first place. (Otherwise if you want bit 
numbers maybe make it an enum.)



We probably need to fix BIT() to use 1ULL.

At present it's using 1UL, to match the other (unfortunate) uses of unsigned long 
within bitops.h.  The use of BIT() for things unrelated to bitops.h just bit a 
recent risc-v pull request, in that it failed to build on all 32-bit hosts.


There's already a BIT_ULL(nr) when ULL is needed but in this case quirks was declared 
uint32_t so probably OK with UL as well. (Was this bitops.h taken from Linux? Keeping 
it compatible then may be a good idea to avoid confusion.)


It seems there is still a bit of discussion around using BIT() here, so for v2 I'll 
add the shift directly with (1 << x). Then if the BIT() macro becomes suitable for 
more general use it can easily be updated as a separate patch later.



ATB,

Mark.



Re: [PATCH v2 17/34] configure: move Windows flags detection to meson

2022-04-24 Thread Marc-André Lureau
On Sat, Apr 23, 2022 at 5:09 PM Paolo Bonzini  wrote:

> Signed-off-by: Paolo Bonzini 
>

Reviewed-by: Marc-André Lureau 


> ---
> v1->v2: fix get_option('optimization') comparison to use a string
>
>  configure   | 20 
>  meson.build |  8 
>  2 files changed, 8 insertions(+), 20 deletions(-)
>
> diff --git a/configure b/configure
> index 0b236fda59..a6ba59cf6f 100755
> --- a/configure
> +++ b/configure
> @@ -224,10 +224,6 @@ glob() {
>  eval test -z '"${1#'"$2"'}"'
>  }
>
> -ld_has() {
> -$ld --help 2>/dev/null | grep ".$1" >/dev/null 2>&1
> -}
> -
>  if printf %s\\n "$source_path" "$PWD" | grep -q "[[:space:]:]";
>  then
>error_exit "main directory cannot contain spaces nor colons"
> @@ -2088,22 +2084,6 @@ if test "$solaris" = "no" && test "$tsan" = "no";
> then
>  fi
>  fi
>
> -# Use ASLR, no-SEH and DEP if available
> -if test "$mingw32" = "yes" ; then
> -flags="--no-seh --nxcompat"
> -
> -# Disable ASLR for debug builds to allow debugging with gdb
> -if test "$debug" = "no" ; then
> -flags="--dynamicbase $flags"
> -fi
> -
> -for flag in $flags; do
> -if ld_has $flag ; then
> -QEMU_LDFLAGS="-Wl,$flag $QEMU_LDFLAGS"
> -fi
> -done
> -fi
> -
>  # Guest agent Windows MSI package
>
>  if test "$QEMU_GA_MANUFACTURER" = ""; then
> diff --git a/meson.build b/meson.build
> index 1a9549d90c..d569c6e944 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -182,6 +182,14 @@ qemu_cxxflags = config_host['QEMU_CXXFLAGS'].split()
>  qemu_objcflags = config_host['QEMU_OBJCFLAGS'].split()
>  qemu_ldflags = config_host['QEMU_LDFLAGS'].split()
>
> +if targetos == 'windows'
> +  qemu_ldflags += cc.get_supported_link_arguments('-Wl,--no-seh',
> '-Wl,--nxcompat')
> +  # Disable ASLR for debug builds to allow debugging with gdb
> +  if get_option('optimization') == '0'
> +qemu_ldflags += cc.get_supported_link_arguments('-Wl,--dynamicbase')
> +  endif
> +endif
> +
>  if get_option('gprof')
>qemu_cflags += ['-p']
>qemu_cxxflags += ['-p']
> --
> 2.35.1
>
>
>
>

-- 
Marc-André Lureau


Re: [PATCH v2] error-report: fix g_date_time_format assertion

2022-04-24 Thread Marc-André Lureau
On Sun, Apr 24, 2022 at 3:27 PM Haiyue Wang  wrote:

> The 'g_get_real_time' returns the number of microseconds since January
> 1, 1970 UTC, but 'g_date_time_new_from_unix_utc' needs the number of
> seconds, so it will cause the invalid time input:
>
> (process:279642): GLib-CRITICAL (recursed) **: g_date_time_format:
> assertion 'datetime != NULL' failed
>
> Call function 'g_date_time_new_now_utc' instead, it has the same result
> as 'g_date_time_new_from_unix_utc(g_get_real_time() / G_USEC_PER_SEC)';
>
> Fixes: 73dab893b569 ("error-report: replace deprecated
> g_get_current_time() with glib >= 2.62")
> Signed-off-by: Haiyue Wang 
>

Thanks, my bad
Reviewed-by: Marc-André Lureau 


> ---
> v2: use 'g_date_time_new_now_utc' directly, which handles the time
> zone reference correctly.
> ---
>  util/error-report.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/util/error-report.c b/util/error-report.c
> index dbadaf206d..5edb2e6040 100644
> --- a/util/error-report.c
> +++ b/util/error-report.c
> @@ -173,7 +173,7 @@ static char *
>  real_time_iso8601(void)
>  {
>  #if GLIB_CHECK_VERSION(2,62,0)
> -g_autoptr(GDateTime) dt =
> g_date_time_new_from_unix_utc(g_get_real_time());
> +g_autoptr(GDateTime) dt = g_date_time_new_now_utc();
>  /* ignore deprecation warning, since GLIB_VERSION_MAX_ALLOWED is 2.56
> */
>  #pragma GCC diagnostic push
>  #pragma GCC diagnostic ignored "-Wdeprecated-declarations"
> --
> 2.36.0
>
>
>

-- 
Marc-André Lureau


Re: Possible bug when setting aarch64 watchpoints

2022-04-24 Thread Chris Howard
Sorry, I need to correct my previous post:



If I set

DBGWVR0_EL1 = 1<<23 // ie. 0x00800000

and

DBGWCR0_EL1 = 0x17<<24 | 0xFF<<5 | 0b11<<3 | 0b11<<1 | 0b1<<0   // ie. 
MASK = 23 = 0b10111

and then access  memory [0x0080007F]  I get a watchpoint exception. (ie. 
watchpoints ARE working/enabled)

But if I access [0x00800080] I *don’t* get an exception.

**If the MASK field gets set to 0b0111 instead of 0b10111 then only the bottom 
7 bits of the address get masked (instead of 23) and the masked address isn’t 
0x00800000, and the exception won’t be triggered.**

(if I *attempt* to set the MASK to 0b11111, but it actually gets set to 
0b01111, then I get the behaviour quoted below).


> On 24. Apr 2022, at 13:40, Chris Howard  wrote:
> 
> Hi, I’m new to qemu (and even bug-reporting) so apologies in advance…
> 
> The MASK field in DBGWCRx_EL1 is **5** bits wide [28:24].
> 
> In target/arm/kvm64.c I found the line:
> 
> wp.wcr = deposit32(wp.wcr, 24, 4, bits);  // ie **4** bits 
> instead of **5**
> 
> 
> If it’s not copying (or calculating?) the number of bits correctly this would 
> explain the behaviour I’m seeing:
> 
> If I set
> 
> DBGWVR0_EL1 = 0x00800000
> 
> and
> 
> DBGWCR0_EL1 = 0x1F<<24 | 0xFF<<5 | 0b11<<3 | 0b11<<1 | 0b1<<0
> 
> and then access  memory [0x00807FFF]  I get a watchpoint exception. (ie. 
> watchpoints ARE working/enabled)
> 
> But if I access [0x00808000] I *don’t* get an exception.
> 
> **If the MASK field gets set to 0b01111 instead of 0b11111 then only the 
> bottom 15 bits of the address get masked (instead of 31) and the masked 
> address isn’t 0x00800000, and the exception won’t be triggered.**
> 
> 
> Unfortunately, changing the 4 to a 5 and recompiling had no effect :-(
> 
> I may well have misunderstood something. :-/
> 
> —Chris




Possible bug when setting aarch64 watchpoints

2022-04-24 Thread Chris Howard
Hi, I’m new to qemu (and even bug-reporting) so apologies in advance…

The MASK field in DBGWCRx_EL1 is **5** bits wide [28:24].

In target/arm/kvm64.c I found the line:

 wp.wcr = deposit32(wp.wcr, 24, 4, bits);   // ie **4** bits 
instead of **5**


If it’s not copying (or calculating?) the number of bits correctly this would 
explain the behaviour I’m seeing:

If I set

DBGWVR0_EL1 = 0x00800000

and

DBGWCR0_EL1 = 0x1F<<24 | 0xFF<<5 | 0b11<<3 | 0b11<<1 | 0b1<<0

and then access  memory [0x00807FFF]  I get a watchpoint exception. (ie. 
watchpoints ARE working/enabled)

But if I access [0x00808000] I *don’t* get an exception.

**If the MASK field gets set to 0b01111 instead of 0b11111 then only the bottom 
15 bits of the address get masked (instead of 31) and the masked address isn’t 
0x00800000, and the exception won’t be triggered.**


Unfortunately, changing the 4 to a 5 and recompiling had no effect :-(

I may well have misunderstood something. :-/

—Chris


[PATCH v2] error-report: fix g_date_time_format assertion

2022-04-24 Thread Haiyue Wang
The 'g_get_real_time' returns the number of microseconds since January
1, 1970 UTC, but 'g_date_time_new_from_unix_utc' needs the number of
seconds, so it will cause the invalid time input:

(process:279642): GLib-CRITICAL (recursed) **: g_date_time_format: assertion 
'datetime != NULL' failed

Call function 'g_date_time_new_now_utc' instead, it has the same result
as 'g_date_time_new_from_unix_utc(g_get_real_time() / G_USEC_PER_SEC)';

Fixes: 73dab893b569 ("error-report: replace deprecated g_get_current_time() 
with glib >= 2.62")
Signed-off-by: Haiyue Wang 
---
v2: use 'g_date_time_new_now_utc' directly, which handles the time
zone reference correctly.
---
 util/error-report.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util/error-report.c b/util/error-report.c
index dbadaf206d..5edb2e6040 100644
--- a/util/error-report.c
+++ b/util/error-report.c
@@ -173,7 +173,7 @@ static char *
 real_time_iso8601(void)
 {
 #if GLIB_CHECK_VERSION(2,62,0)
-g_autoptr(GDateTime) dt = g_date_time_new_from_unix_utc(g_get_real_time());
+g_autoptr(GDateTime) dt = g_date_time_new_now_utc();
 /* ignore deprecation warning, since GLIB_VERSION_MAX_ALLOWED is 2.56 */
 #pragma GCC diagnostic push
 #pragma GCC diagnostic ignored "-Wdeprecated-declarations"
-- 
2.36.0




[PATCH v1] error-report: fix g_date_time_format assertion

2022-04-24 Thread Haiyue Wang
The 'g_get_real_time' returns the number of microseconds since January
1, 1970 UTC, but 'g_date_time_new_from_unix_utc' needs the number of
seconds, so it will cause the invalid time input:

(process:279642): GLib-CRITICAL (recursed) **: g_date_time_format: assertion 
'datetime != NULL' failed

Call 'g_date_time_new_now' with UTC time zone, it has the same result as
'g_date_time_new_from_unix_utc(g_get_real_time()/1e6)';

Fixes: 73dab893b569 ("error-report: replace deprecated g_get_current_time() 
with glib >= 2.62")
Signed-off-by: Haiyue Wang 
---
 util/error-report.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util/error-report.c b/util/error-report.c
index dbadaf206d..4000fff14a 100644
--- a/util/error-report.c
+++ b/util/error-report.c
@@ -173,7 +173,7 @@ static char *
 real_time_iso8601(void)
 {
 #if GLIB_CHECK_VERSION(2,62,0)
-g_autoptr(GDateTime) dt = g_date_time_new_from_unix_utc(g_get_real_time());
+g_autoptr(GDateTime) dt = g_date_time_new_now(g_time_zone_new_utc());
 /* ignore deprecation warning, since GLIB_VERSION_MAX_ALLOWED is 2.56 */
 #pragma GCC diagnostic push
 #pragma GCC diagnostic ignored "-Wdeprecated-declarations"
-- 
2.36.0




Re: [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag

2022-04-24 Thread Chao Peng
On Fri, Apr 22, 2022 at 10:43:50PM -0700, Vishal Annapurve wrote:
> On Thu, Mar 10, 2022 at 6:09 AM Chao Peng  wrote:
> >
> > From: "Kirill A. Shutemov" 
> >
> > Introduce a new memfd_create() flag indicating the content of the
> > created memfd is inaccessible from userspace through ordinary MMU
> > access (e.g., read/write/mmap). However, the file content can be
> > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> >
> > It provides semantics required for KVM guest private memory support
> > that a file descriptor with this flag set is going to be used as the
> > source of guest memory in confidential computing environments such
> > as Intel TDX/AMD SEV but may not be accessible from host userspace.
> >
> > Since page migration/swapping is not yet supported for such usages
> > so these pages are currently marked as UNMOVABLE and UNEVICTABLE
> > which makes them behave like long-term pinned pages.
> >
> > The flag can not coexist with MFD_ALLOW_SEALING, future sealing is
> > also impossible for a memfd created with this flag.
> >
> > At this time only shmem implements this flag.
> >
> > Signed-off-by: Kirill A. Shutemov 
> > Signed-off-by: Chao Peng 
> > ---
> >  include/linux/shmem_fs.h   |  7 +
> >  include/uapi/linux/memfd.h |  1 +
> >  mm/memfd.c | 26 +++--
> >  mm/shmem.c | 57 ++
> >  4 files changed, 88 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> > index e65b80ed09e7..2dde843f28ef 100644
> > --- a/include/linux/shmem_fs.h
> > +++ b/include/linux/shmem_fs.h
> > @@ -12,6 +12,9 @@
> >
> >  /* inode in-kernel data */
> >
> > +/* shmem extended flags */
> > +#define SHM_F_INACCESSIBLE 0x0001  /* prevent ordinary MMU access 
> > (e.g. read/write/mmap) to file content */
> > +
> >  struct shmem_inode_info {
> > spinlock_t  lock;
> > unsigned intseals;  /* shmem seals */
> > @@ -24,6 +27,7 @@ struct shmem_inode_info {
> > struct shared_policypolicy; /* NUMA memory alloc policy 
> > */
> > struct simple_xattrsxattrs; /* list of xattrs */
> > atomic_tstop_eviction;  /* hold when working on 
> > inode */
> > +   unsigned intxflags; /* shmem extended flags */
> > struct inodevfs_inode;
> >  };
> >
> > @@ -61,6 +65,9 @@ extern struct file *shmem_file_setup(const char *name,
> > loff_t size, unsigned long flags);
> >  extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
> > unsigned long flags);
> > +extern struct file *shmem_file_setup_xflags(const char *name, loff_t size,
> > +   unsigned long flags,
> > +   unsigned int xflags);
> >  extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
> > const char *name, loff_t size, unsigned long flags);
> >  extern int shmem_zero_setup(struct vm_area_struct *);
> > diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> > index 7a8a26751c23..48750474b904 100644
> > --- a/include/uapi/linux/memfd.h
> > +++ b/include/uapi/linux/memfd.h
> > @@ -8,6 +8,7 @@
> >  #define MFD_CLOEXEC0x0001U
> >  #define MFD_ALLOW_SEALING  0x0002U
> >  #define MFD_HUGETLB0x0004U
> > +#define MFD_INACCESSIBLE   0x0008U
> >
> >  /*
> >   * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 9f80f162791a..74d45a26cf5d 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -245,16 +245,20 @@ long memfd_fcntl(struct file *file, unsigned int cmd, 
> > unsigned long arg)
> >  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> >  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> >
> > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> > +  MFD_INACCESSIBLE)
> >
> >  SYSCALL_DEFINE2(memfd_create,
> > const char __user *, uname,
> > unsigned int, flags)
> >  {
> > +   struct address_space *mapping;
> > unsigned int *file_seals;
> > +   unsigned int xflags;
> > struct file *file;
> > int fd, error;
> > char *name;
> > +   gfp_t gfp;
> > long len;
> >
> > if (!(flags & MFD_HUGETLB)) {
> > @@ -267,6 +271,10 @@ SYSCALL_DEFINE2(memfd_create,
> > return -EINVAL;
> > }
> >
> > +   /* Disallow sealing when MFD_INACCESSIBLE is set. */
> > +   if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING)
> > +   return -EINVAL;
> > +
> > /* length includes terminating zero */
> > 

Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-04-24 Thread Chao Peng
On Fri, Apr 22, 2022 at 01:06:25PM +0200, Paolo Bonzini wrote:
> On 4/22/22 12:56, Chao Peng wrote:
> >  /* memfile notifier flags */
> >  #define MFN_F_USER_INACCESSIBLE   0x0001  /* memory allocated in 
> > the file is inaccessible from userspace (e.g. read/write/mmap) */
> >  #define MFN_F_UNMOVABLE   0x0002  /* memory allocated in 
> > the file is unmovable */
> >  #define MFN_F_UNRECLAIMABLE   0x0003  /* memory allocated in 
> > the file is unreclaimable (e.g. via kswapd or any other pathes) */
> 
> You probably mean BIT(0/1/2) here.

Right, it's BIT(n), Thanks.

Chao
> 
> Paolo
> 
> >  When memfile_notifier is being registered, memfile_register_notifier 
> > will
> >  need check these flags. E.g. for MFN_F_USER_INACCESSIBLE, it fails when
> >  previous mmap-ed mapping exists on the fd (I'm still unclear on how to 
> > do
> >  this). When multiple consumers are supported it also need check all
> >  registered consumers to see if any conflict (e.g. all consumers should 
> > have
> >  MFN_F_USER_INACCESSIBLE set). Only when the register succeeds, the fd 
> > is
> >  converted into a private fd, before that, the fd is just a normal 
> > (shared)
> >  one. During this conversion, the previous data is preserved so you can 
> > put
> >  some initial data in guest pages (whether the architecture allows this 
> > is
> >  architecture-specific and out of the scope of this patch).



Re: [PATCH v2 1/1] hw/i386/amd_iommu: Fix IOMMU event log encoding errors

2022-04-24 Thread Jason Wang
On Fri, Apr 22, 2022 at 1:52 PM Wei Huang  wrote:
>
> Coverity issues several UNINIT warnings against amd_iommu.c [1]. This
> patch fixes them by clearing evt before encoding. On top of it, this
> patch changes the event log size to 16 bytes per IOMMU specification,
> and fixes the event log entry format in amdvi_encode_event().
>
> [1] CID 1487116/1487200/1487190/1487232/1487115/1487258
>
> Reported-by: Peter Maydell 
> Signed-off-by: Wei Huang 
> ---

Acked-by: Jason Wang 

>  hw/i386/amd_iommu.c | 24 ++--
>  1 file changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
> index ea8eaeb330b6..725f69095b9e 100644
> --- a/hw/i386/amd_iommu.c
> +++ b/hw/i386/amd_iommu.c
> @@ -201,15 +201,18 @@ static void amdvi_setevent_bits(uint64_t *buffer, 
> uint64_t value, int start,
>  /*
>   * AMDVi event structure
>   *0:15   -> DeviceID
> - *55:63  -> event type + miscellaneous info
> - *63:127 -> related address
> + *48:63  -> event type + miscellaneous info
> + *64:127 -> related address
>   */
>  static void amdvi_encode_event(uint64_t *evt, uint16_t devid, uint64_t addr,
> uint16_t info)
>  {
> +evt[0] = 0;
> +evt[1] = 0;
> +
>  amdvi_setevent_bits(evt, devid, 0, 16);
> -amdvi_setevent_bits(evt, info, 55, 8);
> -amdvi_setevent_bits(evt, addr, 63, 64);
> +amdvi_setevent_bits(evt, info, 48, 16);
> +amdvi_setevent_bits(evt, addr, 64, 64);
>  }
>  /* log an error encountered during a page walk
>   *
> @@ -218,7 +221,7 @@ static void amdvi_encode_event(uint64_t *evt, uint16_t 
> devid, uint64_t addr,
>  static void amdvi_page_fault(AMDVIState *s, uint16_t devid,
>   hwaddr addr, uint16_t info)
>  {
> -uint64_t evt[4];
> +uint64_t evt[2];
>
>  info |= AMDVI_EVENT_IOPF_I | AMDVI_EVENT_IOPF;
>  amdvi_encode_event(evt, devid, addr, info);
> @@ -234,7 +237,7 @@ static void amdvi_page_fault(AMDVIState *s, uint16_t 
> devid,
>  static void amdvi_log_devtab_error(AMDVIState *s, uint16_t devid,
> hwaddr devtab, uint16_t info)
>  {
> -uint64_t evt[4];
> +uint64_t evt[2];
>
>  info |= AMDVI_EVENT_DEV_TAB_HW_ERROR;
>
> @@ -248,7 +251,8 @@ static void amdvi_log_devtab_error(AMDVIState *s, 
> uint16_t devid,
>   */
>  static void amdvi_log_command_error(AMDVIState *s, hwaddr addr)
>  {
> -uint64_t evt[4], info = AMDVI_EVENT_COMMAND_HW_ERROR;
> +uint64_t evt[2];
> +uint16_t info = AMDVI_EVENT_COMMAND_HW_ERROR;
>
>  amdvi_encode_event(evt, 0, addr, info);
>  amdvi_log_event(s, evt);
> @@ -261,7 +265,7 @@ static void amdvi_log_command_error(AMDVIState *s, hwaddr 
> addr)
>  static void amdvi_log_illegalcom_error(AMDVIState *s, uint16_t info,
> hwaddr addr)
>  {
> -uint64_t evt[4];
> +uint64_t evt[2];
>
>  info |= AMDVI_EVENT_ILLEGAL_COMMAND_ERROR;
>  amdvi_encode_event(evt, 0, addr, info);
> @@ -276,7 +280,7 @@ static void amdvi_log_illegalcom_error(AMDVIState *s, 
> uint16_t info,
>  static void amdvi_log_illegaldevtab_error(AMDVIState *s, uint16_t devid,
>hwaddr addr, uint16_t info)
>  {
> -uint64_t evt[4];
> +uint64_t evt[2];
>
>  info |= AMDVI_EVENT_ILLEGAL_DEVTAB_ENTRY;
>  amdvi_encode_event(evt, devid, addr, info);
> @@ -288,7 +292,7 @@ static void amdvi_log_illegaldevtab_error(AMDVIState *s, 
> uint16_t devid,
>  static void amdvi_log_pagetab_error(AMDVIState *s, uint16_t devid,
>  hwaddr addr, uint16_t info)
>  {
> -uint64_t evt[4];
> +uint64_t evt[2];
>
>  info |= AMDVI_EVENT_PAGE_TAB_HW_ERROR;
>  amdvi_encode_event(evt, devid, addr, info);
> --
> 2.35.1
>