Re: loading of zeros into {x,y,z}mm registers
Hello Richard,

On 01 Dec 12:44, Richard Biener wrote:
> On Fri, Dec 1, 2017 at 6:45 AM, Kirill Yukhin wrote:
> > Hello Jan,
> > On 29 Nov 08:59, Jan Beulich wrote:
> >> Kirill,
> >>
> >> in an unrelated context I've stumbled across a change of yours
> >> from Aug 2014 (revision 213847) where you "extend" the ways
> >> of loading zeros into registers.  I don't understand why this was
> >> done, and the patch submission mail also doesn't give any reason.
> >> My point is that simple VEX-encoded vxorps/vxorpd/vpxor with
> >> 128-bit register operands ought to be sufficient to zero any width
> >> registers, due to the zeroing of the high parts the instructions do.
> >> Hence by using EVEX encoded insns it looks like all you do is grow
> >> the instruction length by one or two bytes (besides making the
> >> source somewhat more complicated to follow).  At the very least
> >> the shorter variants should be used for -Os imo.
> > As far as I can recall, this was done because we cannot load zeroes
> > into the upper 16 MM registers, which are accessible in EVEX
> > encoding exclusively.
>
> Note that on Zen, pxor on %ymm also takes double the amount of
> resources compared to that on %xmm.
>
> It would be nice to fix this (and maybe also factor the ability to
> reference upper 16 MM regs in costing during RA ...?).
I think this is not a bad idea: replace the insn when we know that the
MM reg number is less than 16.
>
> Richard.
>
> >>
> >> Thanks for any insight,
> >> Jan
> >>
> >
> > --
> > Thanks, K
--
Thanks, K
Re: loading of zeros into {x,y,z}mm registers
Hello Jan,

On 29 Nov 08:59, Jan Beulich wrote:
> Kirill,
>
> in an unrelated context I've stumbled across a change of yours
> from Aug 2014 (revision 213847) where you "extend" the ways
> of loading zeros into registers.  I don't understand why this was
> done, and the patch submission mail also doesn't give any reason.
> My point is that simple VEX-encoded vxorps/vxorpd/vpxor with
> 128-bit register operands ought to be sufficient to zero any width
> registers, due to the zeroing of the high parts the instructions do.
> Hence by using EVEX encoded insns it looks like all you do is grow
> the instruction length by one or two bytes (besides making the
> source somewhat more complicated to follow).  At the very least
> the shorter variants should be used for -Os imo.
As far as I can recall, this was done because we cannot load zeroes
into the upper 16 MM registers, which are accessible in EVEX encoding
exclusively.
>
> Thanks for any insight,
> Jan
>
--
Thanks, K
[RFC, VECTOR ABI] Allow __attribute__((vector)) in GCC by default.
Hello,

Recently the vector ABI was introduced into GCC.  Vector versions of
math functions were incorporated into GlibC starting from v2.22.
Unfortunately, to make these functions work, the `-fopenmp' switch must
be added to the compiler invocation.  This is because the vector
variants of the math functions are declared using the `omp declare
simd' pragma.

There's an alternative: to use __attribute__((vector)) on a function.
Currently it's enabled under the `-fcilkplus' switch.

To enable vectorization of loops with calls to math functions, it seems
reasonable to enable parsing of the vector attribute for functions
unconditionally, and to change GlibC's header file to use
__attribute__((vector)) instead of `omp declare simd'.

If the community has no objections, I'll prepare a patch for GCC main
trunk & GlibC.

--
Thanks, K
Re: Offloading GSOC 2015
Hello Güray,

On 20 Mar 12:14, guray ozen wrote:
> I've started to prepare my gsoc proposal for gcc's openmp for gpus.
I think there is a wide range for exploration here.  As you know,
OpenMP 4 contains vectorization pragmas (`pragma omp simd') which do
not perfectly suit GPGPU.  Another problem is how to create threads
dynamically on a GPGPU.  As far as we understand it, there are two
possible solutions:
  1. Use the dynamic parallelism available in the recent API (launch a
     new kernel from the target)
  2. Estimate the maximum thread number on the host and start them all
     from the host, making unused threads busy-wait
There are papers which investigate both approaches [1], [2].
> However i'm little bit confused about which ideas, i mentioned last my
> mail, should i propose or which one of them is interesting for gcc.
> I'm willing to work on data clauses to enhance performance of shared
> memory. Or maybe it might be interesting to work on OpenMP 4.1 draft
> version. How do you think i should propose idea?
We're going to work on OpenMP 4.1 offloading features.

[1] - http://openmp.org/sc14/Booth-Sam-IBM.pdf
[2] - http://dl.acm.org/citation.cfm?id=2688364

--
Thanks, K
Re: [PATCH 0/4] OpenMP 4.0 offloading to Intel MIC
Hi Tobias,

On 13 Nov 16:15, Tobias Burnus wrote:
> Kirill Yukhin wrote:
> > Support of OpenMP 4.0 offloading to future Xeon Phi was
> > fully checked in to main trunk.
>
> Thanks. If I understood it correctly:
>
> * GCC 5 supports code generation for Xeon Phi (Knights Landing, KNL)
Right.
> * KNL (the hardware) is not yet available [mid 2015?]
Yes, but I don't know the date.
> * liboffloadmic supports offloading in an emulation mode (executed on
>   the host) but does not (yet) support offloading to KNL; i.e. one
>   would need an updated version of it, once one gets hold of the
>   actual hardware.
Yes, it supports emulation mode.  Also, the current scheme is the same
as for KNC (however, we have no code generator for KNC in GCC main
trunk).  We're going to keep liboffloadmic up to date.
> * The current hardware (Xeon Phi Knights Corner, KNC) is and will not
>   be supported by GCC.
Currently GCC main trunk doesn't support KNC code generation.
> * Details for building GCC for offloading and running code on an
>   accelerator are at https://gcc.gnu.org/wiki/Offloading
>
> Question: Is the latter up to date - and the item above correct?
Correct.
> BTW: you could update gcc.gnu.org -> news and
> gcc.gnu.org/gcc-5/changes.html
Thanks, I'll post a patch.

--
Thanks, K
Re: [PATCH 0/4] OpenMP 4.0 offloading to Intel MIC
Hello,

Support of OpenMP 4.0 offloading to the future Xeon Phi has been fully
checked in to main trunk.  Thanks to everybody who helped with
development and review.

--
Thanks, K
Re: Offload Library
Hello David,

On 20 Jun 14:46, David Edelsohn wrote:
> On Fri, May 16, 2014 at 7:47 AM, Kirill Yukhin wrote:
> > Does this look OK?
>
> The GCC SC has decided to allow this library in the GCC sources.
Great news, thanks!
> If the library is not going to be expanded to support all GPUs and
> offload targets, the library name should be more specific to Intel.
Sure, we'll prepare updated sources and start the review in the coming
days.

--
Thanks, K
Re: Offload Library
Hello,

On 19 May 16:53, Kirill Yukhin wrote:
> Hello Ian,
> On 16 May 07:07, Ian Lance Taylor wrote:
> > On Fri, May 16, 2014 at 4:47 AM, Kirill Yukhin wrote:
> > >
> > > To support the offloading features for Intel's Xeon Phi cards
> > > we need to add a foreign library (liboffload) into the gcc
> > > repository.  README with build instructions is attached.
> >
> > Can you explain why this library should be part of GCC, and how GCC
> > would use it?  I'm sure it's obvious to you but it's not obvious to
> > me.
> The ‘target’ clause of OpenMP 4.0, aka ‘offloading’, is expected to
> be supported as part of libgomp.  Every target platform to be
> supported should implement a dedicated plugin for libgomp.  The
> plugin for Xeon Phi is based on the liboffload functionality.  This
> library will also provide compatibility with binaries built by ICC.

Let me provide some technical details.

This library is a compiler-specific library whose origin goes back to
ICC.  We think that, since this library is compiler-related, it is
better to put it into the GCC source tree.  Also, we plan to accompany
this library with an emulator of the lower-level interface.  Having
such a library and an emulator, we'll eliminate all external
dependencies for offloading to work.  So, it will be possible to do
`make check' for offload tests without linking in any external
libraries; no offload HW will be needed either.

The intended use of this library is as follows.  The implementation of
libgomp's interface routines for offload (like `GOMP_target') will call
the target-dependent part of the library.  This part (for COI-capable
software stacks) will call liboffload routines to perform any
offload-related tasks.

This library is used by ICC, and we plan to use it for offloading in
LLVM as well.  That will make sure that the offloading features of
these compilers are binary compatible.  The library is going to be
built as a shared object.

--
Thanks, K
Re: Offload Library
Hello, Thomas!

On 16 May 19:30, Thomas Schwinge wrote:
> On Fri, 16 May 2014 15:47:58 +0400, Kirill Yukhin wrote:
> > To support the offloading features for Intel's Xeon Phi cards
> > we need to add a foreign library (liboffload) into the gcc
> > repository.
>
> As written in the README, this library currently is specific to Intel
> hardware (understandably, of course), and I assume also in the future
> is to remain that way (?) -- should it thus get a more specific name
> in GCC, than the generic liboffload?
Yes, this library generates calls to the Intel-specific Coprocessor
Offload Interface (COI).  I think the name of the library may be
changed; when I submit the patch, we'll discuss it.
> > Additionally to that sources we going to add few headers [...]
> > and couple of new sources
>
> For interfacing with GCC, presumably.  You haven't stated it
> explicitly, but do I assume right that this work will be going onto
> the gomp-4_0-branch, integrated with the offloading work developed
> there, as a plugin for libgomp?
Not exactly.  I was talking about a COI emulator, which will allow
testing of offload without any external library dependency or HW.
The libgomp <-> liboffload plug-in is also ready, but it needs no such
approval, so it'll be submitted as a separate patch.

--
Thanks, K

> Grüße,
> Thomas
Re: Offload Library
Hello Ian,

On 16 May 07:07, Ian Lance Taylor wrote:
> On Fri, May 16, 2014 at 4:47 AM, Kirill Yukhin wrote:
> >
> > To support the offloading features for Intel's Xeon Phi cards
> > we need to add a foreign library (liboffload) into the gcc
> > repository.  README with build instructions is attached.
>
> Can you explain why this library should be part of GCC, and how GCC
> would use it?  I'm sure it's obvious to you but it's not obvious to
> me.
The ‘target’ clause of OpenMP 4.0, aka ‘offloading’, is expected to be
supported as part of libgomp.  Every target platform to be supported
should implement a dedicated plugin for libgomp.  The plugin for Xeon
Phi is based on the liboffload functionality.  This library will also
provide compatibility with binaries built by ICC.

--
Thanks, K

> Ian
Offload Library
Dear steering committee,

To support the offloading features for Intel's Xeon Phi cards we need
to add a foreign library (liboffload) into the gcc repository.  README
with build instructions is attached.  I am also copy-pasting the header
comment from one of the liboffload files.  The header shown below will
be in all the source files in liboffload.  Sources can be downloaded
from [1].

In addition to those sources, we are going to add a few headers
(released under the GPL v2.1 license) and a couple of new sources
(license at the bottom of the message).

Does this look OK?

[1] - https://www.openmprtl.org/sites/default/files/liboffload_oss.tgz

--
Thanks, K

/* Copyright (c) 2014 Intel Corporation.  All Rights Reserved.

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions
   are met:

     * Redistributions of source code must retain the above copyright
       notice, this list of conditions and the following disclaimer.
     * Redistributions in binary form must reproduce the above
       copyright notice, this list of conditions and the following
       disclaimer in the documentation and/or other materials provided
       with the distribution.
     * Neither the name of Intel Corporation nor the names of its
       contributors may be used to endorse or promote products derived
       from this software without specific prior written permission.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE
   COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
   INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
   BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
   LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
   CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
   LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
   ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
   POSSIBILITY OF SUCH DAMAGE.  */

README for Intel(R) Offload Runtime Library
===========================================

How to Build Documentation
==========================

The main documentation is in Doxygen* format, and this distribution
should come with pre-built PDF documentation in doc/Reference.pdf.
However, an HTML version can be built by executing:

    % doxygen doc/doxygen/config

in this directory.  That will produce HTML documentation in the
doc/doxygen/generated directory, which can be accessed by pointing a
web browser at the index.html file there.  If you don't have Doxygen
installed, you can download it from www.doxygen.org.

How to Build the Intel(R) Offload Runtime Library
=================================================

The Makefile at the top level will attempt to detect what it needs to
build the Intel(R) Offload Runtime Library.  To see the default
settings, type:

    make info

You can change the Makefile's behavior with the following options:

    root_dir:        The path to the top-level directory containing the
                     top-level Makefile.  By default, this is the
                     current working directory.
    build_dir:       The path to the build directory.  By default, this
                     is [root_dir]/build.
    mpss_dir:        The path to the Intel(R) Manycore Platform
                     Software Stack install directory.  By default,
                     this is the operating system's root directory.
    compiler_host:   Which compiler to use for the build of the host
                     part.  Defaults to "gcc"*.  Also supports "icc"
                     and "clang"*.  You should provide the full path to
                     the compiler, or it should be in the user's path.
    compiler_target: Which compiler to use for the build of the target
                     part.  Defaults to "gcc"*.  Also supports "icc"
                     and "clang"*.  You should provide the full path to
                     the compiler, or it should be in the user's path.
    options_host:    Additional options for the host compiler.
    options_target:  Additional options for the target compiler.

To use any of the options above, simply add option=value.  For example,
if you want to build with icc instead of gcc, type:

    make compiler_host=icc compiler_target=icc

Supported RTL Build Configurations
==================================

Supported Architectures: Intel(R) 64, and Intel(R) Many Integrated Core
Architecture

              | icc/icl | gcc | clang |
--------------|---------|
[gomp4] Building binaries for offload.
Hello,

Let me somewhat summarize the current understanding of host binary
linking, as well as target binary building/linking.

We put code which is supposed to be offloaded into dedicated sections,
with names starting with gnu.target_lto_

At link time (I mean, link time of the host app) we:
  1. Generate a dedicated data section in each binary (executable or
     DSO), which'll be a placeholder for the offloading stuff.
  2. Generate an __OPENMP_TARGET__ (weak, hidden) symbol, which'll
     point to the start of the section mentioned in the previous item.

This section should contain at least:
  1. Number of targets
  2. Size of the offl. symbols table
  [ Repeat `number of targets' times ]
  3. Name of target
  4. Offset to the beginning of the image to offload to that target
  5. Size of the image
  6. Offl. symbols table

The offloading symbols table will contain information about the
addresses of offloadable symbols, in order to create the host<->target
address mapping at runtime.  To get the list of target addresses we
need a dedicated interface call in the libgomp plugin, something like
getTargetAddresses (), which will query the target for the list of
addresses (accompanied by symbol names).  To provide this information,
the target DSO should contain a similar table mapping symbols to
addresses.

The application is going to have a single instance of libgomp, which in
turn means that we'll have a single splay tree holding the mapping
information (host -> target) for all DSOs and the executable.

When GOMP_target* is called, a pointer to the table of the current
execution module is passed to libgomp along with a pointer to the
routine (or global).  libgomp in turn:
  1. Checks in the splay tree whether the address of the given pointer
     (to the table) exists.  If not, this means the given table is not
     yet initialized; libgomp initializes it (see below) and inserts
     the address of the table into the splay tree.
  2. Performs a lookup for the (host) address in the table provided,
     extracting the target address.
  3. After the target address is found, performs an API call (passing
     that address) to the given device.

We have at least 2 approaches to solving the host->target mapping.

I. Preserve the order of symbol appearance.
   Table row: [ address, size ]
   For routines, size is 1.
   In order to initialize the table we need two arrays: of host and of
   target addresses.  The order of appearance of objects in these
   arrays must be the same.  Having this makes the mapping easy: we
   just find the index of a given address in the array of host addrs
   and then dereference the array of target addresses with the index
   found.  The problem is that this is unlikely to work when LTO of the
   host is ON.  I am also not sure that the order of handling objects
   on the target is the same as on the host.

II. Store a symbol identifier along with the address.
   Table row: [ symbol_name, address, size ]
   For routines, size is 1.
   To construct the table of host addresses, at link time we put all
   symbol addresses (marked at compile time with a dedicated attribute)
   into the table, accompanied by symbol names (they'll serve as keys).
   During initialization of the table we create the host->target
   address mapping using the symbol names as keys.

The last thing I wanted to summarize: compiling the target code.  We
have 2 approaches here:
  1. Perform WPA and extract the sections marked as target into a
     separate object file.  Then call the target compiler on that
     object file to produce the binary.  As mentioned by Jakub, this
     approach will complicate debugging.
  2. Pass fat object files directly to the target compiler (one CU at a
     time).  So, for every object file we are going to call GCC twice:
       - the host GCC, which will compile all the host code for every CU
       - the target GCC, which will compile all the target code for
         every CU

I vote for option #2, since the WPA-based approach complicates
debugging.

What do you guys think?

--
Thanks, K
[gomp4] GOMP_target fall back execution
Hello,

It seems that currently GOMP_target performs a call to the host variant
of the routine:

  void
  GOMP_target (int device, void (*fn) (void *), const char *fnname,
               size_t mapnum, void **hostaddrs, size_t *sizes,
               unsigned char *kinds)
  {
    device = resolve_device (device);
    if (device == -1)
      {
        /* Host fallback.  */
        fn (hostaddrs);
        return;
      }
    ...
  }

Why not make GOMP_target return bool and run the host fallback at the
call site instead of inside the library?

Was:
  ...
  GOMP_target (handle, &foo, "foo", ...);
  ...

Proposed to be:
  ...
  if (!GOMP_target (handle, &foo, "foo", ...))
    foo (hostaddrs);
  ...

The main advantage, IMHO, is that by doing so we may probably enable
inlining of the host variant.  We could also eliminate the pointer to
the host function and hostaddrs from the GOMP_target declaration,
having, instead of:

  void
  GOMP_target (int device, void (*fn) (void *), const char *fnname,
               size_t mapnum, void **hostaddrs, size_t *sizes,
               unsigned char *kinds)

this:

  bool
  GOMP_target (int device, const char *fnname, size_t mapnum,
               size_t *sizes, unsigned char *kinds)

What do you think?

--
Thanks, K
Re: [RFC] Offloading Support in libgomp
Hello,

Adding Richard, who might want to take a look at the LTO stuff.

--
Thanks, K
Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
On 30 Jul 17:55, Kirill Yukhin wrote:
> On Wed, Jul 24, 2013 at 08:25:14AM -1000, Richard Henderson wrote:
> > On 07/24/2013 05:23 AM, Richard Biener wrote:
> > > "H.J. Lu" wrote:
> > >
> > >> Hi,
> > >>
> > >> Here is a patch to extend x86-64 psABI to support AVX-512:
> > >
> > > Afaik avx 512 doubles the amount of xmm registers.  Can we get
> > > them callee saved please?

Hello,

I've implemented a tiny patch on top of the `avx512' branch.  It makes
the first 128-bit parts of 8 AVX-512 registers callee-saved: xmm16
through xmm23.

Here is the performance data.  It seems we have a little degradation in
the GEOMEAN.

Workload: Spec2006
Dataset: test
Options, experiment: -m64 -fstrict-aliasing -fno-prefetch-loop-arrays
  -Ofast -funroll-loops -flto -fwhole-program -mavx512f
Options, reference:  -m64 -fstrict-aliasing -fno-prefetch-loop-arrays
  -Ofast -funroll-loops -flto -fwhole-program

                 icount, all       8 callee-save     icount
                 (call-clobber)    icount            decrease
400.perlbench    1686198567        1682320942        -0.23%
401.bzip2        18983033855       18983033907        0.00%
403.gcc          3999481141        3999095681        -0.01%
410.bwaves       13736672428       13736640026        0.00%
416.gamess       1531782811        1531350122        -0.03%
429.mcf          3079764286        3080957858         0.04%
433.milc         14628097067       14628175244        0.00%
434.zeusmp       21336261982       21359384879        0.11%
435.gromacs      3593653152        3588581849        -0.14%
436.cactusADM    2822346689        2828797842         0.23%
437.leslie3d     15903712760       15975143040        0.45%
444.namd         42446067469       43607637322        2.74%
445.gobmk        35272482208       35268743690       -0.01%
447.dealII       42476324881       42507009849        0.07%
450.soplex       4594315045652666                    -0.63%
453.povray       2314481169        157619            -3.99%
454.calculix     131024939         131078501          0.04%
456.hmmer        13853478444       13853306947        0.00%
458.sjeng        14173066874       14173066909        0.00%
459.GemsFDTD     2437559044        2437819638         0.01%
462.libquantum   175827242         175657854         -0.10%
464.h264ref      75718510217       75711714226       -0.01%
465.tonto        2505737844        2511457541         0.23%
470.lbm          4799298802        4812180033         0.27%
473.astar        17435751523       17435498947        0.00%
481.wrf          7144685575        7170593748         0.36%
482.sphinx3      6000198462        5984438416        -0.26%
483.xalancbmk    273958223         273638145         -0.12%
GEOMEAN          4678862313        4677012093        -0.04%

A bigger % is better; a negative value means that the icount increased
after the experiment.

It seems to me that LRA is not always optimal; e.g., if you compile the
attached testcase with:

  ./build-x86_64-linux/gcc/xgcc -B./build-x86_64-linux/gcc repro.c \
    -S -Ofast -mavx512f

the assembler for main looks like:

main:
.LFB2331:
        vcvtsi2ss       %edi, %xmm1, %xmm1
        subq    $24, %rsp
        vextractf32x4   $0x0, %zmm16, (%rsp)
        vmovaps %zmm1, %zmm16
        call    test
        vfmadd132ss     .LC1(%rip), %xmm16, %xmm16
        vmovaps %zmm16, %zmm2
        movl    $.LC2, %edi
        movl    $1, %eax
        vunpcklps       %xmm2, %xmm2, %xmm2
        vcvtps2pd       %xmm2, %xmm0
        call    printf
        vmovaps %zmm16, %zmm3
        vinsertf32x4    $0x0, (%rsp), %zmm16, %zmm16
        addq    $24, %rsp
        vcvttss2si      %xmm3, %eax
        ret

I have no idea why we are doing the conversion to %xmm1 and then saving
it to %xmm16.  However, it may be a non-LRA issue.

Thanks, K

---
 gcc/config/i386/i386.c | 2 +-
 gcc/config/i386/i386.h | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 6b13ac9..d6d8040 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -9125,7 +9125,7 @@ ix86_nsaved_sseregs (void)
   int nregs = 0;
   int regno;

-  if (!TARGET_64BIT_MS_ABI)
+  if (!(TARGET_64BIT_MS_ABI || TARGET_AVX512F))
     return 0;
   for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
     if (SSE_REGNO_P (regno) && ix86_save_reg (regno, true))
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index d7a934d..9faab8b 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1026,9 +1026,9 @@ enum target_cpu_default
 /*xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15*/		\
      6,   6,    6,    6,    6,    6,    6,    6,		\
 /*xmm16,xmm17,xmm18,xmm19,xmm20,xmm21,xmm22,xmm23*/		\
-     6,    6,    6,    6,    6,    6,    6,    6,		\
+     0,    0,    0,    0,    0,    0,    0,    0,		\
 /*xmm24,xmm25,xmm26,xmm27,xmm28,xmm29,xmm30,xmm31*/
Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512
On Wed, Jul 24, 2013 at 08:25:14AM -1000, Richard Henderson wrote:
> On 07/24/2013 05:23 AM, Richard Biener wrote:
> > "H.J. Lu" wrote:
> >
> >> Hi,
> >>
> >> Here is a patch to extend x86-64 psABI to support AVX-512:
> >
> > Afaik avx 512 doubles the amount of xmm registers.  Can we get them
> > callee saved please?
>
> Having them callee saved pre-supposes that one knows the width of the
> register.

The whole architecture of SSE/AVX is based on zeroing the upper bits.
For reference, take a look at the definition of VLMAX in the Spec.

E.g., for AVX2 we had:

  vaddps %ymm1, %ymm2, %ymm3

Intuition says (at least to me) that after compilation it shouldn't
have any notion of a 256-bit `upper' half.  But with AVX-512 we have
(again, see the Spec, operation section of vaddps, VEX.256 encoded):

  DEST[31:0] = SRC1[31:0] + SRC2[31:0]
  ...
  DEST[255:224] = SRC1[255:224] + SRC2[255:224]
  DEST[MAX_VL-1:256] = 0

So, legacy code *will* change the upper 256 bits of a vector register.

The roots of this can be found in the 64-bit GPR insns.  We have
different behavior on 64-bit and 32-bit targets for the following
sequence:

  push %eax
  ;; play with eax
  pop %eax

On a 64-bit machine the upper 32 bits of %rax will be zeroed, and if we
then try to use the old value of %rax - fail!

So, following such a philosophy prohibits making vector registers
callee-saved.

BUT.  What if we make a couple of new registers callee-saved in the
sense of a *scalar* type?  What we can do:
  1. Make only bits [0..XXX] of a vector register callee-saved.
  2. Make bits (XXX..VLMAX] of the same register call-clobbered.
XXX is the number of bits to be callee-saved: 64, 80, 128 or even 512.

The advantage is that when we are doing FP scalar code, we don't have
to bother with saving/restoring the callee-saved part:

  vaddss %xmm17, %xmm17, %xmm17
  call foo
  vaddss %xmm17, %xmm17, %xmm17

We don't care whether `foo':
  - is legacy in the AVX-512 sense - it just sees no xmm17;
  - is future-ISA in that sense - if this code runs on a 1024-bit wide
    reg and `foo' is AVX-512, it will save XXX bits, allowing us to
    continue scalar calculations without saving/restoring.

--
Thanks, K
Re: setjmp () detection in RTL
> Isn't the REG_SETJMP note sufficient for this purpose?
Yeah, missed that.  Sorry for the noise.

Thanks a lot!
setjmp () detection in RTL
Hi,

Could anybody please advise whether I can detect that a given RTL
`call' is actually a setjmp ()?  I see no references in the dump...

(call_insn 6 5 7 (set (reg:SI 0 ax)
        (call (mem:QI (symbol_ref:DI ("_setjmp") [flags 0x41] ) [0 _setjmp S1 A8])
            (const_int 0 [0]))) 4.c:17 -1
     (expr_list:REG_SETJMP (const_int 0 [0])
        (expr_list:REG_EH_REGION (const_int 0 [0])
            (nil)))
    (expr_list:REG_FRAME_RELATED_EXPR (use (reg:DI 5 di))
        (nil)))

Thanks
Re: Vectorizer question: DIV to RSHIFT conversion
Great!

Thanks, K

>
> Let me hack up a quick pattern recognizer for this...
>
> Jakub
Re: Vectorizer question: DIV to RSHIFT conversion
The full case is attached.  Jakub, you are right, we have to convert
signed ints into something a bit more tricky.

BTW, here is the output for that case from the Intel compiler:

        vpxor     %ymm1, %ymm1, %ymm1                 #184.23
        vmovdqu   .L_2il0floatpacket.12(%rip), %ymm0  #184.23
        movslq    %ecx, %rdi                          #182.7
# LOE rax rbx rdi edx ecx ymm0 ymm1
..B1.82:        # Preds ..B1.82 ..B1.81
        vmovdqu   2132(%rsp,%rax,4), %ymm3            #183.14
        vmovdqu   2100(%rsp,%rax,4), %ymm2            #183.27
        vpsrad    $8, %ymm3, %ymm11                   #183.27
        vpsrad    $8, %ymm2, %ymm5                    #183.27
        vpcmpgtd  %ymm5, %ymm1, %ymm4                 #184.23
        vpcmpgtd  %ymm11, %ymm1, %ymm10               #184.23
        vpand     %ymm0, %ymm4, %ymm6                 #184.23
        vpand     %ymm0, %ymm10, %ymm12               #184.23
        vpaddd    %ymm6, %ymm5, %ymm7                 #184.23
        vpaddd    %ymm12, %ymm11, %ymm13              #184.23
        vpsrad    $1, %ymm7, %ymm8                    #184.23
        vpsrad    $1, %ymm13, %ymm14                  #184.23
        vpaddd    %ymm0, %ymm8, %ymm9                 #184.23
        vpaddd    %ymm0, %ymm14, %ymm15               #184.23
        vpslld    $8, %ymm9, %ymm2                    #185.27
        vpslld    $8, %ymm15, %ymm3                   #185.27
        vmovdqu   %ymm2, 2100(%rsp,%rax,4)            #185.10
        vmovdqu   %ymm3, 2132(%rsp,%rax,4)            #185.10
        addq      $16, %rax                           #182.7
        cmpq      %rdi, %rax                          #182.7
        jb        ..B1.82     # Prob 99%              #182.7

Thanks, K

On Tue, Dec 13, 2011 at 5:21 PM, Jakub Jelinek wrote:
> On Tue, Dec 13, 2011 at 02:07:11PM +0100, Richard Guenther wrote:
> > > Hi guys,
> > > While looking at Spec2006/401.bzip2 I found such a loop:
> > > for (i = 1; i <= alphaSize; i++) {
> > >   j = weight[i] >> 8;
> > >   j = 1 + (j / 2);
> > >   weight[i] = j << 8;
> > > }
>
> It would be helpful to have a self-contained testcase, because we
> don't know the types of the variables in question.  Is j signed or
> unsigned?  Signed divide by 2 is unfortunately not equivalent to
> >> 1.  If j is signed int, on x86_64 we expand j / 2 as
> (j + (j >> 31)) >> 1.  Sure, the pattern recognizer could try that if
> vector division isn't supported.  If j is unsigned int, then I'd
> expect it to be already canonicalized into >> 1 by the time we enter
> the vectorizer.
>
> Jakub

int weight[258 * 2];

void
foo (int alphaSize)
{
  int j, i;
  for (i = 1; i <= alphaSize; i++)
    {
      j = weight[i] >> 8;
      j = 1 + (j / 2);
      weight[i] = j << 8;
    }
}
Vectorizer question: DIV to RSHIFT conversion
Hi guys,

While looking at Spec2006/401.bzip2 I found such a loop:

  for (i = 1; i <= alphaSize; i++) {
    j = weight[i] >> 8;
    j = 1 + (j / 2);
    weight[i] = j << 8;
  }

It is not vectorizable (using Intel's AVX2) because division by two is
not recognized as a right shift:

  5: ==> examining statement: D.3785_6 = j_5 / 2;
  5: vect_is_simple_use: operand j_5
  5: def_stmt: j_5 = D.3784_4 >> 8;
  5: type of def: 3.
  5: vect_is_simple_use: operand 2
  5: op not supported by target.
  5: not vectorized: relevant stmt not supported: D.3785_6 = j_5 / 2;

However, while expanding, it is successfully turned into a shift:

  (insn 42 41 43 6 (parallel [
              (set (reg:SI 107)
                  (ashiftrt:SI (reg:SI 106)
                      (const_int 1 [0x1])))
              (clobber (reg:CC 17 flags))
          ]) 1.c:7 -1
       (expr_list:REG_EQUAL (div:SI (reg:SI 103)
              (const_int 2 [0x2]))
          (nil)))

Converting `division by a power of 2' into a shift seems to be
generally beneficial.  My question is: what is, in your opinion, the
best way to do such a conversion?  An obvious solution would be to
introduce a dedicated pass which converts all such cases.  We could
also try to implement a dedicated expand, but I have no idea how to
specify in the pattern name (if possible) that the second operand is
something fixed.

Any help is appreciated.

Thanks, K
Re: _mm{,256}_i{32,64}gather_{ps,pd,epi32,epi64} intrinsics semantics
Hello Jakub,

I've talked to our engineers who work on vectorization in ICC.  They
all said: "yes, you can optimize the vpxor out, both in f1 and f2".

Thanks, K
Re: _mm{,256}_i{32,64}gather_{ps,pd,epi32,epi64} intrinsics semantics
> %ymm0 is all ones (this is code from the auto-vectorization).
> (2) is not useless, %ymm6 contains the mask, for auto-vectorization
> (3) is useless, it is there just because the current gather insn
> patterns always use the previous value of the destination register.
Sure, I constantly mix up Intel/gcc syntax; sorry for the confusion.

I've asked the guys who are working on vectorization in ICC.  It seems
we may drop the zeroing of the destination.  Here is an extract of the
answer (from one of the engineers responsible for vectorization in
ICC):
>>> I think zero in this situation is just a garbage value,
>>> and I don’t see why GCC and ICC need to be garbage to garbage
>>> compatible. If the programmer is using such a fault handler, he/she
>>> should know the consequences.

K
Re: _mm{,256}_i{32,64}gather_{ps,pd,epi32,epi64} intrinsics semantics
Hi Jakub,
Actually I did not get the point. If we have no src/masking, the
destination must be unchanged until the gather writes to it (at least
partially). If we have all ones in the mask, the src operand does not
matter at all. So the zeroing in the intrinsics is just useless.

Having a snippet like this:

  (1) vmovdqa  k(%rax,%rax), %ymm1
  (2) vmovaps  %ymm0, %ymm6
  (3) vmovaps  %ymm0, %ymm2
  (4) vmovdqa  k+32(%rax,%rax), %ymm3
  (5) vgatherdps %ymm6, vf1(,%ymm1,4), %ymm2

looks pretty strange. What value does %ymm0 hold? If it holds all
zeroes, then (1)-(5) is dead code which may simply be removed. If it
contains all ones, then (2) is useless.

But again, it seems I did not get your point...

Thanks, K
Re: GCC testing infrastructure issue
Thanks a lot. That is exactly what I was looking for!

K

On Wed, Sep 28, 2011 at 2:49 PM, Richard Guenther wrote:
> On Wed, Sep 28, 2011 at 12:18 PM, Kirill Yukhin wrote:
>> Hi folks,
>> I have a question. For DejaGNU we have only one option for each test.
>>
>> It may be e.g. either "dg-do compile" or "dg-do run". This is really
>> not that convenient.
>>
>> For instance, suppose we are checking auto-generation of some new
>> instruction. We have to write two tests:
>> 1. A routine containing a pattern which should be generated as the
>> desired instruction, checked at runtime against an expected result.
>> We use "dg-do run" here.
>> 2. A check that the instruction really is auto-generated, using
>> "scan-assembler" on the same source.
>>
>> My question is: have I missed something? Is there a way to put the
>> two tests into a single file? If not, why don't we have one?
>
> Add -save-temps via dg-options, then you can use scan-assembler.
>
>> Here is a reduced example (from gcc.target/i386):
>> 1.
>> /* run.c */
>> /* { dg-do run } */
>>
>> int
>> auto_gen_insn (args...)
>> {
>>   /* Code to auto-gen the instruction. */
>>   return result;
>> }
>>
>> int
>> check_semantic (args...)
>> {
>>   /* Code to do the same, but without the desired insn. */
>>   return result;
>> }
>>
>> int
>> main ()
>> {
>>   if (auto_gen_insn (args...) != check_semantic (args...))
>>     abort ();
>> }
>>
>> 2.
>> /* check_gen.c */
>> /* { dg-do compile } */
>> #include "run.c"
>> /* { dg-final { scan-assembler-times "insn" 1 } } */
>>
>> --
>> Thanks, k
GCC testing infrastructure issue
Hi folks,
I have a question. For DejaGNU we have only one option for each test.
It may be e.g. either "dg-do compile" or "dg-do run". This is really
not that convenient.

For instance, suppose we are checking auto-generation of some new
instruction. We have to write two tests:
1. A routine containing a pattern which should be generated as the
   desired instruction, checked at runtime against an expected result.
   We use "dg-do run" here.
2. A check that the instruction really is auto-generated, using
   "scan-assembler" on the same source.

My question is: have I missed something? Is there a way to put the two
tests into a single file? If not, why don't we have one?

Here is a reduced example (from gcc.target/i386):

1.
/* run.c */
/* { dg-do run } */

int
auto_gen_insn (args...)
{
  /* Code to auto-gen the instruction. */
  return result;
}

int
check_semantic (args...)
{
  /* Code to do the same, but without the desired insn. */
  return result;
}

int
main ()
{
  if (auto_gen_insn (args...) != check_semantic (args...))
    abort ();
}

2.
/* check_gen.c */
/* { dg-do compile } */
#include "run.c"
/* { dg-final { scan-assembler-times "insn" 1 } } */

--
Thanks, k
Re: Defining constraint for registers tuple
That is exactly it! Thank you very much!
BMI2 support is almost here :)

--
K

On Tue, Aug 16, 2011 at 6:58 PM, Richard Henderson wrote:
> On 08/16/2011 04:20 AM, Kirill Yukhin wrote:
>> Hi guys,
>> the question is still open. Let me try to explain further.
>>
>> The new MULX instruction is capable of storing the result of an
>> unsigned multiply to an arbitrary pair of GPRs (one of the source
>> operands must still be DX). But I have no idea how to implement such
>> a constraint. Here is a define_insn which works but uses i386's "A"
>> constraint, which is much worse than allowing any pair of registers.
>
> See {u}mulsidi3_internal in mn10300.md for an example of a
> double-word multiplication with two independent outputs.
>
>
> r~
Re: define_split for specific split pass
I think Ilya wants to run his pass in, say, 208r.split4 only.
It seems split2, split3 and split4 all run with `reload_completed`
set to true. Any ideas?

--
Thanks, K

On Tue, Aug 16, 2011 at 8:47 PM, Andrew Pinski wrote:
> On Tue, Aug 16, 2011 at 6:32 AM, Ilya Enkovich wrote:
>> Hello,
>>
>> Is there any way to specify in a define_split predicate that it
>> should work in some particular pass only? I need to create a split
>> which works in pass_split_before_sched2 only.
>
> So split before RA? Try conditionalizing it on !reload_completed.
>
> Thanks,
> Andrew Pinski
Re: Defining constraint for registers tuple
Hi guys,
the question is still open. Let me try to explain further.

The new MULX instruction is capable of storing the result of an
unsigned multiply to an arbitrary pair of GPRs (one of the source
operands must still be DX). But I have no idea how to implement such a
constraint. Here is a define_insn which works but uses i386's "A"
constraint, which is much worse than allowing any pair of registers:

(define_insn "*bmi2_mulx<mode>3"
  [(set (match_operand:<DWI> 0 "register_operand" "=A")
        (mult:<DWI>
          (zero_extend:<DWI>
            (match_operand:DWIH 1 "nonimmediate_operand" "d"))
          (zero_extend:<DWI>
            (match_operand:DWIH 2 "nonimmediate_operand" "rm"))))]
  "TARGET_BMI2 && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
  "mulx\t%2, %%eax, %%edx"
  [(set_attr "type" "imul")
   (set_attr "length_immediate" "0")
   (set_attr "mode" "<MODE>")])

Maybe there are examples from other ports?

Any help is appreciated.

Thanks, K

On Mon, Aug 1, 2011 at 4:28 PM, Kirill Yukhin wrote:
>> Don't change the constraint, just add an alternative. Or use a
>> different insn with an insn predicate.
>
> This is a misunderstanding because of my great English :)
>
> I am not going to update an existing constraint. I am going to
> implement a new one. Actually, I am looking for an example where a
> similar constraint might be implemented already.
>
> --
> Thanks, K
Re: Defining constraint for registers tuple
> Don't change the constraint, just add an alternative. Or use a
> different insn with an insn predicate.

This is a misunderstanding because of my great English :)

I am not going to update an existing constraint. I am going to
implement a new one. Actually, I am looking for an example where a
similar constraint might be implemented already.

--
Thanks, K
Defining constraint for registers tuple
Hi guys,
I'm working on the implementation of `mulx` (which is part of BMI2).
One of its improvements over the generic `mul` is that it allows the
destination registers to be specified. For `mul` we have the `A`
constraint, which stands for the AX:DX pair. So, is it possible to
relax this constraint and allow any pair of registers as destination?

Thanks, K