Re: loading of zeros into {x,y,z}mm registers

2017-12-01 Thread Kirill Yukhin
Hello Richard,
On 01 Dec 12:44, Richard Biener wrote:
> On Fri, Dec 1, 2017 at 6:45 AM, Kirill Yukhin  wrote:
> > Hello Jan,
> > On 29 Nov 08:59, Jan Beulich wrote:
> >> Kirill,
> >>
> >> in an unrelated context I've stumbled across a change of yours
> >> from Aug 2014 (revision 213847) where you "extend" the ways
> >> of loading zeros into registers. I don't understand why this was
> >> done, and the patch submission mail also doesn't give any reason.
> >> My point is that simple VEX-encoded vxorps/vxorpd/vpxor with
> >> 128-bit register operands ought to be sufficient to zero any width
> >> registers, due to the zeroing of the high parts the instructions do.
> >> Hence by using EVEX encoded insns it looks like all you do is grow
> >> the instruction length by one or two bytes (besides making the
> >> source somewhat more complicated to follow). At the very least
> >> the shorter variants should be used for -Os imo.
> > As far as I can recall, this was done because we cannot load zeroes
> > into the upper 16 registers (xmm16-xmm31), which can only be encoded with EVEX.
> 
> Note on Zen pxor on %ymm also takes double amount of resources
> as that on %xmm.
> 
> It would be nice to fix this (and maybe also factor the ability to
> reference upper 16 MM regs in costing during RA ...?).
I think this is not a bad idea: replace the insn when we know that the
register number is less than 16.
> 
> Richard.
> 
> >>
> >> Thanks for any insight,
> >> Jan
> >>
> >
> > --
> > Thanks, K

--
Thanks, K


Re: loading of zeros into {x,y,z}mm registers

2017-11-30 Thread Kirill Yukhin
Hello Jan,
On 29 Nov 08:59, Jan Beulich wrote:
> Kirill,
> 
> in an unrelated context I've stumbled across a change of yours
> from Aug 2014 (revision 213847) where you "extend" the ways
> of loading zeros into registers. I don't understand why this was
> done, and the patch submission mail also doesn't give any reason.
> My point is that simple VEX-encoded vxorps/vxorpd/vpxor with
> 128-bit register operands ought to be sufficient to zero any width
> registers, due to the zeroing of the high parts the instructions do.
> Hence by using EVEX encoded insns it looks like all you do is grow
> the instruction length by one or two bytes (besides making the
> source somewhat more complicated to follow). At the very least
> the shorter variants should be used for -Os imo.
As far as I can recall, this was done because we cannot load zeroes
into the upper 16 registers (xmm16-xmm31), which can only be encoded with EVEX.
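
Purely as an illustration of the encoding constraint (not taken from the patch):
a VEX-encoded xor can only name xmm0-xmm15, so zeroing one of the new registers
needs an EVEX form, e.g.

vpxor   %xmm0,  %xmm0,  %xmm0    # VEX; also clears bits 128..511 of zmm0
vpxord  %zmm16, %zmm16, %zmm16   # xmm16-xmm31 cannot be encoded in VEX, so EVEX is needed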

> 
> Thanks for any insight,
> Jan
>

--
Thanks, K


[RFC, VECTOR ABI] Allow __attribute__((vector)) in GCC by default.

2015-10-05 Thread Kirill Yukhin
Hello,
Recently a vector ABI was introduced into GCC, and vector versions of math
functions were incorporated into glibc starting from v2.22.
Unfortunately, to get these functions to work, the `-fopenmp' switch must be
added to the compiler invocation. This is because the vector variants of the
math functions are declared using the `omp declare simd' pragma.

There's an alternative: use __attribute__((vector)) on the function. Currently
it is only enabled under the `-fcilkplus' switch.

To enable vectorization of loops with calls to math functions, it is reasonable
to parse the vector attribute for functions unconditionally and to change glibc's
header file to use __attribute__((vector)) instead of `omp declare simd'. A
minimal sketch of the change is below.
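
A hypothetical pair of declarations showing the two spellings (not the actual
glibc header content):

/* Today: requires OpenMP pragma parsing (-fopenmp) to be recognized.  */
#pragma omp declare simd
double sin (double x);

/* Proposed: recognized unconditionally; today only under -fcilkplus.  */
__attribute__((vector)) double sin (double x);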

If the community has no objections, I'll prepare a patch for GCC
main trunk & glibc.

--
Thanks, K


Re: Offloading GSOC 2015

2015-03-20 Thread Kirill Yukhin
Hello Güray,
 
On 20 Mar 12:14, guray ozen wrote:
> I've started to prepare my gsoc proposal for gcc's openmp for gpus.
I think there is a wide range for exploration here. As you know, OpenMP 4
contains vectorization pragmas (`pragma omp simd') which do not suit GPGPU
perfectly (a minimal example of such a loop is sketched below).
Another problem is how to create threads dynamically on a GPGPU. As far as
we understand it, there are two possible solutions:
  1. Use dynamic parallelism available in recent APIs (launch a new kernel from
  the target)
  2. Estimate the maximum thread number on the host and start them all from the
  host, making unused threads busy-wait
There are papers which investigate both approaches [1], [2].
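
For reference, a minimal OpenMP 4.0 loop of the kind being discussed (an
illustrative sketch, not code from the proposal), combining offloading with
the simd construct:

#pragma omp target map(to: b[0:n], c[0:n]) map(from: a[0:n])
#pragma omp parallel for simd
for (int i = 0; i < n; i++)
  a[i] = b[i] * c[i];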

> However i'm little bit confused about which ideas, i mentioned last my
> mail, should i propose or which one of them is interesting for gcc.
> I'm willing to work on data clauses to enhance performance of shared
> memory. Or maybe it might be interesting to work on OpenMP 4.1 draft
> version. How do you think i should propose idea?
We're going to work on OpenMP 4.1 offloading features.

[1] - http://openmp.org/sc14/Booth-Sam-IBM.pdf
[2] - http://dl.acm.org/citation.cfm?id=2688364

--
Thanks, K


Re: [PATCH 0/4] OpenMP 4.0 offloading to Intel MIC

2014-11-13 Thread Kirill Yukhin


Hi Tobias,
On 13 Nov 16:15, Tobias Burnus wrote:
> Kirill Yukhin wrote:
> > Support of OpenMP 4.0 offloading to future Xeon Phi was
> > fully checked in to main trunk.
> 
> Thanks. If I understood it correctly:
> 
> * GCC 5 supports code generation for Xeon Phi (Knights Landing, KNL)
Right.

> * KNL (the hardware) is not yet available [mid 2015?]
Yes, but I don't know the date.

> * liboffloadmic supports offloading in an emulation mode (executed on
>   the host) but does not (yet) support offloading to KNL; i.e. one
>   would need an updated version of it, once one gets hold of the
>   actual hardware.
Yes, it supports emulation mode. Also, the current scheme is the same as
for KNC (however, we have no code generator for KNC in GCC main trunk).
We're going to keep liboffloadmic up to date.

> * The current hardware (Xeon Phi Knights Corner, KNC) is and will not
>   be supported by GCC.
Currently GCC main trunk doesn't support KNC code gen.

> * Details for building GCC for offloading and running code on an
> accelerator is at https://gcc.gnu.org/wiki/Offloading
> 
> Question: Is the latter up to date - and the item above correct?
Correct.

> BTW: you could update gcc.gnu.org ->news and gcc.gnu.org/gcc-5/changes.html
Thanks, I'll post a patch.

--
Thanks, K


Re: [PATCH 0/4] OpenMP 4.0 offloading to Intel MIC

2014-11-13 Thread Kirill Yukhin
Hello,

Support of OpenMP 4.0 offloading to the future Xeon Phi has been fully checked
into main trunk.

Thanks to everybody who helped with development and review.

--
Thanks, K


Re: Offload Library

2014-06-24 Thread Kirill Yukhin
Hello David,
On 20 Jun 14:46, David Edelsohn wrote:
> On Fri, May 16, 2014 at 7:47 AM, Kirill Yukhin  
> wrote:
> > Does this look OK?
> 
> The GCC SC has decided to allow this library in the GCC sources.
Great news, thanks!

> If the library is not going to be expanded to support all GPUs and
> offload targets, the library name should be more specific to Intel.
Sure, we'll prepare the updated sources and start the review in the nearest days.

--
Thanks, K


Re: Offload Library

2014-05-26 Thread Kirill Yukhin
Hello,
On 19 May 16:53, Kirill Yukhin wrote:
> Hello Ian,
> On 16 May 07:07, Ian Lance Taylor wrote:
> > On Fri, May 16, 2014 at 4:47 AM, Kirill Yukhin  
> > wrote:
> > >
> > > To support the offloading features for Intel's Xeon Phi cards
> > > we need to add a foreign library (liboffload) into the gcc repository.
> > > README with build instructions is attached.
> > 
> > Can you explain why this library should be part of GCC, and how GCC
> > would use it?  I'm sure it's obvious to you but it's not obvious to
> > me.
> The ‘target’ construct of OpenMP 4.0, a.k.a. ‘offloading’ support, is expected
> to be a part of libgomp. Every target platform that will be supported should
> implement a dedicated plugin for libgomp. The plugin for Xeon Phi is based on
> the liboffload functionality.
> This library will also provide compatibility for binaries built with ICC.

Hello,
Let me provide some technical details.
This is a compiler-specific library whose origin goes back to ICC.
We think that, as long as this library is compiler-related, it is better to
put it in the GCC source tree.
Also, we plan to accompany this library with an emulator of the lower-level
interface.
Having such a library and an emulator, we'll eliminate all external dependencies
for offloading to work.
So, it will be possible to do `make check’ for offload tests without needing to
link in any external libraries; no offload HW will be needed either.

The intended use of this library is as follows. The implementation of libgomp’s
interface routines for offload (like `GOMP_target’) will call the target-dependent
part of the library. This part (for COI-capable software stacks) will call
liboffload routines to perform any offload-related tasks; a rough sketch of this
call chain is below.
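
The sketch reuses the GOMP_target shape shown later in this archive; the plugin
hook name is hypothetical and is only meant to illustrate the layering, not the
actual libgomp plugin interface:

/* libgomp, target-independent part.  */
void
GOMP_target (int device, void (*fn) (void *), const char *fnname,
             size_t mapnum, void **hostaddrs, size_t *sizes,
             unsigned char *kinds)
{
  device = resolve_device (device);
  if (device == -1)
    {
      fn (hostaddrs);                 /* host fallback */
      return;
    }
  /* Otherwise hand the work to the plugin for this device; the Xeon Phi
     plugin would implement this hook on top of liboffload, which talks to
     COI on COI-capable software stacks.  */
  device_run (device, fnname, mapnum, hostaddrs, sizes, kinds);  /* hypothetical hook */
}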

This library is used by ICC and we plan to use it for offloading in LLVM as
well. That will make sure that the offloading features of these compilers are
binary compatible.

The library is going to be built as a shared object.


--
Thanks, K


Re: Offload Library

2014-05-19 Thread Kirill Yukhin
Hello, Thomas!

On 16 May 19:30, Thomas Schwinge wrote:
> On Fri, 16 May 2014 15:47:58 +0400, Kirill Yukhin  
> wrote:
> > To support the offloading features for Intel's Xeon Phi cards
> > we need to add a foreign library (liboffload) into the gcc repository.
> 
> As written in the README, this library currently is specific to Intel
> hardware (understandably, of course), and I assume also in the future is
> to remain that way (?) -- should it thus get a more specific name in GCC,
> than the generic liboffload?
Yes, this library generates calls to the Intel-specific Coprocessor Offload
Interface (COI).
I think the name of the library may be changed; we'll discuss it when I submit
the patch.

> > Additionally to that sources we going to add few headers [...]
> > and couple of new sources
> 
> For interfacing with GCC, presumably.  You haven't stated it explicitly,
> but do I assume right that this work will be going onto the
> gomp-4_0-branch, integrated with the offloading work developed there, as
> a plugin for libgomp?
Not exactly. I was talking about the COI emulator, which will allow testing of
offloading without any external library dependency or HW.
The libgomp <-> liboffload plug-in is also ready, but it needs no such approval,
so it’ll be submitted as a separate patch.

--
Thanks, K

> Regards,
>  Thomas




Re: Offload Library

2014-05-19 Thread Kirill Yukhin
Hello Ian,
On 16 May 07:07, Ian Lance Taylor wrote:
> On Fri, May 16, 2014 at 4:47 AM, Kirill Yukhin  
> wrote:
> >
> > To support the offloading features for Intel's Xeon Phi cards
> > we need to add a foreign library (liboffload) into the gcc repository.
> > README with build instructions is attached.
> 
> Can you explain why this library should be part of GCC, and how GCC
> would use it?  I'm sure it's obvious to you but it's not obvious to
> me.
The ‘target’ construct of OpenMP 4.0, a.k.a. ‘offloading’ support, is expected
to be a part of libgomp. Every target platform that will be supported should
implement a dedicated plugin for libgomp. The plugin for Xeon Phi is based on
the liboffload functionality.
This library will also provide compatibility for binaries built with ICC.

--
Thanks, K

> 
> Ian


Offload Library

2014-05-16 Thread Kirill Yukhin
Dear steering committee,

To support the offloading features for Intel's Xeon Phi cards
we need to add a foreign library (liboffload) into the gcc repository.
README with build instructions is attached.

I am also copy-pasting the header comment from one of the liboffload files.
The header shown below will be in all the source files in liboffload.

Sources can be downloaded from [1].

In addition to those sources, we are going to add a few headers (released under
the GPL v2.1 license) and a couple of new sources (license at the bottom of the
message).

Does this look OK?

[1] - https://www.openmprtl.org/sites/default/files/liboffload_oss.tgz

--
Thanks, K

/*
Copyright (c) 2014 Intel Corporation.  All Rights Reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

  * Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
  * Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
  * Neither the name of Intel Corporation nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

   README for Intel(R) Offload Runtime Library
   ===========================================

How to Build Documentation
==========================

The main documentation is in Doxygen* format, and this distribution
should come with pre-built PDF documentation in doc/Reference.pdf.
However, an HTML version can be built by executing:

% doxygen doc/doxygen/config

in this directory.

That will produce HTML documentation in the doc/doxygen/generated
directory, which can be accessed by pointing a web browser at the
index.html file there.

If you don't have Doxygen installed, you can download it from
www.doxygen.org.


How to Build the Intel(R) Offload Runtime Library
=================================================

The Makefile at the top-level will attempt to detect what it needs to
build the Intel(R) Offload Runtime Library.  To see the default settings,
type:

make info

You can change the Makefile's behavior with the following options:

root_dir:         The path to the top-level directory containing the
                  top-level Makefile.  By default, this will take on the
                  value of the current working directory.

build_dir:        The path to the build directory.  By default, this will
                  take on the value [root_dir]/build.

mpss_dir:         The path to the Intel(R) Manycore Platform Software
                  Stack install directory.  By default, this will take on
                  the value of the operating system's root directory.

compiler_host:    Which compiler to use for the build of the host part.
                  Defaults to "gcc"*.  Also supports "icc" and "clang"*.
                  You should provide the full path to the compiler or it
                  should be in the user's path.

compiler_target:  Which compiler to use for the build of the target part.
                  Defaults to "gcc"*.  Also supports "icc" and "clang"*.
                  You should provide the full path to the compiler or it
                  should be in the user's path.

options_host:     Additional options for the host compiler.

options_target:   Additional options for the target compiler.

To use any of the options above, simply add <option>=<value>.  For
example, if you want to build with icc instead of gcc, type:

make compiler_host=icc compiler_target=icc


Supported RTL Build Configurations
==================================

Supported Architectures: Intel(R) 64, and Intel(R) Many Integrated
Core Architecture

[Table of supported build configurations truncated in the archived message;
 compiler columns were icc/icl, gcc and clang.]

[gomp4] Building binaries for offload.

2013-10-15 Thread Kirill Yukhin
Hello,
Let me summarize the current understanding of host binary linking as well as
target binary building/linking.

We put the code which is supposed to be offloaded into dedicated sections,
with names starting with gnu.target_lto_

At link time (I mean, link time of the host app):
  1. Generate a dedicated data section in each binary (executable or DSO),
 which'll be a placeholder for the offloading stuff.

  2. Generate an __OPENMP_TARGET__ (weak, hidden) symbol,
 which'll point to the start of the section mentioned in the previous item.

This section should contain at least:
  1. Number of targets
  2. Size of the offload symbols table

  [ Repeated `number of targets' times ]
  3. Name of the target
  4. Offset to the beginning of the image to offload to that target
  5. Size of the image

  6. Offload symbols table

The offload symbols table will contain information about the addresses
of offloadable symbols, in order to create the mapping between host and target
addresses at runtime.

To get the list of target addresses we need a dedicated interface call in the
libgomp plugin, something like getTargetAddresses (), which will
query the target for the list of addresses (accompanied by symbol names).
To get this information, the target DSO should contain a similar table
mapping symbols to addresses.

The application is going to have a single instance of libgomp, which
in turn means that we'll have a single splay tree holding the mapping
information (host -> target) for all DSOs and the executable.

When GOMP_target* is called, a pointer to the table of the current execution
module is passed to libgomp along with a pointer to the routine (or global).
libgomp in turn:
  1. Checks in the splay tree whether the address of the given pointer (to the
 table) exists. If not, the given table is not yet initialized;
 libgomp initializes it (see below) and inserts the address of the table
 into the splay tree.
  2. Performs a lookup of the (host) address in the table provided
 and extracts the target address.
  3. Once the target address is found, we perform an API call (passing that
 address) to the given device.

We have at least 2 approaches to solving the host->target mapping.

I. Preserve the order of symbol appearance.
   Table row: [ address, size ]
   For routines, the size is 1.

   In order to initialize the table we need two arrays:
   one of host and one of target addresses. The order of appearance of objects in
   these arrays must be the same. Having this makes the mapping easy:
   we just need to find the index of a given address in the array of host
   addresses and then dereference the array of target addresses with the index
   found.

   The problem is that this is unlikely to work when host LTO is on.
   I am also not sure that the order of handling objects on the target is the
   same as on the host.

II. Store a symbol identifier along with the address.
  Table row: [ symbol_name, address, size ]
  For routines, the size is 1.

  To construct the table of host addresses, at link
  time we put all symbol addresses (marked at compile time with a dedicated
  attribute) into the table, accompanied by the symbol names (they'll
  serve as keys).

  During initialization of the table we create the host->target address mapping
  using the symbol names as keys; a sketch of such a table row is below.
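
What an approach II table row and the runtime mapping entry could look like
(hypothetical names and layout, just to make the idea concrete):

#include <stddef.h>

/* One row of the per-module host table, emitted at link time.  */
struct offload_sym
{
  const char *name;   /* symbol name, used as the lookup key      */
  void *host_addr;    /* address of the symbol in the host image  */
  size_t size;        /* object size in bytes; 1 for routines     */
};

/* Host -> target pair kept in libgomp's splay tree after initialization.  */
struct addr_pair
{
  void *host_addr;
  void *target_addr;
};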

The last thing I wanted to summarize: compiling the target code.

We have 2 approaches here:

   1. Perform WPA and extract the sections marked as target into a separate
  object file, then call the target compiler on that object file to produce
  the binary.

  As mentioned by Jakub, this approach will complicate debugging.

   2. Pass the fat object files directly to the target compiler (one CU at a
  time). So, for every object file we are going to call GCC twice:
  - host GCC, which will compile all host code for every CU;
  - target GCC, which will compile all target code for every CU.

I vote for option #2, since the WPA-based approach complicates debugging.
What do you guys think?

--
Thanks, K


[gomp4] GOMP_target fall back execution

2013-09-18 Thread Kirill Yukhin
Hello,
It seems that currently GOMP_target performs a call to the host variant of the routine:

void
GOMP_target (int device, void (*fn) (void *), const char *fnname,
 size_t mapnum, void **hostaddrs, size_t *sizes,
 unsigned char *kinds)
{
  device = resolve_device (device);
  if (device == -1)
{
  /* Host fallback.  */
  fn (hostaddrs);
  return;
}
...
}

Why not make GOMP_target return bool and run the host fallback at the
application site instead of at the library site?
Was:
  ...
  GOMP_target (handle, &foo, "foo", ...);
  ...
Proposed to be:
  ...
  if (!GOMP_target (handle, &foo, "foo", ...))
foo (hostaddrs);
  ...
The main advantage, IMHO, is that by doing so we can probably enable inlining
of the host variant.
We could also eliminate the pointer to the host function and hostaddrs from the
GOMP_target declaration, having instead of:
void
GOMP_target (int device, void (*fn) (void *), const char *fnname,
 size_t mapnum, void **hostaddrs, size_t *sizes,
 unsigned char *kinds)

This:
bool
GOMP_target (int device, const char *fnname,
 size_t mapnum, size_t *sizes,
 unsigned char *kinds)

What do you think?

--
Thanks, K


Re: [RFC] Offloading Support in libgomp

2013-09-13 Thread Kirill Yukhin
Hello,
Adding Richard who might want to take a look at LTO stuff.

--
Thanks, K


Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512

2013-08-02 Thread Kirill Yukhin
On 30 Jul 17:55, Kirill Yukhin wrote:
> On Wed, Jul 24, 2013 at 08:25:14AM -1000, Richard Henderson wrote:
> > On 07/24/2013 05:23 AM, Richard Biener wrote:
> > > "H.J. Lu"  wrote:
> > > 
> > >> Hi,
> > >>
> > >> Here is a patch to extend x86-64 psABI to support AVX-512:
> > > 
> > > Afaik avx 512 doubles the amount of xmm registers. Can we get them callee 
> > > saved please?

Hello,
I've implemented a tiny patch on top of the `avx512' branch.
It makes the first 128-bit parts of 8 AVX-512 registers callee-saved: xmm16
through xmm23.

Here is performance data. It seems we have a little degradation in GEOMEAN.

Workload: Spec2006
Dataset: test
Options experiment: -m64 -fstrict-aliasing -fno-prefetch-loop-arrays -Ofast 
-funroll-loops -flto -fwhole-program -mavx512f
Options reference  : -m64 -fstrict-aliasing -fno-prefetch-loop-arrays -Ofast
-funroll-loops -flto -fwhole-program

Benchmark         8-callee-save      all-call-clobber     icount
                  icount             icount               decrease

400.perlbench     1686198567         1682320942           -0.23%
401.bzip2         18983033855        18983033907           0.00%
403.gcc           3999481141         3999095681           -0.01%
410.bwaves        13736672428        13736640026           0.00%
416.gamess        1531782811         1531350122           -0.03%
429.mcf           3079764286         3080957858            0.04%
433.milc          14628097067        14628175244           0.00%
434.zeusmp        21336261982        21359384879           0.11%
435.gromacs       3593653152         3588581849           -0.14%
436.cactusADM     2822346689         2828797842            0.23%
437.leslie3d      15903712760        15975143040           0.45%
444.namd          42446067469        43607637322           2.74%
445.gobmk         35272482208        35268743690          -0.01%
447.dealII        42476324881        42507009849           0.07%
450.soplex        4594315045652666                         -0.63%
453.povray        2314481169         157619                -3.99%
454.calculix      131024939          131078501              0.04%
456.hmmer         13853478444        13853306947            0.00%
458.sjeng         14173066874        14173066909            0.00%
459.GemsFDTD      2437559044         2437819638             0.01%
462.libquantum    175827242          175657854             -0.10%
464.h264ref       75718510217        75711714226           -0.01%
465.tonto         2505737844         2511457541             0.23%
470.lbm           4799298802         4812180033             0.27%
473.astar         17435751523        17435498947            0.00%
481.wrf           7144685575         7170593748             0.36%
482.sphinx3       6000198462         5984438416            -0.26%
483.xalancbmk     273958223          273638145             -0.12%

GEOMEAN           4678862313         4677012093            -0.04%

Bigger % is better; a negative value means that the icount increased in the
experiment.


It seems to me that LRA is not always optimal, e.g. if you compile attached 
testcase
with: ./build-x86_64-linux/gcc/xgcc -B./build-x86_64-linux/gcc repro.c -S 
-Ofast -mavx512f

Assembler for main looks like:
main:
.LFB2331:
vcvtsi2ss   %edi, %xmm1, %xmm1
subq$24, %rsp
vextractf32x4   $0x0, %zmm16, (%rsp)
vmovaps %zmm1, %zmm16
calltest
vfmadd132ss .LC1(%rip), %xmm16, %xmm16
vmovaps %zmm16, %zmm2
movl$.LC2, %edi
movl$1, %eax
vunpcklps   %xmm2, %xmm2, %xmm2
vcvtps2pd   %xmm2, %xmm0
callprintf
vmovaps %zmm16, %zmm3
vinsertf32x4$0x0, (%rsp), %zmm16, %zmm16
addq$24, %rsp
vcvttss2si  %xmm3, %eax
ret
I have no idea why we are doing the conversion in %xmm1 and then saving it to
%xmm16. However, it may be a non-LRA issue.

Thanks, K


---
 gcc/config/i386/i386.c | 2 +-
 gcc/config/i386/i386.h | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 6b13ac9..d6d8040 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -9125,7 +9125,7 @@ ix86_nsaved_sseregs (void)
   int nregs = 0;
   int regno;
 
-  if (!TARGET_64BIT_MS_ABI)
+  if (!(TARGET_64BIT_MS_ABI || TARGET_AVX512F))
 return 0;
   for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
 if (SSE_REGNO_P (regno) && ix86_save_reg (regno, true))
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index d7a934d..9faab8b 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1026,9 +1026,9 @@ enum target_cpu_default
 /*xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15*/  \
  6,   6,6,6,6,6,6,6,   \
 /*xmm16,xmm17,xmm18,xmm19,xmm20,xmm21,xmm22,xmm23*/\
- 6,6, 6,6,6,6,6,6, \
+ 0,0, 0,0,0,0,0,0, \
 /*xmm24,xmm25,xmm26,xmm27,xmm28,xmm29,xmm30,xmm31*/

Re: [x86-64 psABI]: Extend x86-64 psABI to support AVX-512

2013-07-30 Thread Kirill Yukhin
On Wed, Jul 24, 2013 at 08:25:14AM -1000, Richard Henderson wrote:
> On 07/24/2013 05:23 AM, Richard Biener wrote:
> > "H.J. Lu"  wrote:
> > 
> >> Hi,
> >>
> >> Here is a patch to extend x86-64 psABI to support AVX-512:
> > 
> > Afaik avx 512 doubles the amount of xmm registers. Can we get them callee 
> > saved please?
> 
> Having them callee saved pre-supposes that one knows the width of the 
> register.

The whole SSE/AVX architecture is based on zeroing of the upper bits.
For reference, take a look at the definition of VLMAX in the spec.
E.g. for AVX2 we had:
 vaddps %ymm1, %ymm2, %ymm3

Intuition says (at least to me) that after compilation such code shouldn't have
any notion of a 256-bit `upper' half.
But with AVX-512 we have (again, see Spec, operation section of vaddps, VEX.256 
encoded):
DEST[31:0] = SRC1[31:0] + SRC2[31:0]
...
DEST[255:224] = SRC1[255:224] + SRC2[255:224].
DEST[MAX_VL-1:256] = 0
So, legacy code *will* change the upper 256 bits of the vector register.

The roots of this can be found in the 64-bit GPR insns. We have different
behavior on 64-bit and 32-bit targets for the following sequence:
push %eax
;; play with eax
pop %eax
on a 64-bit machine the upper 32 bits of %rax will be zeroed, and if we try to
use the old value of %rax - fail!

So, following this philosophy prohibits making vector registers callee-saved.

BUT.

What if we make a couple of registers callee-saved in the *scalar* sense?
So, what we can do:
1. make only bits [0..XXX] of a vector register callee-saved;
2. make bits (XXX..VLMAX] of the same register call-clobbered.

XXX is the number of bits to be callee-saved: 64, 80, 128 or even 512.

The advantage is that when we are doing scalar FP code, we don’t have to bother
with saving/restoring the callee-saved part.
vaddss %xmm17, %xmm17, %xmm17
call foo
vaddss %xmm17, %xmm17, %xmm17

We don’t care whether `foo’:
- is legacy in the AVX-512 sense – it simply doesn’t see xmm17;
- or is legacy in a future-ISA sense: if this code uses a 1024-bit wide register
and `foo’ is AVX-512, it will save only XXX bits, allowing us to continue the
scalar calculation without a save/restore.

--
Thanks, K


Re: setjmp () detection in RTL

2013-02-14 Thread Kirill Yukhin
> Isn't the REG_SETJMP note sufficient for this purpose?
Yeah, missed that. Sorry for flood. Thanks a lot!
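
For the record, a minimal way to check it (sketch only; find_reg_note is the
standard helper for this):

/* INSN is a setjmp-like call if the call_insn carries a REG_SETJMP note.  */
if (CALL_P (insn) && find_reg_note (insn, REG_SETJMP, NULL_RTX))
  {
    /* ... handle the setjmp-like call ...  */
  }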


setjmp () detection in RTL

2013-02-14 Thread Kirill Yukhin
Hi,
Could anybody please advise how I can detect that a given RTL `call` is
actually a setjmp ()?

I see no references in dump...
(call_insn 6 5 7 (set (reg:SI 0 ax)
(call (mem:QI (symbol_ref:DI ("_setjmp") [flags 0x41]
) [0 _setjmp S1 A8])
(const_int 0 [0]))) 4.c:17 -1
 (expr_list:REG_SETJMP (const_int 0 [0])
(expr_list:REG_EH_REGION (const_int 0 [0])
(nil)))
(expr_list:REG_FRAME_RELATED_EXPR (use (reg:DI 5 di))
(nil)))

Thanks


Re: Vectorizer question: DIV to RSHIFT conversion

2011-12-13 Thread Kirill Yukhin
Great!

Thanks, K
>
> Let me hack up a quick pattern recognizer for this...
>
>        Jakub


Re: Vectorizer question: DIV to RSHIFT conversion

2011-12-13 Thread Kirill Yukhin
The full case is attached.

Jakub, you are right, we have to convert signed ints into something a
bit trickier.
BTW, here is the output for that case from the Intel compiler:

vpxor %ymm1, %ymm1, %ymm1   #184.23
vmovdqu   .L_2il0floatpacket.12(%rip), %ymm0#184.23
movslq%ecx, %rdi#182.7
# LOE rax rbx rdi edx ecx ymm0 ymm1
..B1.82:# Preds ..B1.82 ..B1.81
vmovdqu   2132(%rsp,%rax,4), %ymm3  #183.14
vmovdqu   2100(%rsp,%rax,4), %ymm2  #183.27
vpsrad$8, %ymm3, %ymm11 #183.27
vpsrad$8, %ymm2, %ymm5  #183.27
vpcmpgtd  %ymm5, %ymm1, %ymm4   #184.23
vpcmpgtd  %ymm11, %ymm1, %ymm10 #184.23
vpand %ymm0, %ymm4, %ymm6   #184.23
vpand %ymm0, %ymm10, %ymm12 #184.23
vpaddd%ymm6, %ymm5, %ymm7   #184.23
vpaddd%ymm12, %ymm11, %ymm13#184.23
vpsrad$1, %ymm7, %ymm8  #184.23
vpsrad$1, %ymm13, %ymm14#184.23
vpaddd%ymm0, %ymm8, %ymm9   #184.23
vpaddd%ymm0, %ymm14, %ymm15 #184.23
vpslld$8, %ymm9, %ymm2  #185.27
vpslld$8, %ymm15, %ymm3 #185.27
vmovdqu   %ymm2, 2100(%rsp,%rax,4)  #185.10
vmovdqu   %ymm3, 2132(%rsp,%rax,4)  #185.10
addq  $16, %rax #182.7
cmpq  %rdi, %rax#182.7
jb..B1.82   # Prob 99%  #182.7

Thanks, K

On Tue, Dec 13, 2011 at 5:21 PM, Jakub Jelinek  wrote:
> On Tue, Dec 13, 2011 at 02:07:11PM +0100, Richard Guenther wrote:
>> > Hi guys,
>> > While looking at Spec2006/401.bzip2 I found such a loop:
>> >     for (i = 1; i <= alphaSize; i++) {
>> >       j = weight[i] >> 8;
>> >       j = 1 + (j / 2);
>> >       weight[i] = j << 8;
>> >     }
>
> It would be helpful to have a self-contained testcase, because we don't know
> the types of the variables in question.  Is j signed or unsigned?
> Signed divide by 2 is unfortunately not equivalent to >> 1.
> If j is signed int, on x86_64 we expand j / 2 as (j + (j >> 31)) >> 1.
> Sure, the pattern recognizer could try that if vector division isn't
> supported.
> If j is unsigned int, then I'd expect it to be already canonicalized into >>
> 1 by the time we enter the vectorizer.
>
>        Jakub
int weight [ 258 * 2 ];

void foo(int alphaSize) {   
  int j, i;
  for (i = 1; i <= alphaSize; i++) {
j = weight[i] >> 8;
j = 1 + (j / 2);
weight[i] = j << 8;
  }
}


Vectorizer question: DIV to RSHIFT conversion

2011-12-13 Thread Kirill Yukhin
Hi guys,
While looking at Spec2006/401.bzip2 I found such a loop:
for (i = 1; i <= alphaSize; i++) {
  j = weight[i] >> 8;
  j = 1 + (j / 2);
  weight[i] = j << 8;
}

Which is not vectorizable (using Intel's AVX2), because the division by two
is not recognized as a right shift:
  5: ==> examining statement: D.3785_6 = j_5 / 2;

  5: vect_is_simple_use: operand j_5
  5: def_stmt: j_5 = D.3784_4 >> 8;

  5: type of def: 3.
  5: vect_is_simple_use: operand 2
  5: op not supported by target.
  5: not vectorized: relevant stmt not supported: D.3785_6 = j_5 / 2;

However, while expanding, it is successfully turned into shift:
  (insn 42 41 43 6 (parallel [
  (set (reg:SI 107)
  (ashiftrt:SI (reg:SI 106)
  (const_int 1 [0x1])))
  (clobber (reg:CC 17 flags))
  ]) 1.c:7 -1
   (expr_list:REG_EQUAL (div:SI (reg:SI 103)
  (const_int 2 [0x2]))
  (nil)))

Converting `division by a power of 2` into a shift seems to be beneficial in
general.
My question is: what is, in your opinion, the best way to do such a conversion?
The obvious solution would be to introduce a dedicated pass which converts all
such cases.
We could also try to implement a dedicated expander, but I have no idea how to
specify in the pattern name (if that is possible) that the second operand is
something fixed. A scalar sketch of the equivalence is below.
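
For reference, the scalar identity a recognizer would need, matching how x86_64
expands signed division by 2 (as Jakub points out elsewhere in this thread);
sketch for 32-bit int:

/* Unsigned: j / 2 is simply j >> 1.
   Signed (rounding toward zero): add the sign bit before the arithmetic shift.  */
unsigned half_unsigned (unsigned j) { return j >> 1; }                          /* == j / 2 */
int      half_signed   (int j)      { return (j + (int) ((unsigned) j >> 31)) >> 1; }  /* == j / 2 */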

Any help is appreciated.

Thanks, K


Re: _mm{,256}_i{32,64}gather_{ps,pd,epi32,epi64} intrinsics semantics

2011-11-05 Thread Kirill Yukhin
Hello Jakub,
I've talked to our engineers who work on vectorization in ICC.
They all said, "yes, you can optimize the vpxor out both in f1 and f2".
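
For context, the two intrinsic forms in question (the bodies below are an
illustrative sketch, not Jakub's original f1/f2):

#include <immintrin.h>

/* Form without src/mask: there is no source operand to zero at all.  */
__m256 f1 (const float *p, __m256i idx)
{
  return _mm256_i32gather_ps (p, idx, 4);
}

/* Masked form: elements whose mask bit is clear are taken from SRC, so with an
   all-ones mask the SRC value is never observed and pre-zeroing it is dead.  */
__m256 f2 (const float *p, __m256i idx, __m256 src, __m256 mask)
{
  return _mm256_mask_i32gather_ps (src, p, idx, mask, 4);
}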

Thanks, K


Re: _mm{,256}_i{32,64}gather_{ps,pd,epi32,epi64} intrinsics semantics

2011-11-03 Thread Kirill Yukhin
> %ymm0 is all ones (this is code from the auto-vectorization).
> (2) is not useless, %ymm6 contains the mask, for auto-vectorization
> (3) is useless, it is there just because the current gather insn patterns
> always use the previous value of the destination register.
Sure, I constantly mix Intel/gcc syntax, sorry for the confusion.

I've asked the guys who are working on vectorization in ICC.
It seems we may drop the zeroing of the destination.
Here is an extract of the answer (from one of the engineers responsible for
vectorization in ICC):

>>> I think zero in this situation is just a garbage value,
>>> and I don’t see why GCC and ICC need to be garbage to garbage
>>> compatible. If the programmer is using such a fault handler, he/she
>>> should know the consequences.

K


Re: _mm{,256}_i{32,64}gather_{ps,pd,epi32,epi64} intrinsics semantics

2011-11-02 Thread Kirill Yukhin
Hi Jakub,
Actually, I did not get the point.
If we have no src/masking, the destination must be unchanged until the gather
writes to it (at least partially).
If we have all 1's in the mask, src must not be changed at all.
So, the nullification in the intrinsics is just useless.
Having such snippet:
(1)   vmovdqa k(%rax,%rax), %ymm1
(2)   vmovaps %ymm0, %ymm6
(3)   vmovaps %ymm0, %ymm2
(4)   vmovdqa k+32(%rax,%rax), %ymm3
(5)   vgatherdps  %ymm6, vf1(,%ymm1,4), %ymm2

Looks pretty strange. What value does ymm0 have? If it is all zeroes, then
(1)-(5) is dead code, which may just be removed.
If it contains all 1s, then (2) is useless.
But again, it seems I did not get your point...

Thanks, K


Re: GCC testting infrastructure issue

2011-09-28 Thread Kirill Yukhin
Thanks a lot. That is exactly what I was looking for!
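
For the archive, a minimal sketch of the combined test this enables (the
instruction name "insn" is a placeholder):

/* { dg-do run } */
/* { dg-options "-O2 -save-temps" } */

/* ... auto_gen_insn (), check_semantic () and main () as in run.c below ... */

/* { dg-final { scan-assembler-times "insn" 1 } } */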

K

On Wed, Sep 28, 2011 at 2:49 PM, Richard Guenther
 wrote:
> On Wed, Sep 28, 2011 at 12:18 PM, Kirill Yukhin  
> wrote:
>> Hi folks,
>> I have a question. For DejaGNU we have only one option for each test.
>>
>> It may be, e.g., either "dg-do compile" or "dg-do run". This is really
>> not very convenient.
>>
>> For instance, suppose we are checking auto-generation of some new instruction.
>> We have to do 2 tests:
>>  1. We have to write some routine which contains a pattern that
>> will be generated as the desired instruction, and check at runtime that this
>> was done correctly, comparing against some expected result. We use "dg-do
>> run" here.
>>  2. Next we have to check that the instruction really is auto-generated,
>> so we use "scan-assembler" for that source.
>>
>> My question is: am I missing something? Is there an opportunity to
>> store the two tests in a single file? If not, why do we not have one?
>
> Add -save-temps via dg-options, then you can use dg-scan-assembler.
>
>> Here is reduced example (from gcc.target/i386):
>> 1.
>> /* run.c */
>> /* { dg-do run } */
>>
>> int
>> auto_gen_insn(args...)
>> {
>>  /* Code to auto-gen instruction. */
>>  return result;
>> }
>>
>> int
>> check_semantic(args...)
>> {
>>  /* Code to do the same, but without desired insn. */
>>  return result
>> }
>>
>> int
>> main ()
>> {
>>  if( auto_gen_insn(args...) != check_semantic(args...) )
>>    abort ();
>> }
>>
>> 2.
>> /* check_gen.c */
>> /* { dg-do compile } */
>> #include "run.c"
>> /* { dg-final { scan-assembler-times "insn" 1 } } */
>>
>> --
>> Thanks, k
>>
>


GCC testting infrastructure issue

2011-09-28 Thread Kirill Yukhin
Hi folks,
I have a question. For DejaGNU we have only one option for each test.

It may be, e.g., either "dg-do compile" or "dg-do run". This is really
not very convenient.

For instance, suppose we are checking auto-generation of some new instruction.
We have to do 2 tests:
  1. We have to write some routine which contains a pattern that
will be generated as the desired instruction, and check at runtime that this
was done correctly, comparing against some expected result. We use "dg-do
run" here.
  2. Next we have to check that the instruction really is auto-generated,
so we use "scan-assembler" for that source.

My question is: am I missing something? Is there an opportunity to
store the two tests in a single file? If not, why do we not have one?

Here is reduced example (from gcc.target/i386):
1.
/* run.c */
/* { dg-do run } */

int
auto_gen_insn(args...)
{
  /* Code to auto-gen instruction. */
  return result;
}

int
check_semantic(args...)
{
  /* Code to do the same, but without desired insn. */
  return result
}

int
main ()
{
  if( auto_gen_insn(args...) != check_semantic(args...) )
abort ();
}

2.
/* check_gen.c */
/* { dg-do compile } */
#include "run.c"
/* { dg-final { scan-assembler-times "insn" 1 } } */

--
Thanks, k


Re: Defining constraint for registers tuple

2011-08-16 Thread Kirill Yukhin
That is exactly it! Thank you very much!
BMI2 support is almost here :)
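
For the archive, a rough sketch of the direction that example points to - two
independent "=r" outputs in one parallel instead of the fixed "A" pair. This is
a hypothetical pattern, not the final i386.md one:

(define_insn "*bmi2_umuldi3_sketch"
  [(set (match_operand:DI 0 "register_operand" "=r")
        (mult:DI (match_operand:DI 2 "register_operand" "d")
                 (match_operand:DI 3 "nonimmediate_operand" "rm")))
   (set (match_operand:DI 1 "register_operand" "=r")
        (truncate:DI
          (lshiftrt:TI
            (mult:TI (zero_extend:TI (match_dup 2))
                     (zero_extend:TI (match_dup 3)))
            (const_int 64))))]
  "TARGET_BMI2"
  "mulx\t{%3, %0, %1|%1, %0, %3}"
  [(set_attr "type" "imul")])

;; Operand 0 gets the low half and operand 1 the high half; both may be any GPR,
;; while operand 2 stays tied to %rdx as mulx requires.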

--
K

On Tue, Aug 16, 2011 at 6:58 PM, Richard Henderson  wrote:
> On 08/16/2011 04:20 AM, Kirill Yukhin wrote:
>> Hi guys,
>> the question is still opened. Let me try to explain further.
>>
>> The new MULX instruction is capable to store result of unsigned
>> multiply to arbitrary pair of GPRs (one of operands still must be DX).
>> But I have no idea, how to implement such a constraint.
>> Here is define_insn which is works but uses i386's "A" constraint. It
>> is much worse than using any pair of registers.
>
> See {u}mulsidi3_internal in mn10300.md for an example of a
> double-word multiplication with two independent outputs.
>
>
> r~
>


Re: define_split for specific split pass

2011-08-16 Thread Kirill Yukhin
I think Ilya wants to run his split, say, in 208r.split4 only. It seems
split2, split3 and split4 all run with `reload_completed' set to
true.

Any ideas?

--
Thanks, K

On Tue, Aug 16, 2011 at 8:47 PM, Andrew Pinski  wrote:
> On Tue, Aug 16, 2011 at 6:32 AM, Ilya Enkovich  wrote:
>> Hello,
>>
>> Is there any way to specify in define_split predicate that it should
>> work in some particular pass only? I need to create split which works
>> in pass_split_before_sched2 only.
>
> So split before RA?  try conventionalizing it on !reload_complete .
>
> Thanks,
> Andrew Pinski
>


Re: Defining constraint for registers tuple

2011-08-16 Thread Kirill Yukhin
Hi guys,
the question is still opened. Let me try to explain further.

The new MULX instruction is capable of storing the result of an unsigned
multiply into an arbitrary pair of GPRs (one of the operands still must be DX).
But I have no idea how to implement such a constraint.
Here is a define_insn which works but uses i386's "A" constraint. It
is much worse than allowing any pair of registers.
(define_insn "*bmi2_mulx<mode>3"
  [(set (match_operand:<DWI> 0 "register_operand" "=A")
        (mult:<DWI>
          (zero_extend:<DWI>
            (match_operand:DWIH 1 "nonimmediate_operand" "d"))
          (zero_extend:<DWI>
            (match_operand:DWIH 2 "nonimmediate_operand" "rm"))))]
  "TARGET_BMI2 && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
  "mulx\t%2, %%eax, %%edx"
  [(set_attr "type" "imul")
   (set_attr "length_immediate" "0")
   (set_attr "mode" "<MODE>")])

Maybe there are examples from other ports? Any help is appreciated.

Thanks, K



On Mon, Aug 1, 2011 at 4:28 PM, Kirill Yukhin  wrote:
>> Don't change the constraint, just add an alternative.  Or use a
>> different insn with an insn predicate.
>
> This is a misunderstanding because of my great English :)
>
> I am not going to update the existing constraint. I am going to implement a new one.
> Actually, I am looking for some example where a similar constraint
> might already be implemented.
>
> --
> Thanks, K
>


Re: Defining constraint for registers tuple

2011-08-01 Thread Kirill Yukhin
> Don't change the constraint, just add an alternative.  Or use a
> different insn with an insn predicate.

This is a misunderstanding because of my great English :)

I am not going to update the existing constraint. I am going to implement a new one.
Actually, I am looking for some example where a similar constraint
might already be implemented.

--
Thanks, K


Defining constraint for registers tuple

2011-07-29 Thread Kirill Yukhin
Hi guys,
I'm working on the implementation of `mulx` (which is part of BMI2). One
of the improvements compared to the generic `mul` is that it allows specifying
the destination registers.
For `mul` we have the `A` constraint, which stands for the AX:DX pair.
So, is there a possibility to relax such a constraint and allow any pair
of registers as the destination?

Thanks, K