On 2013-09-08, at 16:04 , Pekka Jääskeläinen <[email protected]> wrote:

> On 09/08/2013 10:13 PM, Erik Schnetter wrote:
>> Yes, we can use the same kernel library for all x86-64 architectures.
>> However, this would require disabling many performance features.
> 
> Which ones do you mean? The kernel libs are LLVM bitcode, which is not
> yet very target-specific as such, so the target-specific features are
> basically inline asm blocks selected using #ifdefs when building the
> bc?
> 
> IMO inline asm blocks should be avoided in the longer term anyway (try to use
> intrinsics instead), as we want to vectorize WGs as efficiently as possible
> and it's easier to vectorize intrinsic calls than inline asm blocks.

The inline asm statements are gone (in Vecmathlib); everything is done via 
intrinsics and Clang extensions.
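
For example, something along these lines (a minimal sketch of the approach, not
verbatim Vecmathlib code; vml_sqrt is just an illustrative name):

  #include <xmmintrin.h>

  typedef float float4 __attribute__((ext_vector_type(4)));  // Clang extension

  static inline float4 vml_sqrt(float4 x) {
    // The intrinsic stays visible to LLVM as ordinary IR, so later passes
    // (including a WG vectorizer) can still reason about it, unlike an
    // opaque asm("sqrtps ...") block.
    return (float4)_mm_sqrt_ps((__m128)x);
  }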

> Same goes for vector datatypes: we might want to "scalarize" them for
> more efficient WG vectorization, so we don't always want to use hand-coded
> vectorized versions of the functions dealing with them.
> 
> So, what about producing a "generic" bitcode lib without CPU feature
> specific inline asm blocks (perhaps only intrinsics calls), and then let
> the llc do its magic based on autodetection? The very final call to
> the llc from the fully linked work group function bitcode should be of
> the most importance here, right?

Some CPU attributes influence the ABI. These need to be set correctly at all 
times, otherwise the executable won't work. They affect e.g. the calling 
conventions for functions, which are explicitly represented in the bitcode. That 
is, a fully generic bitcode library is not possible, but we may be able to get 
away with just a few variants per architecture.
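
To make this concrete (sketch only): with AVX enabled, an 8-wide float argument 
is passed in a YMM register, while without AVX the same argument goes through 
memory, so code compiled with different feature sets cannot call across that 
boundary.

  typedef float float8 __attribute__((ext_vector_type(8)));

  float8 add8(float8 a, float8 b) {
    // with -mavx:  a and b arrive in %ymm0 / %ymm1
    // without AVX: they are passed indirectly, via memory
    return a + b;
  }

Linking a "generic" kernel library against a work-group function built with 
different target features can break at exactly such call sites.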

One would probably also need to make sure that earlier optimization passes don't 
expand builtins prematurely, since the most efficient implementation of a builtin 
may be a single instruction that exists only on some CPUs (e.g. popcount, clz).
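
As a sketch of what I mean (vml_popcount is just an illustrative name):

  static inline int vml_popcount(unsigned x) {
    // Keeping the builtin lets the final llc run emit a single popcnt
    // instruction on CPUs that have it, and fall back to a shift/mask
    // sequence elsewhere.  If an earlier pass had already expanded this
    // into the bit-twiddling sequence, that choice would be lost.
    return __builtin_popcount(x);
  }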

Apart from this, implementing the kernel library purely with scalar functions 
and builtins is possible. We would have to experiment with how to present this 
to the vectorizer to make its job as easy as possible. Currently we split e.g. 
an int16 operation into two int8 operations; this is a nicely recursive 
implementation, but the vectorizer may prefer a loop instead.
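
Roughly, the two presentations look like this (types and names are made up for 
illustration, not actual Vecmathlib code):

  struct int8  { int v[8]; };     // 8-wide integer vector
  struct int16 { int8 lo, hi; };  // 16-wide, stored as two 8-wide halves

  // (a) recursive splitting, as we do now:
  int8 add(int8 a, int8 b) {
    int8 r;
    for (int i = 0; i < 8; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
  }
  int16 add(int16 a, int16 b) {
    return { add(a.lo, b.lo), add(a.hi, b.hi) };
  }

  // (b) one flat loop over all 16 elements, which a loop vectorizer
  //     may find easier to recognize:
  struct int16_flat { int v[16]; };
  int16_flat add(int16_flat a, int16_flat b) {
    int16_flat r;
    for (int i = 0; i < 16; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
  }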

I should introduce an option to Vecmathlib to do this. This would make it easy 
to compare performance, and could point to shortcomings of the vectorizer (and, 
conversely, of Vecmathlib) that could then be addressed.

-erik

-- 
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/

My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu/.
