Re: [webkit-dev] SIMD support in JavaScript
Hi Maciej,

----- Original Message -----
> Dan, you say that SIMD.js delivers performance portability, and Nadav
> says it doesn't.
>
> Nadav's argument seems to come down to (as I understand it):
> - The set of vector operations supported on different CPU architectures
>   varies widely.

This is true, but it's also true that there is a core set of features which is pretty consistent across popular SIMD architectures. This commonality exists because it's a very popular set. The proposed SIMD.js doesn't solve all problems, but it does solve a large number of important problems well, and it follows numerous precedents.

We are also exploring the possibility of exposing additional instructions outside this core set. Several creative ideas are being discussed which could expand the API's reach while preserving a portability story. However, regardless of what we do there, I expect the core set to remain a prominent part of the API, due to its broad applicability.

> - "Executing vector intrinsics on processors that don't support them is
>   slower than executing multiple scalar instructions because the compiler
>   can't always generate efficient code with the same semantics."

This is also true; however, the intent of SIMD.js *is* to be implementable on all popular architectures. The SIMD.js spec was originally derived from the Dart SIMD spec, which is already implemented and in use on at least x86 and ARM. We are also taking some ideas from OpenCL, which offers a very similar set of core functionality and which is implemented on even more architectures. We have several reasons to expect that SIMD.js can cover enough functionality to be useful while still being sufficiently portable.

> - Even when vector intrinsics are supported by the CPU, whether it is
>   profitable to use them may depend in non-obvious ways on exact
>   characteristics of the target CPU and the surrounding code (the Port5
>   example).
With SIMD.js, there are plain integer types, so developers directly bypass plain JS number semantics, and there are fewer corner cases for which the compiler must insert extra checking code. This means fewer branches, which among other things should mean less port 5 contention overall on Sandy Bridge.

Furthermore, automatic vectorization often requires the compiler to make conservative assumptions about key information like pointer aliasing, trip counts, integer overflow, array indexing, load safety, scatter ordering, alignment, and more. In order to preserve observable semantics, these assumptions cause compilers to insert extra instructions, typically selects, shuffles, or branches, to handle all the possible corner cases. This is overhead that human programmers can often avoid, because they can more easily determine which corner cases are relevant in a given piece of code. And on Sandy Bridge in particular, these extra selects, shuffles, and branches hit port 5.

> For these reasons, Nadav says that it's better to autovectorize, and that
> this is the norm even for languages with explicit vector data. In other
> words, he's saying that SIMD.js will result in code that is not
> performance-portable between different CPUs.

I question whether it is actually the norm. In C++, where auto-vectorization is available in every major compiler today, explicit SIMD APIs like <xmmintrin.h> are hugely popular. That particular header has become supported by Microsoft's C++ compiler, Intel's C++ compiler, GCC, and clang. I see many uses of <xmmintrin.h> in many contexts, including HPC, graphics, codecs, cryptography, and games. It seems many C++ developers are still willing to go through the pain of #ifdefs, preprocessor macros, and funny-looking syntax rather than rely on auto-vectorization, even with "restrict" and other aids.

Both auto-vectorization and SIMD.js have their strengths, and both have their weaknesses.
I don't believe the fact that both solve some problems that the other doesn't rules out either of them.

> I don't see a rebuttal to any of these points. Instead, you argue that,
> because SIMD.js does not require advanced compiler analysis, it is more
> likely to give similar results between different JITs (presumably when
> targeting the same CPU, or ones with the same supported vector operations
> and similar perf characteristics). That seems like a totally different
> sense of performance portability.
>
> Given these arguments, it's possible that you and Nadav are both right[*].
> That would mean that both these statements hold:
> (a) SIMD.js is not performance-portable between different CPU
>     architectures and models.
> (b) SIMD.js is performance-portable between different JITs targeting the
>     same CPU model.
>
> On net, I think that combination would be a strong argument *against*
> SIMD.js. The Web aims for portability between different hardware and not
> just different software. At Apple alone we support four major CPU
> instruction sets and a considerably greate
Re: [webkit-dev] SIMD support in JavaScript
Hi Nadav,

----- Original Message -----
> Hi Dan!
>
> > On Sep 28, 2014, at 6:44 AM, Dan Gohman wrote:
> >
> > Hi Nadav,
> >
> > I agree with much of your assessment of the proposed SIMD.js API.
> > However, I don't believe its unsuitability for some problems
> > invalidates it for solving other very important problems, which it is
> > well suited for. Performance portability is actually one of SIMD.js'
> > biggest strengths: it's not the kind of performance portability that
> > aims for a consistent percentage of peak on every machine (which, as
> > you note, of course an explicit 128-bit SIMD API won't achieve), it's
> > the kind of performance portability that achieves predictable
> > performance and minimizes surprises across machines (though yes, there
> > are some unavoidable ones, but overall the picture is quite good).
>
> There is a tradeoff between the performance portability of the SIMD.js
> ISA and its usefulness. A small number of instructions (that only targets
> 32-bit data types, no masks, etc.) is not useful for developing
> non-trivial vector programs. You need 16-bit vector elements to support
> WebGL vertex indices, and lane-masking for implementing predicated
> control flow for programs like ray tracers. Introducing a large number of
> vector instructions will expose the performance portability problems. I
> don't believe that there is a sweet spot in this tradeoff. I don't think
> that we can find a small set of instructions that will be useful for
> writing non-trivial vector code that is performance portable.

My belief in the existence of a sweet spot is based on looking at other systems, hardware and software, that have already gone there. For an interesting example, take a look at this page:

https://software.intel.com/en-us/articles/interactive-ray-tracing

Every SIMD operation used in that article is directly supported by a corresponding function in SIMD.js today.
We do have an open question on whether we should do something different for the rsqrt instruction, since the hardware only provides an approximation. In this case the code requires some Newton-Raphson refinement anyway, which may give us some flexibility, and several approaches are possible there. And of course, "sweet spot" doesn't mean cure-all.

Also, I am preparing to propose that SIMD.js handle 16-bit vector elements too (int16x8). It fits pretty naturally into the overall model. There are some challenges on some architectures, but there are challenges with alternative approaches too, and overall the story looks good. Other changes are being discussed as well. In general, the SIMD.js spec is still evolving; participation is welcome :-).

> > This is an example of a weakness of depending on automatic
> > vectorization alone. High-level language features create complications
> > which can lead to surprising performance problems. Compiler
> > transformations to target specialized hardware features often have
> > widely varying applicability. Expensive analyses can sometimes enable
> > more and better vectorization, but when a compiler has to do an
> > expensive complex analysis in order to optimize, it's unlikely that a
> > programmer can count on other compilers doing the exact same analysis
> > and optimizing in all the same cases. This is a problem we already face
> > in many areas of compilers, but it's more pronounced with vectorization
> > than many other optimizations.
>
> I agree with this argument. Compiler optimizations are unpredictable. You
> never know when the register allocator will decide to spill a variable
> inside a hot loop, or a memory operation will confuse the alias analysis.
> I also agree that loop vectorization is especially sensitive.
> However, it looks like the kind of vectorization that is needed to
> replace SIMD.js is a very simple SLP vectorization
> <http://llvm.org/docs/Vectorizers.html#the-slp-vectorizer> (BB
> vectorization).
> It is really easy for a compiler to combine a few scalar arithmetic
> operations into a vector. LLVM's SLP vectorizer supports vectorization of
> computations across basic blocks and succeeds in surprising places, like
> vectorization of STDLIB code where the 'begin' and 'end' iterators fit
> into a 128-bit register!

That's a surprising trick! I agree that SLP vectorization doesn't have the same level of "performance cliff" as loop vectorization, and it may be a desirable thing for JS JITs to start doing. Even so, there is still value in an explicit SIMD API in the present. For the core features, instead of giving developers sets of expression patterns to follow to ensure SLP recognition, we are giving names to those patterns and letting developers identify which
Re: [webkit-dev] SIMD support in JavaScript
Hi Nadav,

I agree with much of your assessment of the proposed SIMD.js API. However, I don't believe its unsuitability for some problems invalidates it for solving other very important problems, which it is well suited for. Performance portability is actually one of SIMD.js' biggest strengths: it's not the kind of performance portability that aims for a consistent percentage of peak on every machine (which, as you note, of course an explicit 128-bit SIMD API won't achieve); it's the kind of performance portability that achieves predictable performance and minimizes surprises across machines (though yes, there are some unavoidable ones, but overall the picture is quite good).

On 09/26/2014 03:16 PM, Nadav Rotem wrote:
> So far, I've explained why I believe SIMD.js will not be
> performance-portable and why it will not utilize modern instruction
> sets, but I have not made a suggestion on how to use vector
> instructions to accelerate JavaScript programs. Vectorization, like
> instruction scheduling and register allocation, is a code-generation
> problem. In order to solve these problems, it is necessary for the
> compiler to have intimate knowledge of the architecture. Forcing the
> compiler to use a specific instruction or a specific data-type is the
> wrong answer. We can learn a lesson from the design of compilers for
> data-parallel languages. GPU programs (shaders and compute languages,
> such as OpenCL and GLSL) are written using vector instructions because
> the domain of the problem requires vectors (colors and coordinates).
> One of the first things that data-parallel compilers do is to break
> vector instructions into scalars (this process is called
> scalarization). After getting rid of the vectors that resulted from
> the problem domain, the compiler may begin to analyze the program,
> calculate profitability, and make use of the available instruction set.
> I believe that it is the responsibility of JIT compilers to use vector
> instructions.
> In the implementation of WebKit's FTL JIT compiler, we took one step in
> the direction of using vector instructions. LLVM already vectorizes some
> code sequences during instruction selection, and we started investigating
> the use of LLVM's Loop and SLP vectorizers. We found that despite nice
> performance gains on a number of workloads, we experienced some
> performance regressions on Intel's Sandy Bridge processors, which is
> currently a very popular desktop processor. JavaScript code contains many
> branches (due to dynamic speculation). Unfortunately, branches on Sandy
> Bridge execute on Port 5, which is also where many vector instructions
> are executed. So, pressure on Port 5 prevented performance gains. The
> LLVM vectorizer currently does not model execution port pressure and we
> had to disable vectorization in FTL. In the future, we intend to enable
> more vectorization features in FTL.

This is an example of a weakness of depending on automatic vectorization alone. High-level language features create complications which can lead to surprising performance problems. Compiler transformations to target specialized hardware features often have widely varying applicability. Expensive analyses can sometimes enable more and better vectorization, but when a compiler has to do an expensive, complex analysis in order to optimize, it's unlikely that a programmer can count on other compilers doing the exact same analysis and optimizing in all the same cases. This is a problem we already face in many areas of compilers, but it's more pronounced with vectorization than many other optimizations.

In contrast, the proposed SIMD.js has the property that code using it does not depend on expensive compiler analysis in the JIT, and so it is much more likely to deliver predictable performance in practice between different JIT implementations and across a very practical variety of hardware architectures.
> To summarize, SIMD.js will not provide a portable performance solution
> because vector instruction sets are sparse and vary between
> architectures and generations. Emscripten should not generate vector
> instructions because it can't model the target machine. SIMD.js will
> not make use of modern SIMD features such as predication or
> scatter/gather. Vectorization is a compiler code generation problem
> that should be solved by JIT compilers, and not by the language
> itself. JIT compilers should continue to evolve and to start
> vectorizing code like modern compilers.

As I mentioned above, performance portability is actually one of SIMD.js's core strengths. I have found it useful to think of the API proposed in SIMD.js as a "short vector" API. It hits a sweet spot: it is a convenient size for many XYZW, RGB/RGBA, and similar algorithms; it is implementable on a wide variety of very relevant hardware architectures; it is long enough to deliver worthwhile speedups for many tasks; and it is short enough to still be convenient to manipulate. I agree that