Taking pipeline processing to the next level
I mostly code like this now:

  data.map!(x => transform(x)).copy(output);

It's convenient and reads nicely, but it's generally inefficient. This sort of one-by-one software design is the core performance problem with OOP. It seems a shame to be suffering OOP's failures even when there is no OOP in sight.

A central premise of performance-oriented programming which I've employed my entire career is "where there is one, there is probably many", and if you do something to one, you should do it to many. With this in mind, the code I have always written doesn't tend to look like this:

  R manipulate(Thing thing);

Instead:

  void manipulateThings(Thing *things, size_t numThings, R *output, size_t outputLen);

Written this way for clarity. Obviously, the D equiv uses slices.

All functions are implemented with the presumption they will operate on many things, rather than being called many times for each one. This is the single most successful design pattern I have ever encountered wrt high-performance code; ie, implement the array version first.

The problem with this API design is that it doesn't plug into algorithms or generic code well:

  data.map!(x => transformThings(&x, 1)).copy(output);

I often wonder how we can integrate this design principle conveniently (ie, seamlessly) into the design of algorithms, such that they can make use of batching functions internally, and transparently?

Has anyone done any work in this area? Ideas?
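The contrast between the two signatures might be sketched like this (C++ for illustration; `Thing`, the doubling transform, and the names are made-up placeholders, not actual library code):

```cpp
#include <cstddef>

// Hypothetical element type and result type, for illustration only.
struct Thing { int value; };
using R = int;

// Element-wise version: one call per item.
R manipulate(Thing thing) { return thing.value * 2; }

// Batch version: one call processes the whole array, so the loop body
// (and any SIMD or cache strategy) lives inside the library function.
void manipulateThings(const Thing* things, std::size_t numThings,
                      R* output, std::size_t outputLen) {
    std::size_t n = numThings < outputLen ? numThings : outputLen;
    for (std::size_t i = 0; i < n; ++i)
        output[i] = things[i].value * 2;
}
```

With the batch version as the primitive, the element-wise version is just a length-1 call, which is exactly the awkward `transformThings(&x, 1)` adapter the post complains about.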
Re: Taking pipeline processing to the next level
On 05/09/2016 5:08 PM, Manu via Digitalmars-d wrote:
> I mostly code like this now:
>   data.map!(x => transform(x)).copy(output);
>
> It's convenient and reads nicely, but it's generally inefficient. This sort of one-by-one software design is the core performance problem with OOP. It seems a shame to be suffering OOP's failures even when there is no OOP in sight.
>
> A central premise of performance-oriented programming which I've employed my entire career, is "where there is one, there is probably many", and if you do something to one, you should do it to many. With this in mind, the code I have always written doesn't tend to look like this:
>   R manipulate(Thing thing);
>
> Instead:
>   void manipulateThings(Thing *things, size_t numThings, R *output, size_t outputLen);
>
> Written this way for clarity. Obviously, the D equiv uses slices. All functions are implemented with the presumption they will operate on many things, rather than being called many times for each one. This is the single most successful design pattern I have ever encountered wrt high-performance code; ie, implement the array version first.
>
> The problem with this API design, is that it doesn't plug into algorithms or generic code well.
>   data.map!(x => transformThings(&x, 1)).copy(output);
>
> I often wonder how we can integrate this design principle conveniently (ie, seamlessly) into the design of algorithms, such that they can make use of batching functions internally, and transparently?
>
> Has anyone done any work in this area? Ideas?

Just a random idea:

  import std.array : front, popFront, empty;
  import std.range : ElementType, isInputRange;

  int[] transformer(I)(I from, int[] buffer)
      if (is(ElementType!I == int) && isInputRange!I)
  {
      size_t used;
      // ... transformation algo
      return buffer[0 .. used];
  }

  auto got = input.transformer(buffer).usage;

An input range instead of a straight array is passed in, so it works on pretty much any input array or input range.
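The buffer-filling idea above could look like this in C++ (iterators standing in for D input ranges; the `+ 1` transformation is a placeholder for a real one):

```cpp
#include <cstddef>

// A caller-supplied buffer is filled and the used prefix returned,
// mirroring the D sketch: the caller owns the storage, the function
// owns the loop.
template <typename It>
std::size_t transformer(It first, It last, int* buffer, std::size_t cap) {
    std::size_t used = 0;
    for (; first != last && used < cap; ++first)
        buffer[used++] = *first + 1;  // stand-in transformation
    return used;  // caller works with buffer[0 .. used)
}
```

The key property is that the transformation runs over a whole block per call, so the caller never pays a per-element function-call cost.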
Re: Taking pipeline processing to the next level
On Mon, Sep 05, 2016 at 03:08:53PM +1000, Manu via Digitalmars-d wrote: > I mostly code like this now: > data.map!(x => transform(x)).copy(output); > > It's convenient and reads nicely, but it's generally inefficient. > This sort of one-by-one software design is the core performance > problem with OOP. It seems a shame to be suffering OOP's failures even > when there is no OOP in sight. In functional languages, where these ideas originated, it's mostly the compiler's job to make things efficient. I'd argue that we should rather be looking at how to improve the compiler. As far as I know, currently even D compilers don't really take full advantage of the range idiom; i.e., their optimizers are written for the general case of arbitrary code, so while you do get some benefits, it may not necessarily translate to overall efficient code. Dmd is particularly bad at optimizing range-based code, mainly because its inliner gives up too easily. Nest the range API calls a level too deep, and the inliner fails to inline it, and as a result a whole series of subsequent optimizations are missed. IME, gdc does a bit better -- it's able to translate simple range compositions into vectorized operations in some cases that I've seen. Yes, I realize that there's some debate about whether auto-vectorization is a good thing or not, but the point is that gdc's optimizer is well able to translate what you call a one-by-one design into something that's closer to block-by-block. So it's not out of the realm of possibility that the compiler could do such things for you. Now imagine if the D frontend is able to recognize certain code patterns characteristic of range-based code, and optimize accordingly. In the front end, you have access to the AST and to higher-level constructs in the language, so it's more likely to be able to reliably detect range-based code, and do something useful with it. 
Conceptually, such an optimizer could see code like this:

  data.map!(x => 2*x + 1)
      .map!(x => f(x) - 1)
      .copy(output);

and produce, as an initial step, something like:

  foreach (i, x; data) {
      output[i] = f(2*x + 1) - 1;
  }

Then this gets passed through the backend optimizer which does loop optimizations like strength reduction, unrolling, etc.. The unrolled loop could then get passed over another optimizer iteration to produce vectorized code, or the equivalent transformation into block-based operations.

There's a lot of optimization potential with range-based code that currently isn't really exploited by existing D compilers. It would be interesting to investigate what optimizations are possible and to implement them in the D front-end.

> A central premise of performance-oriented programming which I've
> employed my entire career, is "where there is one, there is probably
> many", and if you do something to one, you should do it to many. With
> this in mind, the code I have always written doesn't tend to look like
> this:
>   R manipulate(Thing thing);
>
> Instead:
>   void manipulateThings(Thing *things, size_t numThings, R *output,
>   size_t outputLen);
>
> Written this way for clarity. Obviously, the D equiv uses slices.

Actually, isn't this the whole premise of range-based code? Instead of dealing with individual items, you work with a range of items, and not necessarily arrays either. Though for obvious reasons arrays would tend to perform better than the general range.

> All functions are implemented with the presumption they will operate
> on many things, rather than being called many times for each one.
> This is the single most successful design pattern I have ever
> encountered wrt high-performance code; ie, implement the array version
> first.

In D, that would be "implement the range version first" (and then you realize you don't need to implement any other version afterwards, because the range version already covers everything else :-P).
> The problem with this API design, is that it doesn't plug into > algorithms or generic code well. > data.map!(x => transformThings(&x, 1)).copy(output); > > I often wonder how we can integrate this design principle conveniently > (ie, seamlessly) into the design of algorithms, such that they can > make use of batching functions internally, and transparently? > > Has anyone done any work in this area? > > Ideas? One thought that occurs to me is to focus on optimizing code involving .map. Obviously, a naive translation of a UFCS chain involving .map would involve calling the transformation function one item at a time. But it doesn't have to be that way, especially if the function is a lambda or inlineable. Conceivably, the optimizer should first inline the function, then unroll the loop, then realize that further optimizations can be applied across multiple loop iterations to turn it into batched code. Of course, compiler improvements are kinda more in the long term; in the short term, what you might be
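The map-fusion described above, collapsing the two chained maps into a single pass, can be written out by hand (C++ for illustration; `f` is a stand-in for the second stage's helper function):

```cpp
#include <cstddef>

// Stand-in for the helper called by the second map stage.
static int f(int x) { return x * x; }

// One fused pass over the data: output[i] = f(2*x + 1) - 1,
// i.e. both map stages applied in a single loop iteration.
void fused(const int* data, int* output, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        output[i] = f(2 * data[i] + 1) - 1;
}
```

This is the form a backend can then unroll and vectorize; the point of the front-end recognizing range patterns would be to hand the backend exactly this loop instead of a chain of per-element calls.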
Re: Taking pipeline processing to the next level
On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote: I mostly code like this now: data.map!(x => transform(x)).copy(output); It's convenient and reads nicely, but it's generally inefficient. What are the benchmarks and the numbers? What loss are you looking at? -- Andrei
Re: Taking pipeline processing to the next level
Am Mon, 5 Sep 2016 10:21:53 +0200 schrieb Andrei Alexandrescu :
> On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:
> > I mostly code like this now:
> >   data.map!(x => transform(x)).copy(output);
> >
> > It's convenient and reads nicely, but it's generally inefficient.
>
> What are the benchmarks and the numbers? What loss are you looking
> at? -- Andrei

As Manu posted this question (and he's working on a color/image library), it's not hard to guess one problem is SIMD/vectorization. E.g. if transform(x) => x + 2, it is faster to perform 1 SIMD operation on 4 values instead of 4 individual adds.

As autovectorization is not very powerful in current compilers, I can easily imagine that complex range-based examples can't compete with hand-written SIMD loops.

@Manu: Have you had a look at std.parallelism? I think it has some kind of parallel map which could provide some inspiration?
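The 4-at-a-time shape described above can be sketched in plain C++ (no intrinsics here; the 4-wide inner block is exactly what a single SIMD add would replace):

```cpp
#include <cstddef>

// transform(x) => x + 2, with the main loop handling four elements
// per iteration -- the pattern one 4-wide SIMD add covers -- plus a
// scalar tail for leftovers.
void addTwo(const int* in, int* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // one "vector" step
        out[i]     = in[i]     + 2;
        out[i + 1] = in[i + 1] + 2;
        out[i + 2] = in[i + 2] + 2;
        out[i + 3] = in[i + 3] + 2;
    }
    for (; i < n; ++i)            // scalar tail
        out[i] = in[i] + 2;
}
```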
Re: Taking pipeline processing to the next level
On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:
> A central premise of performance-oriented programming which I've employed my entire career, is "where there is one, there is probably many", and if you do something to one, you should do it to many. With this in mind, the code I have always written doesn't tend to look like this:
>   R manipulate(Thing thing);
>
> Instead:
>   void manipulateThings(Thing *things, size_t numThings, R *output, size_t outputLen);
>
> Written this way for clarity. Obviously, the D equiv uses slices.

To chime in, processing in chunks is indeed absolutely key for audio performance. Processing by element is only tolerable for prototyping. I don't even use slices, but rather pointers and a length, since most of the slice lengths would be the same:

  void processSamples(float* input, float* output, int samples);

Some kind of buffered input and output range concept would be needed if a lazy library were to be made. Unfortunately, for this case inputs and outputs are rather diverse. Can't wait for a more formalized DbI introduction.

As for graphics, ae.utils.graphics usually gives you a whole line of pixels.
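A minimal sketch of that pointers-plus-shared-length shape (the fixed 0.5 gain body is a stand-in for real processing; this loop body is where hand-written SIMD would live):

```cpp
// Audio-style chunked processing: one call handles a whole block of
// samples. The body applies a placeholder 0.5 gain.
void processSamples(const float* input, float* output, int samples) {
    for (int i = 0; i < samples; ++i)
        output[i] = input[i] * 0.5f;
}
```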
Re: Taking pipeline processing to the next level
On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote: What are the benchmarks and the numbers? What loss are you looking at? -- Andrei Just looking at the example, and referencing the map code in std.algorithm.iteration, I can see multiple function calls instead of one thanks to every indexing of the new map doing a transformation instead of caching it. I'm not sure if the lambda declaration there will result in the argument being taken by ref or by value, but let's assume by value for the sake of argument. Depending on if it's taking by value a reference or a value type, that could either be a cheap function call or an expensive one. But even if it took it by reference, it's still a function call. Function calls are generally The Devil(TM) in a gaming environment. The less you can make, the better. Random aside: There are streaming store instructions available to me on x86 platforms so that I don't have to wait for the destination to hit L1 cache before writing. The pattern Manu talks about with a batching function can better exploit this. But I imagine copy could also take advantage of this when dealing with value types.
Re: Taking pipeline processing to the next level
On 9/5/16 1:43 PM, Ethan Watson wrote: On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote: What are the benchmarks and the numbers? What loss are you looking at? -- Andrei Just looking at the example, and referencing the map code in std.algorithm.iteration, I can see multiple function calls instead of one thanks to every indexing of the new map doing a transformation instead of caching it. I'm not sure if the lambda declaration there will result in the argument being taken by ref or by value, but let's assume by value for the sake of argument. Depending on if it's taking by value a reference or a value type, that could either be a cheap function call or an expensive one. But even if it took it by reference, it's still a function call. Function calls are generally The Devil(TM) in a gaming environment. The less you can make, the better. Random aside: There are streaming store instructions available to me on x86 platforms so that I don't have to wait for the destination to hit L1 cache before writing. The pattern Manu talks about with a batching function can better exploit this. But I imagine copy could also take advantage of this when dealing with value types. Understood. Would explicitly asking for vectorized operations be acceptable? One school of thought has it that explicit invocation of parallel operations are preferable to autovectorization and its ilk. -- Andrei
Re: Taking pipeline processing to the next level
On 5 September 2016 at 16:21, H. S. Teoh via Digitalmars-d wrote: > On Mon, Sep 05, 2016 at 03:08:53PM +1000, Manu via Digitalmars-d wrote: >> I mostly code like this now: >> data.map!(x => transform(x)).copy(output); >> >> It's convenient and reads nicely, but it's generally inefficient. >> This sort of one-by-one software design is the core performance >> problem with OOP. It seems a shame to be suffering OOP's failures even >> when there is no OOP in sight. > > In functional languages, where these ideas originated, it's mostly the > compiler's job to make things efficient. I'd argue that we should > rather be looking at how to improve the compiler. This is also the key reason that functional languages always seem to perform significantly slower than imperative languages. It might be possible, but when will it happen? How reliable is it if it does work? Does it make competing compilers impossible to implement? > As far as I know, > currently even D compilers don't really take full advantage of the > range idiom; i.e., their optimizers are written for the general case of > arbitrary code, so while you do get some benefits, it may not > necessarily translate to overall efficient code. Dmd is particularly > bad at optimizing range-based code, mainly because its inliner gives up > too easily. Nest the range API calls a level too deep, and the inliner > fails to inline it, and as a result a whole series of subsequent > optimizations are missed. Word. > IME, gdc does a bit better -- it's able to translate simple range > compositions into vectorized operations in some cases that I've seen. > Yes, I realize that there's some debate about whether auto-vectorization > is a good thing or not, but the point is that gdc's optimizer is well > able to translate what you call a one-by-one design into something > that's closer to block-by-block. So it's not out of the realm of > possibility that the compiler could do such things for you. 
Auto-vectorisation is a great optimisation, but it's not reliable. If you want to write reliably fast code, you need to do it manually, and that tends to mean implementing the hot loop by hand. It never hurts to hand-craft the hottest loops in your program; these are the exact sorts of functions I'm talking about.

> Now imagine if the D frontend is able to recognize certain code patterns
> characteristic of range-based code, and optimize accordingly. In the
> front end, you have access to the AST and to higher-level constructs in
> the language, so it's more likely to be able to reliably detect
> range-based code, and do something useful with it. Conceptually, such
> an optimizer could see code like this:
>
> data.map!(x => 2*x + 1)
> .map!(x => f(x) - 1)
> .copy(output);
>
> and produce, as an initial step, something like:
>
> foreach (i, x; data) {
> output[i] = f(2*x + 1) - 1;
> }
>
> Then this gets passed through the backend optimizer which does loop
> optimizations like strength reduction, unrolling, etc.. The unrolled
> loop could then get passed over another optimizer iteration to produce
> vectorized code, or the equivalent transformation into block-based
> operations.
>
> There's a lot of optimization potential with range-based code that
> currently isn't really exploited by existing D compilers. It would be
> interesting to investigate what optimizations are possible and to
> implement them in the D front-end.

I agree, there are heaps of optimisations that could be done, but there are also a lot of optimisations that can not be performed automatically. There are many cases where the user will need to hand-write the `f(T[] data, T[] dest)` version, and we need to find a way that such a function can work its way into these pipeline code constructs naturally.
>> A central premise of performance-oriented programming which I've
>> employed my entire career, is "where there is one, there is probably
>> many", and if you do something to one, you should do it to many. With
>> this in mind, the code I have always written doesn't tend to look like
>> this:
>>   R manipulate(Thing thing);
>>
>> Instead:
>>   void manipulateThings(Thing *things, size_t numThings, R *output,
>>   size_t outputLen);
>>
>> Written this way for clarity. Obviously, the D equiv uses slices.
>
> Actually, isn't this the whole premise of range-based code? Instead of
> dealing with individual items, you work with a range of items, and not
> necessarily arrays either. Though for obvious reasons arrays would tend
> to perform better than the general range.

Absolutely, but we're not taking advantage of that at all yet. All expressions are element-wise, and that alone inhibits many possibilities for writing efficient code. Again, we need to find a way to comfortably work those batch functions into the pipeline. It needs to be just as valid to write a batch version and have it work in the same place as an element-wise version, and the s
Re: Taking pipeline processing to the next level
On 5 September 2016 at 18:21, Andrei Alexandrescu via Digitalmars-d wrote:
> On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:
>> I mostly code like this now:
>>   data.map!(x => transform(x)).copy(output);
>>
>> It's convenient and reads nicely, but it's generally inefficient.
>
> What are the benchmarks and the numbers? What loss are you looking at? --
> Andrei

Well, it totally depends. Like right now in my case, 'transform' is some image processing code (in the past when I've had these same thoughts, it has been audio filters). You can't touch pixels (or samples) one at a time. They need manual SIMD deployment (I've never seen an auto-vectoriser handle saturation arithmetic, or type promotion), alpha components (every 4th byte) are treated differently, and memory access patterns need to be tuned to be cache friendly.

I haven't done benchmarks right now, but I've done them professionally in the past, and it's not unusual to expect a hand-written image processing loop to see 1 or even 2 orders of magnitude improvement, compared to calling a function for each pixel in a loop. The sorts of low-level optimisations you deploy in image and audio processing loops are not things I've ever seen any optimiser even attempt.

Some core problems that tend to require manual intervention in hot loops are:
  - ubyte[16] <-> ushort[8][2] expansion/contraction
  - ubyte[16] <-> float[4][4] expansion/contraction
  - saturation
  - scalar operator results promote to int, but wide-simd operations don't, which means some scalar expressions can't be losslessly collapsed to simd operations, and the compiler will always be conservative on this matter. If the auto-vectoriser tries at all, you will see a mountain of extra code to preserve those bits that the scalar operator semantics would have guaranteed
  - wide-vector multiplication is semantically different than scalar multiplication, so the optimiser has a lot of trouble vectorising muls
  - assumptions about data alignment
  - interleaved data; audio samples are usually [L,R] interleaved, images often [RGB,A], and different processes are applied across the separation; you want to unroll and shuffle the data so you have vectors [LLLL],[RRRR], or [RGBRGBRGBRGB],[AAAA], and I haven't seen an optimiser go near that
  - vector dot-product is always a nuisance

I could go on and on. The point is, as an end-user, pipeline APIs are great. As a library author, I want to present the best performing library I can, which I think means we need to find a way to conveniently connect these 2 currently disconnected worlds. I've explored to some extent, but I've never come up with anything that I like.
Re: Taking pipeline processing to the next level
On 5 September 2016 at 20:32, Johannes Pfau via Digitalmars-d wrote:
> Am Mon, 5 Sep 2016 10:21:53 +0200
> schrieb Andrei Alexandrescu :
>
>> On 9/5/16 7:08 AM, Manu via Digitalmars-d wrote:
>> > I mostly code like this now:
>> >   data.map!(x => transform(x)).copy(output);
>> >
>> > It's convenient and reads nicely, but it's generally inefficient.
>>
>> What are the benchmarks and the numbers? What loss are you looking
>> at? -- Andrei
>
> As Manu posted this question (and he's working on a color/image library)
> it's not hard to guess one problem is SIMD/vectorization. E.g if
> transform(x) => x + 2; It is faster to perfom 1 SIMD operation on 4
> values instead of 4 individual adds.
>
> As autovectorization is not very powerful in current compilers I can
> easily imagine that complex range based examples can't compete with
> hand-written SIMD loops.
>
> @Manu: Have you had a look at std.parallelism? I think it has some kind
> of parallel map which could provide some inspiration?

I have, but even just chunks() and joiner() can do the trick to a reasonable extent, though it's still not great. It's definitely not where I'd like it to be. End-users won't manually deploy these strategies correctly (or at all); I'd like to see a design that enables more automatic deployment of batch processing. I treat the end-user like a javascript user; they shouldn't need to do hard work to make proper use of a lib. That's a poor API offering on the part of the lib author.
Re: Taking pipeline processing to the next level
On 5 September 2016 at 23:38, Andrei Alexandrescu via Digitalmars-d wrote: > On 9/5/16 1:43 PM, Ethan Watson wrote: >> >> On Monday, 5 September 2016 at 08:21:53 UTC, Andrei Alexandrescu wrote: >>> >>> What are the benchmarks and the numbers? What loss are you looking at? >>> -- Andrei >> >> >> Just looking at the example, and referencing the map code in >> std.algorithm.iteration, I can see multiple function calls instead of >> one thanks to every indexing of the new map doing a transformation >> instead of caching it. I'm not sure if the lambda declaration there will >> result in the argument being taken by ref or by value, but let's assume >> by value for the sake of argument. Depending on if it's taking by value >> a reference or a value type, that could either be a cheap function call >> or an expensive one. >> >> But even if it took it by reference, it's still a function call. >> Function calls are generally The Devil(TM) in a gaming environment. The >> less you can make, the better. >> >> Random aside: There are streaming store instructions available to me on >> x86 platforms so that I don't have to wait for the destination to hit L1 >> cache before writing. The pattern Manu talks about with a batching >> function can better exploit this. But I imagine copy could also take >> advantage of this when dealing with value types. > > > Understood. Would explicitly asking for vectorized operations be acceptable? > One school of thought has it that explicit invocation of parallel operations > are preferable to autovectorization and its ilk. -- Andrei I still stand by this, and I listed some reasons above. Auto-vectorisation is a nice opportunistic optimisation, but it can't be relied on. The key reason is that scalar arithmetic semantics are different than vector semantics, and auto-vectorisation tends to produce a whole bunch of extra junk code to carefully (usually pointlessly) preserve the scalar semantics that it's trying to vectorise. This will never end well. 
But the vectorisation isn't the interesting problem here, I'm really just interested in how to work these batch-processing functions into our nice modern pipeline statements without placing an unreasonable burden on the end-user, who shouldn't be expected to go out of their way. If they even have to start manually chunking, I think we've already lost; they won't know optimal chunk-sizes, or anything about alignment boundaries, cache, etc.
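One shape the "no manual chunking" goal could take: a wrapper that owns the chunk size and feeds the batch function whole blocks, so the end-user never sees a chunk (a sketch; `batchNegate` and the chunk size of 64 are invented, and a real library would tune the size per function and handle alignment):

```cpp
#include <algorithm>
#include <cstddef>

// The library picks the chunk size, not the user.
constexpr std::size_t CHUNK = 64;  // made-up number

// A batch function in the style discussed: one call, many elements.
void batchNegate(const int* in, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = -in[i];
}

// The user-facing entry point: walks the input in CHUNK-sized blocks
// and hands each whole block to the batch function, remainder included.
void mapInChunks(const int* in, int* out, std::size_t n) {
    for (std::size_t off = 0; off < n; off += CHUNK)
        batchNegate(in + off, out + off, std::min(CHUNK, n - off));
}
```

The point of the wrapper is exactly what the post asks for: batching happens internally and transparently, and the caller's code looks element-agnostic.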
Re: Taking pipeline processing to the next level
On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote:
> I mostly code like this now:
>   data.map!(x => transform(x)).copy(output);

So you basically want to make the lazy computation eager and store the result?

  data.map!(x => transform(x)).array

will allocate a new array and fill it with the result of map. And if you want to recycle the buffer, I guess writing a buffer function would be trivial.
Re: Taking pipeline processing to the next level
On Tuesday, 6 September 2016 at 03:08:43 UTC, Manu wrote:
> I still stand by this, and I listed some reasons above. Auto-vectorisation is a nice opportunistic optimisation, but it can't be relied on. The key reason is that scalar arithmetic semantics are different than vector semantics, and auto-vectorisation tends to produce a whole bunch of extra junk code to carefully (usually pointlessly) preserve the scalar semantics that it's trying to vectorise. This will never end well.
>
> But the vectorisation isn't the interesting problem here, I'm really just interested in how to work these batch-processing functions into our nice modern pipeline statements without placing an unreasonable burden on the end-user, who shouldn't be expected to go out of their way. If they even have to start manually chunking, I think we've already lost; they won't know optimal chunk-sizes, or anything about alignment boundaries, cache, etc.

In a previous job I successfully created a small C++ library to perform pipelined SIMD image processing. Not sure how relevant it is, but I thought I'd share the design here; perhaps it'll give you guys some ideas.

Basically, the users of this library only need to write simple kernel classes, something like this:

  // A kernel that processes 4 pixels at a time
  struct MySimpleKernel : Kernel<4>
  {
      // Tell the library the input and output types
      using InputVector = Vector<__m128, 1>;
      using OutputVector = Vector<__m128, 2>;

      template <typename T>
      OutputVector apply(const T& src)
      {
          // T will be deduced to Vector<__m128, 1>,
          // which is an array of one __m128 element.
          // Awesome SIMD code goes here...
          // And return the output vector
          return OutputVector(...);
      }
  };

Of course, the InputVector and OutputVector do not have to be __m128; they can be other types like int or float. The cool thing is that kernels can be chained together with >> operators. So assume we have another kernel:

  struct AnotherKernel : Kernel<3>
  {
      ...
  };

Then we can create a processing pipeline with these 2 kernels:

  InputBuffer(...) >> MySimpleKernel() >> AnotherKernel() >> OutputBuffer(...);

Then some template magic will figure out that the LCM of the 2 kernels' pixel widths is 3*4=12, and therefore they are fused together into a composite kernel of pixel width 12. The above line compiles down into a single function invocation, with a main loop that reads the source buffer in 4-pixel steps, calls MySimpleKernel 3 times, then calls AnotherKernel 4 times.

Any number of kernels can be chained together in this way, as long as your compiler doesn't explode. At the time, my benchmarks showed pipelines generated in this way often rival the speed of hand-tuned loops.
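The fusion arithmetic behind this design can be sketched in plain C++ (kernel bodies and the in-place int data are stand-ins; a real library would carry pixel types and kernel state through templates):

```cpp
#include <cstddef>

// Widths of the two kernels; their least common multiple, 12, becomes
// the composite kernel's width: kernel A runs 12/4 = 3 times per main
// loop step, kernel B runs 12/3 = 4 times.
constexpr std::size_t WidthA = 4, WidthB = 3;
constexpr std::size_t Fused = 12;  // lcm(WidthA, WidthB)

// Placeholder kernels operating in place on int "pixels".
void kernelA(int* px) { for (std::size_t i = 0; i < WidthA; ++i) px[i] += 1; }
void kernelB(int* px) { for (std::size_t i = 0; i < WidthB; ++i) px[i] *= 2; }

// The fused main loop: advances Fused pixels per iteration
// (n assumed to be a multiple of Fused; tails are handled separately).
void runFused(int* data, std::size_t n) {
    for (std::size_t off = 0; off + Fused <= n; off += Fused) {
        for (std::size_t k = 0; k < Fused / WidthA; ++k)  // 3 calls
            kernelA(data + off + k * WidthA);
        for (std::size_t k = 0; k < Fused / WidthB; ++k)  // 4 calls
            kernelB(data + off + k * WidthB);
    }
}
```

In the real library this whole structure is assembled at compile time from the `>>` chain, so every kernel call inlines into one flat loop body.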
Re: Taking pipeline processing to the next level
On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote: Then some template magic will figure out the LCM of the 2 kernels' pixel width is 3*4=12 and therefore they are fused together into a composite kernel of pixel width 12. The above line compiles down into a single function invokation, with a main loop that reads the source buffer in 4 pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times. Correction: with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
Re: Taking pipeline processing to the next level
On Tuesday, 6 September 2016 at 14:26:22 UTC, finalpatch wrote: On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote: Then some template magic will figure out the LCM of the 2 kernels' pixel width is 3*4=12 and therefore they are fused together into a composite kernel of pixel width 12. The above line compiles down into a single function invokation, with a main loop that reads the source buffer in 4 pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times. Correction: with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times. And of course the key to the speed is all function calls get inlined by the compiler.
Re: Taking pipeline processing to the next level
On 7 September 2016 at 00:26, finalpatch via Digitalmars-d wrote:
> On Tuesday, 6 September 2016 at 14:21:01 UTC, finalpatch wrote:
>> Then some template magic will figure out the LCM of the 2 kernels' pixel
>> width is 3*4=12 and therefore they are fused together into a composite
>> kernel of pixel width 12. The above line compiles down into a single
>> function invokation, with a main loop that reads the source buffer in 4
>> pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
>
> Correction:
> with a main loop that reads the source buffer in *12* pixels step, call
> MySimpleKernel 3 times, then call AnotherKernel 4 times.

Interesting thoughts. What did you do when the buffers weren't a multiple of the kernel width?
Re: Taking pipeline processing to the next level
On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote: A central premise of performance-oriented programming which I've employed my entire career, is "where there is one, there is probably many", and if you do something to one, you should do it to many. From a conceptual standpoint, this sounds like the sort of thing array languages like APL and J thrive on, so there's solid precedent for the concept. I might suggest looking into optimising compilers in that space for inspiration and such; APEX, for example: http://www.snakeisland.com/apexup.htm Of course, this comes with the caveat that this is (still!) some relatively heavily-academic stuff. And I'm not sure to what extent that can help mitigate the problem of relaxing type requirements such that you can e.g. efficiently ,/⍉ your 4 2⍴"LR" vector for SIMD on modern processors. -Wyatt
Re: Taking pipeline processing to the next level
On Tuesday, 6 September 2016 at 14:47:21 UTC, Manu wrote:
>> with a main loop that reads the source buffer in *12* pixels step, call MySimpleKernel 3 times, then call AnotherKernel 4 times.
>
> It's interesting thoughts. What did you do when buffers weren't multiple of the kernels?

The end of a scan line is special-cased. If I need 12 pixels for the last iteration but there are only 8 left, an instance of Kernel::InputVector is allocated on the stack, the 8 remaining pixels are memcpy'd into it, and it is then sent to the kernel. Output from the kernel is also assigned to a stack variable first, and then 8 pixels are memcpy'd to the output buffer.
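That tail-handling trick might be sketched like so (a width-12 kernel over ints; the `+ 1` kernel body is a placeholder, and a real version would separate input and output buffers):

```cpp
#include <cstddef>
#include <cstring>

constexpr std::size_t Width = 12;

// Placeholder full-width kernel working in place.
void kernel(int* px) { for (std::size_t i = 0; i < Width; ++i) px[i] += 1; }

// Full-width steps for the body of the line; for the tail (e.g. 8
// pixels left), copy into a stack buffer, run the kernel at full
// width, and copy only the valid prefix back.
void processLine(int* px, std::size_t n) {
    std::size_t off = 0;
    for (; off + Width <= n; off += Width)
        kernel(px + off);
    if (off < n) {
        int tmp[Width] = {};                              // stack buffer
        std::memcpy(tmp, px + off, (n - off) * sizeof(int));
        kernel(tmp);
        std::memcpy(px + off, tmp, (n - off) * sizeof(int));
    }
}
```

The cost of the two memcpys is paid once per scan line, not per pixel, so the fast full-width path dominates.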
Re: Taking pipeline processing to the next level
On 7 September 2016 at 01:54, Wyatt via Digitalmars-d wrote: > On Monday, 5 September 2016 at 05:08:53 UTC, Manu wrote: >> >> >> A central premise of performance-oriented programming which I've >> employed my entire career, is "where there is one, there is probably >> many", and if you do something to one, you should do it to many. > > > From a conceptual standpoint, this sounds like the sort of thing array > languages like APL and J thrive on, so there's solid precedent for the > concept. I might suggest looking into optimising compilers in that space > for inspiration and such; APEX, for example: > http://www.snakeisland.com/apexup.htm Thanks, that's really interesting, I'll check it out. > Of course, this comes with the caveat that this is (still!) some relatively > heavily-academic stuff. And I'm not sure to what extent that can help > mitigate the problem of relaxing type requirements such that you can e.g. > efficiently ,/⍉ your 4 2⍴"LR" vector for SIMD on modern processors. That's not what I want though. I intend to hand-write that function (I was just giving examples of how auto-vectorisation almost always fails), the question here is, how to work that new array function into our pipelines transparently...
Re: Taking pipeline processing to the next level
On 7 September 2016 at 07:11, finalpatch via Digitalmars-d wrote: > On Tuesday, 6 September 2016 at 14:47:21 UTC, Manu wrote: > >>> with a main loop that reads the source buffer in *12*-pixel steps, calls >>> MySimpleKernel 3 times, then calls AnotherKernel 4 times. >> >> >> Interesting thoughts. What did you do when the buffer size wasn't a >> multiple of the kernel width? > > > The end of a scan line is special-cased. If I need 12 pixels for the last > iteration but there are only 8 left, an instance of Kernel::InputVector is > allocated on the stack, the 8 remaining pixels are memcpy'd into it and sent to the > kernel. Output from the kernel is likewise assigned to a stack variable first, and > then 8 pixels are memcpy'd to the output buffer. Right, and this is a classic problem with this sort of function; it is only more efficient if numElements is sufficiently large. See, I often wonder if it would be worth being able to provide both functions, a scalar and an array version, and have the algorithms select between them intelligently.
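A minimal C++ sketch of that idea (the names, the threshold, and the trivial kernel are all invented for illustration): the operation exists in both scalar and batched form, and a wrapper picks between them based on the element count.

```cpp
#include <cstddef>

// Hypothetical operation provided in both forms, as suggested above.
inline int scale1(int x) { return x * 3; }           // scalar version

void scaleN(const int* in, int* out, std::size_t n)  // array version
{
    // imagine a hand-written SIMD loop here; a plain loop stands in
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * 3;
}

// The algorithm selects "intelligently": below some threshold the
// batching setup cost isn't worth it, so take the scalar path.
void scaleSmart(const int* in, int* out, std::size_t n) {
    constexpr std::size_t kBatchThreshold = 16;  // assumed tuning constant
    if (n < kBatchThreshold)
        for (std::size_t i = 0; i < n; ++i) out[i] = scale1(in[i]);
    else
        scaleN(in, out, n);
}
```

The open question in the thread is precisely how to get this selection to happen inside generic algorithms like map!() rather than in a hand-written wrapper per operation.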
Re: Taking pipeline processing to the next level
On Wednesday, 7 September 2016 at 00:21:23 UTC, Manu wrote: The end of a scan line is special-cased. If I need 12 pixels for the last iteration but there are only 8 left, an instance of Kernel::InputVector is allocated on the stack, the 8 remaining pixels are memcpy'd into it and sent to the kernel. Output from the kernel is likewise assigned to a stack variable first, and then 8 pixels are memcpy'd to the output buffer. Right, and this is a classic problem with this sort of function; it is only more efficient if numElements is sufficiently large. See, I often wonder if it would be worth being able to provide both functions, a scalar and an array version, and have the algorithms select between them intelligently. We normally process full-HD or higher resolution images, so the overhead of copying on the last iteration was negligible. It was fairly easy to put together a scalar version, as they are much easier to write than the SIMD ones. In fact I had a scalar version for every SIMD kernel, and used them for unit testing. It shouldn't be hard to have the framework look at the buffer size and choose the scalar version when the number of elements is small; it wasn't done that way simply because we didn't need it.
Re: Taking pipeline processing to the next level
On 7 September 2016 at 11:04, finalpatch via Digitalmars-d wrote: > > It shouldn't be hard to have the framework look at the buffer size and > choose the scalar version when the number of elements is small; it wasn't done > that way simply because we didn't need it. No, what's hard is working this into D's pipeline patterns seamlessly.
Re: Taking pipeline processing to the next level
On Wednesday, 7 September 2016 at 01:38:47 UTC, Manu wrote: On 7 September 2016 at 11:04, finalpatch via Digitalmars-d wrote: It shouldn't be hard to have the framework look at the buffer size and choose the scalar version when the number of elements is small; it wasn't done that way simply because we didn't need it. No, what's hard is working this into D's pipeline patterns seamlessly. The lesson I learned from this is that you need the user code to provide a lot of extra information about the algorithm at compile time for the templates to work out a way to fuse pipeline stages together efficiently. I believe it is possible to get something similar in D, because D has more powerful templates than C++, and D also has some type introspection which C++ lacks. Unfortunately I'm not as good at D, so I can only provide some ideas rather than actual working code. Once this problem is solved, the benefit is huge. It allowed me to perform high-level optimizations (streaming load/save, prefetching, dynamic dispatch depending on data alignment, etc.) in the main loop, which automatically benefits all kernels and pipelines.
Re: Taking pipeline processing to the next level
On 7 September 2016 at 12:00, finalpatch via Digitalmars-d wrote: > On Wednesday, 7 September 2016 at 01:38:47 UTC, Manu wrote: >> >> On 7 September 2016 at 11:04, finalpatch via Digitalmars-d >> wrote: >>> >>> >>> It shouldn't be hard to have the framework look at the buffer size and >>> choose the scalar version when number of elements are small, it wasn't done >>> that way simply because we didn't need it. >> >> >> No, what's hard is working this into D's pipeline patterns seamlessly. > > > The lesson I learned from this is that you need the user code to provide a > lot of extra information about the algorithm at compile time for the > templates to work out a way to fuse pipeline stages together efficiently. > > I believe it is possible to get something similar in D because D has more > powerful templates than C++ and D also has some type introspection which C++ > lacks. Unfortunately I'm not as good on D so I can only provide some ideas > rather than actual working code. > > Once this problem is solved, the benefit is huge. It allowed me to perform > high level optimizations (streaming load/save, prefetching, dynamic > dispatching depending on data alignment etc.) in the main loop which > automatically benefits all kernels and pipelines. Exactly!
Re: Taking pipeline processing to the next level
On Wednesday, 7 September 2016 at 02:09:17 UTC, Manu wrote: The lesson I learned from this is that you need the user code to provide a lot of extra information about the algorithm at compile time for the templates to work out a way to fuse pipeline stages together efficiently. I believe it is possible to get something similar in D because D has more powerful templates than C++ and D also has some type introspection which C++ lacks. Unfortunately I'm not as good on D so I can only provide some ideas rather than actual working code. Once this problem is solved, the benefit is huge. It allowed me to perform high level optimizations (streaming load/save, prefetching, dynamic dispatching depending on data alignment etc.) in the main loop which automatically benefits all kernels and pipelines. Exactly! I think the problem here is twofold. First question: how do we combine pipeline stages with minimal overhead? I think the key to this problem is reliable *forceinline*. For example, take a pipeline like this: input.map!(x=>x.f1().f2().f3().store(output)); If we could make sure f1(), f2(), f3(), store(), and map() itself are all inlined, then we end up with a single loop with no function calls, and the compiler is free to perform cross-function optimizations. This is about as good as you can get. Unfortunately, at the moment I hear it's difficult to make sure D functions get inlined. Second question: how do we combine SIMD pipeline stages with minimal overhead? Besides reliable inlining, we also need some template code to repeat stages until their strides match. This requires details about each stage's logical unit size and input/output type and size at compile time. I can't think of what the interface for this would look like, but the current map!() is likely insufficient to support it. I still don't believe auto-selecting between the scalar and vector paths would be a very useful feature. Normally I would only consider a SIMD solution when I know in advance that this is a performance hotspot. When the amount of data is small I simply don't care about performance and would just choose whatever the simplest way to do it is, like map!(), because the performance impact is not noticeable and definitely not worth the increased complexity.
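The stride-matching part can be sketched in C++ (the language finalpatch's framework is written in). The kernel names, widths, and bodies below are invented, but the shape shows how the LCM of two compile-time widths yields the fused loop's step:

```cpp
#include <cstddef>

// Compile-time gcd/lcm (C++11-style recursive constexpr).
constexpr std::size_t gcd_(std::size_t a, std::size_t b) { return b ? gcd_(b, a % b) : a; }
constexpr std::size_t lcm_(std::size_t a, std::size_t b) { return a / gcd_(a, b) * b; }

// Each kernel advertises its pixel width; the fused loop advances by
// the LCM, invoking each kernel enough times to cover one fused block.
template <typename K1, typename K2>
void runFused(const float* in, float* out, std::size_t n) {
    constexpr std::size_t step = lcm_(K1::width, K2::width);
    float tmp[step];  // intermediate buffer between the two stages
    for (std::size_t i = 0; i + step <= n; i += step) {
        for (std::size_t j = 0; j < step; j += K1::width)
            K1::apply(in + i + j, tmp + j);   // width 3 -> 4 calls per block
        for (std::size_t j = 0; j < step; j += K2::width)
            K2::apply(tmp + j, out + i + j);  // width 4 -> 3 calls per block
    }
}

// Two toy kernels with mismatched widths (3 and 4), so step = 12.
struct AddOne {
    static constexpr std::size_t width = 3;
    static void apply(const float* a, float* b) { for (int i = 0; i < 3; ++i) b[i] = a[i] + 1; }
};
struct Doubler {
    static constexpr std::size_t width = 4;
    static void apply(const float* a, float* b) { for (int i = 0; i < 4; ++i) b[i] = a[i] * 2; }
};
```

In D, the same information (width, input/output types) could presumably be exposed via members or templates and consumed with static introspection; the open question is what the range-pipeline interface for it should look like.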
Re: Taking pipeline processing to the next level
On Wednesday, 7 September 2016 at 10:31:13 UTC, finalpatch wrote: I think the problem here is two fold. First question, how do we combine pipeline stages with minimal overhead I think the key to this problem is reliable *forceinline* for example, a pipeline like this input.map!(x=>x.f1().f2().f3().store(output)); if we could make sure f1(), f2(), f3(), store(), and map() itself are all inlined, then we end up with a single loop with no function calls and the compiler is free to perform cross function optimizations. This is about as good as you can get. Unfortunately at the moment I hear it's difficult to make sure D functions get inlined. If the compiler is unable to inline (or wrongly decides it is too costly), I'd consider this a compiler bug. Of course, workarounds like `pragma(inline, true)` or `@forceinline` might be needed from time to time in practice, but they shouldn't influence the design of the pipeline interface. Second question, how do we combine SIMD pipeline stages with minimal overhead Besides reliable inlining, we also need some template code to repeat stages until their strides match. This requires details about each stage's logical unit size, input/output type and size at compile time. I can't think of what the interface of this would look like but the current map!() is likely insufficient to support this. Would a `vectorize` range adapter be feasible that prepares the input to make it SIMD-compatible? That is, force alignment, process left-over elements at the end, etc.? As far as I understand, the problems with auto-vectorization stem from compilers' difficulty in recognizing vectorization opportunities, and (as Manu described) from incompatible semantics of scalar and vector types that the compiler needs to preserve. 
But if that hypothetical `vectorize` helper forces the input data into one of a handful of well-known formats and types, wouldn't it be possible to make the compilers recognize those (especially if they are accompanied by suitable pragma or other compiler hints)? I still don't believe auto-select between scalar or vector paths would be a very useful feature. Normally I would only consider SIMD solution when I know in advance that this is a performance hotspot. When the amount of data is small I simply don't care about performance and would just choose whatever simplest way to do it, like map!(), because the performance impact is not noticeable and definitely not worth the increased complexity. In the above scenario, you can add `.vectorize` to the pipeline to enable vectorizing wherever you need it.
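As a rough sketch of what such a hypothetical `vectorize` adapter might do under the hood (C++ used for illustration; the name `vectorized`, the 4-wide chunk, and the 16-byte alignment target are all assumptions): peel off an unaligned head, hand the aligned middle to a chunked path, and finish the tail element-wise.

```cpp
#include <cstddef>
#include <cstdint>

// scalar: applied to single elements (head and tail);
// chunk:  applied to an aligned group of 4 floats (the SIMD-able middle).
template <typename ScalarFn, typename ChunkFn>
void vectorized(float* data, std::size_t n, ScalarFn scalar, ChunkFn chunk) {
    constexpr std::size_t W = 4;  // assumed SIMD width in floats
    std::size_t i = 0;
    // head: advance element-wise until the pointer is 16-byte aligned
    while (i < n && reinterpret_cast<std::uintptr_t>(data + i) % 16 != 0) {
        scalar(data[i]);
        ++i;
    }
    // middle: whole aligned chunks -- where the real SIMD loop would go
    for (; i + W <= n; i += W)
        chunk(data + i);
    // tail: leftover elements, scalar again
    for (; i < n; ++i)
        scalar(data[i]);
}
```

The user supplies both a scalar and a chunked callable, which is essentially the "provide both versions" idea from earlier in the thread expressed as an adapter.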
Re: Taking pipeline processing to the next level
On Wednesday, 7 September 2016 at 11:53:00 UTC, Marc Schütz wrote: Would a `vectorize` range adapter be feasible that prepares the input to make it SIMD compatible? That is, force alignment, process left-over elements at the end, etc.? As far as I understand, the problems with auto vectorization stem from a difficulty of compilers to recognize vectorizing opportunities, and (as Manu described) from incompatible semantics of scalar and vector types that the compiler needs to preserve. But if that hypothetical `vectorize` helper forces the input data into one of a handful of well-known formats and types, wouldn't it be possible to make the compilers recognize those (especially if they are accompanied by suitable pragma or other compiler hints)? Contrary to popular belief, alignment is not a showstopper for SIMD code. Both Intel and ARM processors have instructions to access data at unaligned addresses, and on Intel processors there is not even a speed penalty for using them on aligned addresses. That means you can either forget about it (on Intel) or just check the data alignment before you start and choose an optimal specialization of the main loop. However, regarding auto-vectorization, I'm with Manu. I won't put my bet on auto-vectorization, because I have never seen any non-trivial auto-vectorized code that comes even close to hand-tuned SIMD code. The compiler always has to play conservatively. The compiler has no idea that you are only using 10 bits of each 16-bit component in a vector. It can't even help you shuffle RGBARGBARGBARGBA into . The best we can do is to create something that makes writing SIMD kernels easy, reusable and composable.
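That "check alignment once and pick a specialization" idea can be sketched as follows. This is a hedged C++ illustration: the two loop bodies here are identical placeholders, whereas a real implementation would use aligned SIMD loads (e.g. movaps) in one and unaligned loads (e.g. movups) in the other.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-ins for two specializations of the same main loop; in real
// code the aligned one would use aligned SIMD loads and the other
// unaligned loads. Bodies are deliberately trivial placeholders.
static void scaleAligned(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0f;
}
static void scaleUnaligned(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0f;
}

// Check alignment once, up front, and dispatch to the optimal
// specialization of the main loop, as described in the post.
void scaleBy2(const float* in, float* out, std::size_t n) {
    const bool bothAligned =
        reinterpret_cast<std::uintptr_t>(in) % 16 == 0 &&
        reinterpret_cast<std::uintptr_t>(out) % 16 == 0;
    (bothAligned ? scaleAligned : scaleUnaligned)(in, out, n);
}
```

Because the check happens once per call rather than per element, its cost is negligible for any reasonably sized buffer.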
Re: Taking pipeline processing to the next level
On Wednesday, 7 September 2016 at 00:18:59 UTC, Manu wrote: On 7 September 2016 at 01:54, Wyatt via Digitalmars-d wrote: Thanks, that's really interesting, I'll check it out. Here's some work on static rank polymorphism that might also be applicable: http://www.ccs.neu.edu/home/pete/pub/esop-2014.pdf And in the Related Work, I just noticed Halide, which sounds like it's right up your alley: http://halide-lang.org/ Of course, this comes with the caveat that this is (still!) some relatively heavily-academic stuff. And I'm not sure to what extent that can help mitigate the problem of relaxing type requirements such that you can e.g. efficiently ,/⍉ your 4 2⍴"LR" vector for SIMD on modern processors. That's not what I want though. I intend to hand-write that function (I was just giving examples of how auto-vectorisation almost always fails); the question here is how to work that new array function into our pipelines transparently... Ah, I misunderstood, sorry. I had the impression that you wanted to be able to simply write: data.map!(x => transform(x)).copy(output); ...for any data[] and have it lift the transformation to the whole vector. If you're doing the work, I'm curious what you're hoping the end result will look like in terms of the code you want to be able to write. Just a doodle is fine; it doesn't have to stand up to scrutiny. -Wyatt