Re: Vapoursynth - optimization
It has improved a bit. Regarding the Nim filter:

* 308fps by turning `seq[seq[int]]` into `array[9, int32]`
* 442fps by changing the bound checks into `min( max(val, 0), max_val)`

Still slow, but getting better.

```nim
import ../vapoursynth
import math

template clamp(val:int, max_val:int):untyped =
  min( max(val, 0), max_val)

proc apply_kernel*(src:ptr VSFrameRef, dst:ptr VSFrameRef, kernel:array[9, int32], mul:int, den:int) =
  let fi = API.getFrameFormat(src) # Format information
  let n = (( math.sqrt(kernel.len.float).int - 1 ) / 2).int
  for i in 0..
```
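To make the clamping idea concrete, here is a minimal, self-contained sketch of a 3x3 convolution over a flat row-major plane. It does not use the VapourSynth API; `clampIndex`, `convolve3x3` and the test plane are hypothetical names chosen for illustration, and it assumes the edge pixels should simply be replicated.

```nim
# Hypothetical helper: keep an index inside 0 .. maxVal without branching on bounds.
template clampIndex(val, maxVal: int): int =
  min(max(val, 0), maxVal)

proc convolve3x3(src: openArray[int32], width, height: int,
                 kernel: array[9, int32], den: int32): seq[int32] =
  ## 3x3 convolution over a row-major plane; out-of-range taps reuse the edge pixel.
  result = newSeq[int32](width * height)
  for y in 0 ..< height:
    for x in 0 ..< width:
      var acc: int32 = 0
      for ky in -1 .. 1:
        let yy = clampIndex(y + ky, height - 1)
        for kx in -1 .. 1:
          let xx = clampIndex(x + kx, width - 1)
          acc += kernel[(ky + 1) * 3 + (kx + 1)] * src[yy * width + xx]
      result[y * width + x] = acc div den

when isMainModule:
  let plane = [10'i32, 10, 10, 10, 10, 10, 10, 10, 10]  # constant 3x3 image
  let gauss = [1'i32, 2, 1, 2, 4, 2, 1, 2, 1]           # weights sum to 16
  echo convolve3x3(plane, 3, 3, gauss, 16)
```

Because every index is clamped before use, the inner loop never needs a separate "am I at the border?" test, which is presumably where the gain over explicit bound checks comes from.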
Re: Vapoursynth - optimization
This is the source of your slowness:

```nim
let kernel = @[@[1, 2, 1], @[2, 4, 2], @[1, 2, 1] ]
```

Instead you should use:

```nim
let kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
```

You absolutely want all your frequently used data to be laid out contiguously in memory; `seq[seq[T]]` is the worst thing you can do for speed. On current CPUs, the bottleneck in image processing code is memory access. Optimize memory accesses and you will be fast.

Also, you want to use `ptr UncheckedArray` to avoid the bounds checks on every array access.

A basic convolution would be written this way: [https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_direct_convolution.nim#L8-L73](https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_direct_convolution.nim#L8-L73) (note that this is a convolution for a batch of images in NCHW format: N batch, C color channels, H height, W width). In terms of performance it reaches about 1.5% of the theoretical peak.

Further speed improvements are much more involved. You can reach about 10% of the theoretical maximum by implementing your convolution using matrix multiplication from an optimized BLAS library: [https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_im2col.nim](https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_im2col.nim).

Achieving about 80%~90% of the theoretical maximum is something I haven't managed yet, but the steps are detailed here: [https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md](https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md), in particular in the Intel paper "Anatomy of High-performance Deep Learning Convolutions on SIMD Architecture" (2018-08).

Looking into the C++ code, it looks like a naive implementation that would use about 1.5% of theoretical peak.
For the absolute max performance, you can reach over 110% of the theoretical peak by using Winograd convolutions, as your kernel is of shape 3x3. Winograd convolution "cheats" by exploiting this special 3x3 form to avoid doing useless operations, which is how it goes over 100% of the theoretical CPU peak. There is a small high-level explanation here: [https://blog.usejournal.com/understanding-winograd-fast-convolution-a75458744ff](https://blog.usejournal.com/understanding-winograd-fast-convolution-a75458744ff)

I've planned to write an image processing and deep learning compiler to make it easy to write high-performance deep learning and image processing code, but I don't have enough time :/.
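As a rough illustration of why Winograd can beat the naive operation count, here is the one-dimensional building block, usually written F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. The proc names are hypothetical, and this is only the 1D case; the 3x3 version nests the same transform in both directions (16 multiplications instead of 36 per 2x2 output tile).

```nim
proc direct(d: array[4, float], g: array[3, float]): array[2, float] =
  ## Reference: 6 multiplications for 2 outputs.
  [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
   d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

proc winogradF23(d: array[4, float], g: array[3, float]): array[2, float] =
  ## Winograd F(2,3): 4 multiplications for the same 2 outputs.
  # Filter transform: depends only on the kernel, so it can be computed once.
  let
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
  # Element-wise products with the transformed input (the 4 multiplications).
  let
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
  # Output transform: additions only.
  [m1 + m2 + m3, m2 - m3 - m4]

when isMainModule:
  let pixels = [1.0, 2.0, 3.0, 4.0]
  let taps = [1.0, 2.0, 1.0]
  echo direct(pixels, taps)
  echo winogradF23(pixels, taps)
```

Both procs should print the same pair of values; the saving comes from trading multiplications for cheaper additions and from precomputing the filter transform.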
Re: Vapoursynth - optimization
**Working with C filters**

When I work with C filters (like Convolution) it works fine with the refc GC. In that case, I don't know why I get 2640fps vs 3740fps on the Python version. They should be roughly the same. When I call a filter within an existing plugin I just call one function:

```nim
...
result = API.invoke(plug, "Convolution".cstring, args)
```

I don't do any calculation with the frames. The only thing that should be different with regard to Python is that, in order to test the filter, they use vspipe.

I created a Null filter in pure Nim which is really simple (it just asks for a frame, i.e. a pointer, and frees it):

```nim
proc Null*(vsmap:ptr VSMap) =
  let node = getFirstNode(vsmap)
  let vinfo = API.getVideoInfo(node) # video info pointer
  for i in 0..
```
Re: Vapoursynth - optimization
> At this stage I can only recommend NOT to use VapourSynth.nim. It is too slow

Please note that your first task would be to make it work with the default refc GC and with the arc GC. `--gc:none` was only suggested for tests, as you reported that it crashes with the refc GC.

For performance you would never use a `seq[seq]` for your filter matrix, but a contiguous block of memory which lives in the cache permanently. And you would want to use SIMD instructions. For SIMD you may try to code it manually (I think we have a SIMD module provided by mratsim), or you would write Nim code that the Nim compiler converts into something to which the C compiler can apply SIMD instructions.
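A small sketch of what "contiguous" means here: a nested `array` is a single block in which element `[r][c]` sits at offset `r*3 + c`, whereas each inner `seq` of a `seq[seq[int]]` is a separate heap allocation. The helper name `isRowMajorContiguous` is hypothetical, and the check relies on raw address arithmetic purely for demonstration.

```nim
# Hypothetical helper: verify that element [r][c] lives at offset (r*3 + c).
proc isRowMajorContiguous(k: var array[3, array[3, int]]): bool =
  let base = cast[uint](addr k[0][0])
  for r in 0 .. 2:
    for c in 0 .. 2:
      let expected = base + uint((r * 3 + c) * sizeof(int))
      if cast[uint](addr k[r][c]) != expected:
        return false
  true

var flat = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]        # array[3, array[3, int]]
var nested = @[@[1, 2, 1], @[2, 4, 2], @[1, 2, 1]]  # seq[seq[int]]

echo "flat is one block: ", isRowMajorContiguous(flat)
echo "bytes in flat: ", sizeof(flat)
# Each inner seq is its own allocation, so the distance between two rows
# is whatever the allocator happened to choose.
echo "row distance in nested: ",
  cast[int](addr nested[1][0]) - cast[int](addr nested[0][0])
```

With the flat array the whole kernel fits in a single cache line or two; with the nested seq every row access goes through an extra pointer.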
Re: Vapoursynth - optimization
At this stage I can only recommend NOT to use VapourSynth.nim. It is too slow (not because of Nim, but I cannot find out why).

Testing the convolution filter in Python:

```python
import vapoursynth as vs
core = vs.get_core()
core.std.SetMaxCPU('none')
clip = core.std.BlankClip(format=vs.GRAYS, length=10, fpsnum=24000, fpsden=1001, keep=True)
clip = core.std.Convolution(clip, matrix=[1,2,1,2,4,2,1,2,1])
clip.set_output()
```

So I get:

```
$ vspipe test.vpy /dev/null
Output 10 frames in 26.73 seconds (3740.91 fps)
```

My version:

```nim
import ../vapoursynth
import options

BlankClip( format=pfGrayS.int.some,
           width=640.some,
           height=480.some,
           length=10.some,
           fpsnum=24000.some,
           fpsden=1001.some,
           keep=1.some).Convolution(@[1.0,2.0,1.0,2.0,4.0,2.0,1.0,2.0,1.0]).Null
```

so:

```
$ nim c -f --gc:none -d:release -d:danger modifyframe
$ time ./modifyframe
real 0m37,872s
user 0m38,989s
sys  0m1,997s
```

which is 2640.47fps.

On the other hand, you can create your own filters. In that regard, I have managed to apply a simple Gauss filter to 10 frames in:

```
$ time ./modifyframe
real 8m25,425s
user 8m24,112s
sys  0m5,422s
```

which is 198fps. Way too slow when compared with the C++ version.
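For reference, the fps figures can be reproduced from the wall-clock times as frames divided by seconds. The sketch below assumes a clip of 100,000 frames; that count is not stated in the post (which shows 10) and is only inferred from the fact that it makes all three figures line up. `fps` is a hypothetical helper.

```nim
proc fps(frames, minutes: int, seconds: float): float =
  ## Frames per second from a `time`-style "XmY.Zs" wall-clock reading.
  frames.float / (minutes.float * 60 + seconds)

const assumedFrames = 100_000  # assumption, inferred from the reported timings

echo "vspipe (Python): ", fps(assumedFrames, 0, 26.73)
echo "Nim, C filter:   ", fps(assumedFrames, 0, 37.872)
echo "Nim, Gauss:      ", fps(assumedFrames, 8, 25.425)
```

Under that assumption the three results land close to the 3740, 2640 and 198 fps quoted above.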
Re: Vapoursynth - optimization
I would prefer UncheckedArray, since it's integrated directly into Nim and also because I personally like its semantics. If you want to expose an API in a safer way, UncheckedArrays also have the advantage that they can be [converted](https://nim-lang.org/docs/system.html#toOpenArray%2Cptr.UncheckedArray%5BT%5D%2Cint%2Cint) into openArrays with `toOpenArray`.
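A minimal sketch of that pattern: the raw buffer is viewed through a `ptr UncheckedArray`, and a slice of it is handed to an ordinary bounds-checked proc via `toOpenArray`, without copying. The array `buffer` stands in for a frame plane, and `sumPixels` and `rowSum` are hypothetical names.

```nim
proc sumPixels(row: openArray[int32]): int =
  ## Safe, bounds-checked API: it never sees the raw pointer.
  for px in row:
    result += int(px)

proc rowSum(plane: ptr UncheckedArray[int32], width, y: int): int =
  ## View row `y` of a row-major plane as an openArray, without copying.
  sumPixels(plane.toOpenArray(y * width, y * width + width - 1))

var buffer = [1'i32, 2, 3, 4, 5, 6]  # stand-in for a 3x2 frame plane
let plane = cast[ptr UncheckedArray[int32]](addr buffer[0])

plane[2] = 30                        # unchecked write straight into the buffer
echo rowSum(plane, 3, 0)
echo rowSum(plane, 3, 1)
```

The unchecked access stays confined to one small proc, while everything downstream works on a normal `openArray`.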
Re: Vapoursynth - optimization
Let me try asking less general questions: which is preferred, using UncheckedArray or using [these templates](https://forum.nim-lang.org/t/1188#7366)? I guess either is better than performing a memcopy into a Nim structure.
Re: Vapoursynth - optimization
Just for the record, the C++ version using templates is [here](https://github.com/IFeelBloated/vsFilterScript/blob/master/GaussBlur.hxx#L23). And this is the version using the [low level API](https://github.com/IFeelBloated/test_c_filters/blob/master/GaussBlur.cxx#L26).