Re: Vapoursynth - optimization

2020-04-03 Thread mantielero
It has improved a bit. Regarding the Nim filter:

  * 308fps by turning seq[seq[int]] into array[9, int32]
  * 442fps by changing the bounds checks into min( max(val, 0), max_val)

Still slow, but getting better.


import ../vapoursynth
import math

template clamp(val:int, max_val:int):untyped =
   min( max(val, 0), max_val)

proc apply_kernel*(src:ptr VSFrameRef, dst:ptr VSFrameRef, kernel:array[9, int32], mul:int, den:int) =
   let fi = API.getFrameFormat(src)  # Format information
   let n = (( math.sqrt(kernel.len.float).int - 1 ) / 2).int
   for i in 0..

Re: Vapoursynth - optimization

2020-04-03 Thread mratsim
This is the source of your slowness


let kernel = @[@[1, 2, 1],
   @[2, 4, 2],
   @[1, 2, 1] ]



Instead you should use


let kernel = [[1, 2, 1],
   [2, 4, 2],
   [1, 2, 1]]



You absolutely want all your frequently used data to be laid out contiguously 
in memory; `seq[seq[T]]` is the worst thing you can do for speed.

On current CPUs, the bottleneck in image processing code is memory access. 
Optimize your memory accesses and you will be fast.

Also, you want to use `ptr UncheckedArray` to avoid the bounds checks on every 
array access.
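
To make the two points above concrete, here is a minimal sketch of a 3x3 filter over a flat, contiguous plane accessed through `ptr UncheckedArray`. It is not the VapourSynth.nim code: `convolve3x3`, its parameters and the tiny test image are hypothetical, and it assumes a tightly packed single-plane float32 image (no stride padding). Borders are handled by clamping the coordinates, in the spirit of the `min(max(val, 0), max_val)` trick mentioned earlier in the thread.

```nim
# Sketch only: `convolve3x3` and its parameters are hypothetical names,
# not part of VapourSynth.nim. Assumes a tightly packed float32 plane.
proc convolve3x3*(src, dst: ptr UncheckedArray[float32];
                  width, height: int;
                  kernel: array[9, float32]; scale: float32) =
  ## Applies a flat 3x3 kernel; border pixels reuse the nearest valid pixel.
  for y in 0 ..< height:
    for x in 0 ..< width:
      var acc = 0'f32
      for ky in -1 .. 1:
        let yy = clamp(y + ky, 0, height - 1)   # clamp row into the image
        for kx in -1 .. 1:
          let xx = clamp(x + kx, 0, width - 1)  # clamp column into the image
          acc += src[yy * width + xx] * kernel[(ky + 1) * 3 + (kx + 1)]
      dst[y * width + x] = acc * scale

const w = 4
const h = 3
var input: array[w * h, float32]
var output: array[w * h, float32]
for i in 0 ..< input.len:
  input[i] = 1'f32                              # constant test image
let gauss = [1'f32, 2'f32, 1'f32, 2'f32, 4'f32, 2'f32, 1'f32, 2'f32, 1'f32]
convolve3x3(cast[ptr UncheckedArray[float32]](addr input[0]),
            cast[ptr UncheckedArray[float32]](addr output[0]),
            w, h, gauss, 1'f32 / 16'f32)
echo output
```

Because the kernel weights sum to 16 and the scale is 1/16, a constant image should come out unchanged, which makes a quick sanity check.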

A basic convolution would be written this way: 
[https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_direct_convolution.nim#L8-L73](https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_direct_convolution.nim#L8-L73)
 (note that this is a convolution for a batch of images in NCHW format for N 
batch, C color channels, H height, W width)

In terms of performance it reaches about 1.5% of the theoretical peak.

Further speed improvements are much more involved. You can reach about 10% of 
the theoretical maximum by implementing your convolution using matrix 
multiplication from an optimized BLAS library: 
[https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_im2col.nim](https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_im2col.nim).
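
As a rough illustration of the idea (not the linked laser code), the sketch below copies every 3x3 patch of a single-channel image into one row of a matrix, so the filter becomes a matrix-vector product. The names `im2col` and `convViaIm2col` are hypothetical, no border handling is done ("valid" output only), and a plain loop stands in for the BLAS call that would do the multiplication in a real implementation.

```nim
# Sketch only: `im2col` and `convViaIm2col` are hypothetical helper names.
proc im2col(img: seq[float32]; width, height: int): seq[float32] =
  ## Copies each 3x3 patch of `img` into one row of an (outH*outW) x 9 matrix.
  let outW = width - 2
  let outH = height - 2
  result = newSeq[float32](outW * outH * 9)
  for y in 0 ..< outH:
    for x in 0 ..< outW:
      let row = (y * outW + x) * 9
      for ky in 0 ..< 3:
        for kx in 0 ..< 3:
          result[row + ky * 3 + kx] = img[(y + ky) * width + (x + kx)]

proc convViaIm2col(img: seq[float32]; width, height: int;
                   kernel: array[9, float32]): seq[float32] =
  ## 3x3 filter without border handling, written as matrix times vector.
  let cols = im2col(img, width, height)
  let n = (width - 2) * (height - 2)
  result = newSeq[float32](n)
  for r in 0 ..< n:            # the part an optimized BLAS routine would replace
    var acc = 0'f32
    for k in 0 ..< 9:
      acc += cols[r * 9 + k] * kernel[k]
    result[r] = acc

var img = newSeq[float32](16)  # 4x4 test image holding 0, 1, ..., 15
for i in 0 ..< img.len:
  img[i] = float32(i)
let gaussKernel = [1'f32, 2'f32, 1'f32, 2'f32, 4'f32, 2'f32, 1'f32, 2'f32, 1'f32]
let filtered = convViaIm2col(img, 4, 4, gaussKernel)
echo filtered
```

The trade-off is extra memory traffic for the patch matrix in exchange for handing the arithmetic to a heavily tuned routine.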

Achieving about 80% to 90% of the theoretical maximum is something I haven't 
managed yet, but the steps are detailed there: 
[https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md](https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md),
 in particular in the Intel paper Anatomy of High-performance Deep Learning 
Convolutions on SIMD Architecture (2018-08).

Looking into the C++ code, it looks like a naive implementation that would use 
about 1.5% of theoretical peak.

For the absolute max performance, you can reach over 110% of the theoretical 
peak by using Winograd convolutions, as your kernel is of shape 3x3. Winograd 
convolution "cheats" by exploiting this special 3x3 form to avoid useless 
operations, hence going over 100% of the theoretical CPU peak. A short 
high-level explanation is here: 
[https://blog.usejournal.com/understanding-winograd-fast-convolution-a75458744ff](https://blog.usejournal.com/understanding-winograd-fast-convolution-a75458744ff)
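
To show where the saving comes from, here is a minimal sketch of the one-dimensional Winograd building block usually written F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. The 2D 3x3 case nests the same transform along both axes. The proc names `direct` and `winogradF23` are hypothetical, and this is only the textbook identity, not an optimized implementation.

```nim
# Sketch only: hypothetical names, 1D building block of Winograd convolution.
proc direct(d: array[4, float32]; g: array[3, float32]): array[2, float32] =
  ## Reference: two outputs of a 3-tap filter, 6 multiplications.
  result[0] = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
  result[1] = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]

proc winogradF23(d: array[4, float32]; g: array[3, float32]): array[2, float32] =
  ## Same two outputs with 4 multiplications involving the data.
  # Filter transform: depends only on the kernel, so it can be precomputed once.
  let u1 = (g[0] + g[1] + g[2]) * 0.5'f32
  let u2 = (g[0] - g[1] + g[2]) * 0.5'f32
  # Four element-wise products on transformed data.
  let m1 = (d[0] - d[2]) * g[0]
  let m2 = (d[1] + d[2]) * u1
  let m3 = (d[2] - d[1]) * u2
  let m4 = (d[1] - d[3]) * g[2]
  # Output transform: additions only.
  result[0] = m1 + m2 + m3
  result[1] = m2 - m3 - m4

let data = [1'f32, 2'f32, 3'f32, 4'f32]
let taps = [1'f32, 2'f32, 1'f32]
echo direct(data, taps)
echo winogradF23(data, taps)
```

Both procs should print the same pair of values; the Winograd version simply trades multiplications for cheaper additions.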

I've planned to write an image processing and deep learning compiler to make it 
easy to write high-performance deep learning and image processing code but not 
enough time :/. 


Re: Vapoursynth - optimization

2020-04-03 Thread mantielero
**Working with C filters**

When I work with C filters (like Convolution), it works fine with the refc GC. 
In that case, I don't know why I get 2640fps versus 3740fps with the Python 
version; they should be roughly the same. When I call a filter from an existing 
plugin, I just call one function:


...
result = API.invoke(plug, "Convolution".cstring, args)


I don't do any calculation with the frames. The only thing that should differ 
from Python is that they use vspipe to test the filter. I created a Null filter 
in pure Nim which is really simple (it just asks for a frame, i.e. a pointer, 
and frees it):


proc Null*(vsmap:ptr VSMap) =
  let node = getFirstNode(vsmap)
  let vinfo = API.getVideoInfo(node) # video info pointer
  for i in 0..

Re: Vapoursynth - optimization

2020-04-02 Thread Stefan_Salewski
> At this stage I can only recommend NOT to use VapourSynth.nim. It is too slow

Please note that your first task would be to make it work with default refc GC 
and with arc GC.

`--gc:none` was only suggested for tests, as you reported that it crashes with 
the refc GC.
  
For performance you would never use a seq[seq] for your filter matrix, but a 
contiguous block of memory which lives in the cache permanently. You would also 
want to use SIMD instructions. For SIMD you may try to code it manually (I 
think we have a simd module provided by mratsim), or you could write Nim code 
that the Nim compiler converts into something to which the C compiler can apply 
SIMD instructions.


Re: Vapoursynth - optimization

2020-04-02 Thread mantielero
At this stage I can only recommend NOT to use VapourSynth.nim. It is too slow 
(not because of nim, but I cannot find out why).

Testing the convolution filter in Python:


import vapoursynth as vs
core = vs.get_core()
core.std.SetMaxCPU('none')
clip = core.std.BlankClip(format=vs.GRAYS, length=10, fpsnum=24000, 
fpsden=1001, keep=True)
clip = core.std.Convolution(clip, matrix=[1,2,1,2,4,2,1,2,1])
clip.set_output()


So I get:


$ vspipe test.vpy /dev/null
Output 10 frames in 26.73 seconds (3740.91 fps)

My version:


import ../vapoursynth
import options
BlankClip( format=pfGrayS.int.some,
   width=640.some,
   height=480.some,
   length=10.some,
   fpsnum=24000.some,
   fpsden=1001.some,
   keep=1.some).Convolution(@[1.0,2.0,1.0,2.0,4.0,2.0,1.0,2.0,1.0]).Null


so:


$ nim c -f --gc:none -d:release -d:danger modifyframe
$ time ./modifyframe

real 0m37,872s
user 0m38,989s
sys 0m1,997s

which is: 2640.47fps

On the other hand, you can create your own filters. In that regard, I have 
managed to apply a simple Gauss filter to 10 frames in:


$ time ./modifyframe

real 8m25,425s
user 8m24,112s
sys 0m5,422s

which is 198fps. Way too slow when compared with the C++ version.
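
For reference, the fps figures are just frames divided by wall-clock seconds. The small sketch below redoes that arithmetic from the `real` timings; note that the quoted rates only come out if the runs processed 100000 frames, which is an assumption inferred from the timings and not what the scripts above show. The helper name `fps` is hypothetical.

```nim
# Sketch only: `fps` is a hypothetical helper; the frame count is an
# assumption inferred from the reported timings, not taken from the scripts.
proc fps(frames: int; minutes, seconds: float): float =
  ## Frames per second from a `time`-style "XmY.YYYs" reading.
  frames.float / (minutes * 60.0 + seconds)

const assumedFrames = 100_000

echo fps(assumedFrames, 0.0, 37.872)   # Nim calling the C Convolution filter
echo fps(assumedFrames, 8.0, 25.425)   # pure Nim Gauss filter
```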


Re: Vapoursynth - optimization

2020-03-31 Thread doofenstein
I would prefer UncheckedArray, since it's integrated directly into Nim and also 
because I personally like its semantics.

If you want to expose an API in a safer way, UncheckedArrays also have the 
advantage that they can be 
[cast](https://nim-lang.org/docs/system.html#toOpenArray%2Cptr.UncheckedArray%5BT%5D%2Cint%2Cint)
 into openArrays.
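
A minimal sketch of that pattern, assuming a small stand-in buffer instead of a real frame: the raw pointer is viewed as a `ptr UncheckedArray` and then passed to an ordinary `openArray` proc through `system.toOpenArray`, without copying. The names `total`, `plane` and `raw` are hypothetical.

```nim
# Sketch only: `total`, `plane` and `raw` are hypothetical names.
proc total(xs: openArray[float32]): float32 =
  ## Ordinary, bounds-checked Nim code that knows nothing about raw pointers.
  for x in xs:
    result += x

var plane = [1'f32, 2'f32, 3'f32, 4'f32]      # stands in for frame data
let raw = cast[ptr UncheckedArray[float32]](addr plane[0])

# View the whole buffer, then only the middle two elements, without copying.
echo total(raw.toOpenArray(0, plane.high))
echo total(raw.toOpenArray(1, 2))
```

The caller has to supply the valid index range, since the unchecked array itself carries no length.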


Re: Vapoursynth - optimization

2020-03-31 Thread mantielero
Let me try asking less general questions:

What is preferred: using UncheckedArray, or using [these 
templates](https://forum.nim-lang.org/t/1188#7366)? I guess either is better 
than performing a memcopy into a Nim structure.


Re: Vapoursynth - optimization

2020-03-30 Thread mantielero
Just for the record the C++ version is 
[here](https://github.com/IFeelBloated/vsFilterScript/blob/master/GaussBlur.hxx#L23)
 using templates.

And this is by using the [low level 
API](https://github.com/IFeelBloated/test_c_filters/blob/master/GaussBlur.cxx#L26).