# Parallelism

Parallelism is a hard problem, and the right approach really depends on what kind of 
parallelism you need.

For my problem domain (numerical computing, machine learning, deep learning), 
Nim macros + OpenMP offer me unparalleled flexibility. For example, I can reach 
the speed of OpenBLAS, a matrix multiplication library that has been tuned for 
10+ years and builds on 20+ years of research on optimisations.
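
To give an idea of what this combination looks like, here is a minimal sketch 
(a toy element-wise loop, not the packed/tiled GEMM kernel you actually need to 
match OpenBLAS): Nim's built-in `||` iterator lowers to an OpenMP `parallel for` 
pragma in the generated C.

```nim
# Compile with: nim c -d:release --passC:-fopenmp --passL:-fopenmp saxpy.nim
# The `||` iterator emits `#pragma omp parallel for` in the generated C code.

proc saxpy(a: float32, x: openArray[float32], y: var openArray[float32]) =
  assert x.len == y.len
  for i in 0 || (x.len - 1):      # iterations are split across OpenMP threads
    y[i] += a * x[i]

when isMainModule:
  var
    x = newSeq[float32](1_000_000)
    y = newSeq[float32](1_000_000)  # zero-initialised
  for i in 0 ..< x.len:
    x[i] = float32(i)
  saxpy(2.0'f32, x, y)
  echo y[0], " ", y[1], " ", y[2]   # 0.0 2.0 4.0
```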

I do plan to carry this success over to convolutions, which are still lagging on CPU. 
I still have to benchmark actual implementations, but naive convolutions are 
incredibly slow and would need to be optimised about 40x to reach 80% of the 
theoretical CPU peak performance (only a few operations like matrix 
multiplication and convolution can saturate the CPU's theoretical performance; 99% of 
the others are bounded by memory bandwidth). Convolutions are used every time 
you need to blur, sharpen, enhance or detect edges in images, and they are key to 
image, sound and speech perception in deep learning.
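
For reference, this is roughly what the naive baseline looks like: a hypothetical 
single-channel, "valid"-padding direct convolution (the cross-correlation used in 
deep learning), with none of the tiling, register blocking or vectorisation the 
fast kernels need.

```nim
# Naive direct 2D convolution (valid padding, single channel).
# This is the slow baseline: no tiling, no SIMD, so it stays far from peak FLOPs.

proc conv2dNaive(input, kernel: seq[seq[float32]]): seq[seq[float32]] =
  let
    kh = kernel.len
    kw = kernel[0].len
    oh = input.len - kh + 1
    ow = input[0].len - kw + 1
  result = newSeq[seq[float32]](oh)
  for i in 0 ..< oh:
    result[i] = newSeq[float32](ow)
    for j in 0 ..< ow:
      var acc = 0.0'f32
      for ki in 0 ..< kh:
        for kj in 0 ..< kw:
          acc += input[i + ki][j + kj] * kernel[ki][kj]
      result[i][j] = acc

when isMainModule:
  let img = @[@[1'f32, 2'f32, 3'f32], @[4'f32, 5'f32, 6'f32], @[7'f32, 8'f32, 9'f32]]
  let sharpen = @[@[0'f32, -1'f32, 0'f32], @[-1'f32, 5'f32, -1'f32], @[0'f32, -1'f32, 0'f32]]
  echo conv2dNaive(img, sharpen)   # 1x1 output for a 3x3 image and a 3x3 kernel
```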

C, C++ and Fortran can also use OpenMP, but they are severely lacking in dev 
productivity and in the potential to offer a nice high-level wrapper.

# Replacing or expanding

Furthermore, there is no need to "replace". There are problem domains with no 
established language, especially for production. (What strategy consultants call a 
`blue ocean`: don't compete where there are already plenty of fish, migrate to a 
new, uncharted place.)

Interestingly, in the two I'm thinking of, research is done in Python.

## Blockchain

The first one is blockchain. Beyond Status, others have been using Nim due to 
its easy interface with C++ and easy translation of Python research code:

  * [EmberCrypto](https://github.com/EmberCrypto/Ember)
  * [KIP Foundation](https://github.com/KIPFoundation/nim-ewasm-contracts)



## Reinforcement learning

The second one is reinforcement learning. Everything is done in Python and 
there is no standard way to produce and ship a generic AI that can play 
[platformers for example](https://www.youtube.com/watch?v=qv6UVOQ0F44) (note 
that this is quite different from deep learning, as with reinforcement learning 
we don't know the correct solution, and the neural network in the video is also 
different in a fundamental way).

Contrary to traditional machine learning and deep learning, not every language 
under the sun has a graveyard of failed reinforcement learning projects.

Furthermore, while most languages have matrix libraries, most compiled languages 
do not have a basic 4-dimensional tensor library, which is needed to go beyond 
simple statistical or evolutionary reinforcement learning (think genetic 
algorithms) and add visual perception to the mix.
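
In Nim that box is ticked. A sketch assuming 
[Arraymancer](https://github.com/mratsim/Arraymancer) as the tensor library 
(the shapes and the `randomTensor` call here are purely illustrative):

```nim
# Sketch assuming the Arraymancer library (`nimble install arraymancer`).
# A batch of observations for visual RL is naturally a 4D tensor:
# [batch, channels, height, width].
import arraymancer

let batch = randomTensor([8, 3, 84, 84], 1.0'f32)   # 8 RGB frames of 84x84
echo batch.shape                                    # [8, 3, 84, 84]
echo batch.reshape(8, 3 * 84 * 84).shape            # flattened view for a dense layer
```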

And lastly, the only way to check reinforcement learning successes is with 
controlled experiments. It is easy to implement toy examples, but advanced 
examples basically need emulator bindings, and most emulators are written in 
C++. For example, I easily wrapped the [Arcade Learning 
Environment](https://github.com/mgbellemare/Arcade-Learning-Environment) from [C++ to 
Nim](https://github.com/numforge/agent-smith) to do controlled experiments on Atari games.
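
Nim's C++ interop boils down to `importcpp` declarations. A hypothetical minimal 
sketch (method names follow ALE's public C++ API, but this is illustrative, not 
the actual agent-smith bindings):

```nim
# Hypothetical minimal ALE wrapper sketch.
# Compile on the C++ backend and link ALE, e.g.: nim cpp --passL:-lale agent.nim

type
  AleInterface {.importcpp: "ALEInterface", header: "ale_interface.hpp".} = object

proc newAle(): ptr AleInterface {.importcpp: "new ALEInterface(@)".}
proc loadROM(ale: ptr AleInterface, rom: cstring) {.importcpp: "loadROM".}
proc act(ale: ptr AleInterface, action: cint): cint
  # the cast assumes ALE's global `Action` enum; adjust for namespaced versions
  {.importcpp: "#->act((Action) #)".}
proc gameOver(ale: ptr AleInterface): bool {.importcpp: "game_over".}
proc resetGame(ale: ptr AleInterface) {.importcpp: "reset_game".}

when isMainModule:
  let ale = newAle()
  ale.loadROM("pong.bin")
  var totalReward = 0
  while not ale.gameOver():
    totalReward += ale.act(0).int   # action 0 is ALE's no-op
  echo "episode reward: ", totalReward
  ale.resetGame()
```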

## GPU computing

One note on GPU computing: CUDA, OpenCL, AMD ROCm and Vulkan Compute are very 
painful to deal with (I'm not sure if it is also the case with OpenGL or 
DirectX), but being able to produce C or C++ code is a killer advantage to 
abstract away all the GPU mess.
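
Concretely, because Nim compiles through C/C++, talking to the CUDA runtime is 
just a few `importc` declarations against `cuda_runtime.h`. A sketch (the kernel 
itself would still be written in CUDA C, or generated, and linked in; paths and 
flags are assumptions):

```nim
# Sketch: binding the CUDA runtime directly from Nim (C backend), assuming
# cuda_runtime.h and libcudart are installed, e.g.:
#   nim c --passC:-I/usr/local/cuda/include --passL:"-L/usr/local/cuda/lib64 -lcudart" gpu.nim

type
  cudaMemcpyKind {.size: sizeof(cint).} = enum
    cudaMemcpyHostToHost, cudaMemcpyHostToDevice,
    cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice

proc cudaMalloc(devPtr: ptr pointer, size: csize_t): cint
  {.importc, header: "cuda_runtime.h".}
proc cudaMemcpy(dst, src: pointer, count: csize_t, kind: cudaMemcpyKind): cint
  {.importc, header: "cuda_runtime.h".}
proc cudaFree(devPtr: pointer): cint
  {.importc, header: "cuda_runtime.h".}

when isMainModule:
  var host = newSeq[float32](1024)
  let bytes = csize_t(host.len * sizeof(float32))
  var devBuf: pointer
  doAssert cudaMalloc(addr devBuf, bytes) == 0          # 0 == cudaSuccess
  doAssert cudaMemcpy(devBuf, addr host[0], bytes, cudaMemcpyHostToDevice) == 0
  # ... launch a kernel on devBuf here ...
  doAssert cudaFree(devBuf) == 0
```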

## JIT, VMs and interpreters


I've implemented several VMs in the past year (the [Nimbus 
VM](https://github.com/status-im/nimbus/blob/6a24701bbf0dab12ddbcf76560ecf3f745429823/nimbus/vm/interpreter_dispatch.nim#L22-L36)
 for blockchain, [Glyph for SuperNES emulation 
(incomplete)](https://github.com/mratsim/glyph/blob/8b278c5e76c3f1053a196173a93686afda0596cc/glyph/snes/opcodes.nim#L16-L32)
 and the [Photon 
JIT](https://github.com/numforge/laser/blob/9fbb8d2a573d950573c7249e3a5d6cdd784a639e/laser/photon_jit/x86_64/x86_64_ops.nim#L24-L51),
 an x86-64 JIT assembler) and I don't see any language competing with Nim in 
this space.

Thanks to metaprogramming, opcode mapping is a breeze and you can cleanly 
separate your [dispatch 
technique](https://github.com/status-im/nimbus/wiki/Interpreter-optimization-resources)
 from your opcode implementations while avoiding the function call/vtable overhead 
that kills your cache.
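
A rough illustration of the dispatch side (a toy stack machine, not the Nimbus 
or Glyph code): a `case` inside a `while` loop marked `{.computedGoto.}` compiles 
to a jump table, so each opcode body runs inline with no call or vtable 
indirection; in the real VMs the opcode table and this case statement are 
generated by a macro.

```nim
# Toy stack-machine dispatch loop illustrating the pattern.
# {.computedGoto.} turns the case into a jump table: no per-opcode function call.

type Opcode = enum
  opPush, opAdd, opMul, opHalt

proc run(code: openArray[int]): int =
  var
    stack: seq[int]
    pc = 0
  while true:
    {.computedGoto.}
    let op = Opcode(code[pc])
    case op
    of opPush:
      stack.add code[pc + 1]
      pc += 2
    of opAdd:
      let b = stack.pop(); let a = stack.pop()
      stack.add a + b
      inc pc
    of opMul:
      let b = stack.pop(); let a = stack.pop()
      stack.add a * b
      inc pc
    of opHalt:
      return stack.pop()

when isMainModule:
  # (2 + 3) * 4
  echo run([ord(opPush), 2, ord(opPush), 3, ord(opAdd),
            ord(opPush), 4, ord(opMul), ord(opHalt)])   # 20
```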
