Re: [LAD] Linux-audio-dev Digest, Vol 44, Issue 6

2010-10-12 Thread Stéphane Letz
> 
> Message: 9
> Date: Mon, 04 Oct 2010 13:51:07 +0200
> From: Max Tandetzky 
> Subject: [LAD] CUDA implementation for calf
> To: linux-audio-dev@lists.linuxaudio.org
> Message-ID: <4ca9bfab.2090...@uni-jena.de>
> Content-Type: text/plain; charset=ISO-8859-15; format=flowed
> 
> Hello,
> 
> I am new here, so I hope this is the right place to talk about what I 
> want to do.
> I want to make a CUDA implementation of the algorithms from the 
> Calf plugins. On the front end there should be a button (or some other 
> control) to activate and deactivate the CUDA support. I have already 
> written a JACK program (which makes some simple changes to audio data) 
> using CUDA. It works well, and at first sight the performance looks 
> promising.
> 
> I have read part of the mailing list archive and found that there 
> has already been a discussion about audio processing with CUDA. I know there 
> are some reasons for not using CUDA, like the need to use the proprietary 
> Nvidia driver, the limitation that only people who have an Nvidia card 
> will benefit, and so on. But a CUDA implementation may show 
> what performance can be reached and may be useful for Nvidia users 
> immediately.
> I know there is OpenCL, but it is not as sophisticated as CUDA at the 
> moment, will have less performance than CUDA, and I do not have the time 
> to learn OpenCL at the moment (but the project has to be finished soon). 
> I have heard it is not too much work to port existing CUDA code to OpenCL 
> later (assuming there is already an OpenCL equivalent for all the 
> CUDA functions that were used).
> So I want to do this with CUDA.
> 
> At the moment I have some questions:
> 1. Has anybody already done, or is anybody doing, something like this?
> 2. Where can I get information about making specific changes to the Calf 
> code? (I have examined it a bit, but it will take time to understand the 
> structure of the program from the code alone; the GUI part in particular 
> seems to be designed in a rather complex way.)
> 
> It would be nice if I could get some help here.
> 
> Regards
> 
> Max Tandetzky

Hi Max,

I've done some tests using OpenCL in the context of the Faust project 
(http://faust.grame.fr/). So far the results are not really good, and I guess 
CUDA/OpenCL will be usable only in specific cases. I'll probably now test whether 
using CUDA directly would give some benefit. Maybe we can share some ideas?

Stéphane 







Re: [LAD] Linux-audio-dev Digest, Vol 44, Issue 6

2010-10-12 Thread Jens M Andreasen

On Tue, 2010-10-12 at 16:29 +0200, Stéphane Letz wrote:
> > 

> I've done some tests using OpenCL in the context of the Faust project
> (http://faust.grame.fr/). So far the results are not really good, and I
> guess CUDA/OpenCL will be usable only in specific cases. 

What kinds of parallelism have you been exploring? 

I have found that the multiple channel strip approach with mixdown to
subgroups is straightforward for a DAW, as well as for polysynths.

Global sync points between multiprocessors - for cascaded processing -
work up to a limit, after which the squared cost of the sync eats up the
computational value of the added multiprocessor. I am syncing the six MPs
on a GT220 every 16 samples at 96 kHz with a penalty in the 5% range. On
higher-end cards this approach is not very useful, though, and it is also
discouraged by Nvidia staffers ...
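
(To illustrate what such a sync point can look like - a hand-written sketch,
not the code I'm actually running, and the names are invented - a software
barrier across thread blocks, callable e.g. every 16 samples from a persistent
kernel. It only works when there are no more blocks than multiprocessors, so
that every block is resident and may safely spin:)

  // Illustrative only: a global sync point across thread blocks.
  __device__ volatile unsigned int g_generation = 0;
  __device__ unsigned int g_count = 0;

  __device__ void global_sync(unsigned int num_blocks)
  {
      __syncthreads();                        // everyone in this block has arrived
      if (threadIdx.x == 0) {
          unsigned int gen = g_generation;
          __threadfence();                    // make this block's writes visible
          if (atomicInc(&g_count, num_blocks - 1) == num_blocks - 1) {
              g_generation = gen + 1;         // last block in: open the barrier
          } else {
              while (g_generation == gen)     // spin until the barrier opens
                  ;
          }
      }
      __syncthreads();                        // release the rest of the block
  }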


In theory, the granularity of the vector needs to be no higher than 32
vector elements to be efficient - that is how the hardware
multithreading works. In practice, for the given use case, you will find
yourself thrashing the instruction cache if you use too many divergent
warps. Two, perhaps three, completely different code paths on each MP
work well.

192 or 256 threads are a minimum to hide instruction latency, leading to
the conclusion that the effective vector as seen by the outside world
needs to be at most 128 elements wide (256/2, which is what I currently
use) and possibly as low as 64 (192 threads / 3 code paths).
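
(The same arithmetic as a launch-geometry sketch - figures and names are
illustrative only, not taken from any real plugin:)

  // 256 resident threads hide instruction latency; two divergent code paths
  // give an effective vector of 128 instances as seen from the outside.
  #define THREADS_PER_BLOCK 256
  #define CODE_PATHS          2
  #define VECTOR_WIDTH (THREADS_PER_BLOCK / CODE_PATHS)   // 128

  __global__ void mp_kernel(float *buffers)
  {
      int path = threadIdx.x / VECTOR_WIDTH;   // 0 or 1: which code path
      int lane = threadIdx.x % VECTOR_WIDTH;   // which of the 128 instances

      // VECTOR_WIDTH is a multiple of the 32-thread warp, so each warp takes
      // exactly one branch and nothing diverges *inside* a warp; only the
      // instruction cache has to hold both code paths.
      if (path == 0) {
          buffers[lane] = 0.0f;                  // placeholder: path A, e.g. voices
      } else {
          buffers[VECTOR_WIDTH + lane] = 0.0f;   // placeholder: path B, e.g. effects
      }
  }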


> I'll probably now test whether using CUDA directly would give some benefit.
> Maybe we can share some ideas?

CUDA is nice :)

> Stéphane 
> 
/j

-- 
every time you use a parallel fifth
bach kills a kitten.

 http://www.youtube.com/watch?v=43RdmmNaGfQ



Re: [LAD] Linux-audio-dev Digest, Vol 44, Issue 6

2010-10-12 Thread Stéphane Letz

On 12 Oct 2010, at 20:11, Jens M Andreasen wrote:

> 
> On Tue, 2010-10-12 at 16:29 +0200, Stéphane Letz wrote:
>>> 
> 
>> I've done some tests using OpenCL in the context of the Faust project
>> (http://faust.grame.fr/). So far the results are not really good, and I
>> guess CUDA/OpenCL will be usable only in specific cases. 
> 
> What kinds of parallelism have you been exploring? 

Well, Faust is able to generate a DAG of separate loops, some of which are 
data-parallelizable, others not (recursive).
Right now I'm testing a simple strategy where the DAG is reduced to a sequence 
of "group of parallel loops" slices. Sync points are added between slices.

Data-parallelizable loops are not yet correctly handled, which obviously has 
to be done. So the current model is basically task parallelism and is quite 
naive...
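
(To make the strategy concrete, a naive host-side sketch - hypothetical loop
kernels, not the actual Faust-generated backend: every loop of a slice is
launched as a kernel, independent loops of the same slice can go into separate
streams, and a synchronize call is the sync point between slices:)

  // Hypothetical loops compiled from the Faust DAG
  __global__ void loop_osc (float *buf, int n) { /* data-parallel loop   */ }
  __global__ void loop_env (float *buf, int n) { /* data-parallel loop   */ }
  __global__ void loop_filt(float *buf, int n) { /* recursive, one block */ }

  void compute_block(float *buf, int n, cudaStream_t s0, cudaStream_t s1)
  {
      // slice 1: two independent loops, launched into separate streams
      loop_osc<<<8, 256, 0, s0>>>(buf, n);
      loop_env<<<8, 256, 0, s1>>>(buf, n);
      cudaThreadSynchronize();              // sync point between slices

      // slice 2: a loop that depends on the results of slice 1
      loop_filt<<<1, 256>>>(buf, n);
      cudaThreadSynchronize();
  }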

> 
> I have found that the multiple channel strip approach with mixdown to
> subgroups is straightforward for a DAW, as well as for polysynths.
> 
> Global sync points between multiprocessors - for cascaded processing -
> work up to a limit, after which the squared cost of the sync eats up the
> computational value of the added multiprocessor. I am syncing the six MPs
> on a GT220 every 16 samples at 96 kHz with a penalty in the 5% range. On
> higher-end cards this approach is not very useful, though, and it is also
> discouraged by Nvidia staffers ...
> 
> 
> In theory, the granularity of the vector needs to be no higher than 32
> vector elements to be efficient - that is how the hardware
> multithreading works. In practice, for the given use case, you will find
> yourself thrashing the instruction cache if you use too many divergent
> warps. Two, perhaps three, completely different code paths on each MP
> work well.
> 
> 192 or 256 threads are a minimum to hide instruction latency, leading to
> the conclusion that the effective vector as seen by the outside world
> needs to be at most 128 elements wide (256/2, which is what I currently
> use) and possibly as low as 64 (192 threads / 3 code paths).

Well, you obviously have a lot of practical knowledge that I don't. Any code 
samples you could share?
> 
> 
>> I'll probably now test whether using CUDA directly would give some benefit.
>> Maybe we can share some ideas?
> 
> CUDA is nice :)

So I'll try.

Thanks

Stéphane



Re: [LAD] Linux-audio-dev Digest, Vol 44, Issue 6

2010-10-12 Thread Jens M Andreasen

On Tue, 2010-10-12 at 20:30 +0200, Stéphane Letz wrote:

> 
> Well, you obviously have a lot of practical knowledge that I don't. Any
> code samples you could share?
> > 
Examples of what? 

I don't know where you are heading or what kind of hardware you are
considering - and specifically I do not believe the single-plugin
philosophy to be useful at all on the GPU... if that is where you wanted
to go.

BTW: Have you read the CUDA Programming Guide yet?


-- 
every time you use a parallel fifth
bach kills a kitten.

 http://www.youtube.com/watch?v=43RdmmNaGfQ



Re: [LAD] Linux-audio-dev Digest, Vol 44, Issue 6

2010-10-12 Thread Stéphane Letz

On 12 Oct 2010, at 22:10, Jens M Andreasen wrote:

> 
> On Tue, 2010-10-12 at 20:30 +0200, Stéphane Letz wrote:
> 
>> 
>> Well, you obviously have a lot of practical knowledge that I don't. Any
>> code samples you could share?
>>> 
> Examples of what? 

Example of CUDA used for audio. 
> 
> I don't know where you are heading or what kind of hardware you are
> considering - and specifically I do not believe the single-plugin
> philosophy to be useful at all on the GPU... if that is where you wanted
> to go.

The point for us is to find out whether a general DSP programming language like 
Faust can benefit from a CUDA/OpenCL backend, obviously to be used for heavy 
algorithms that exhibit massive data and task parallelism.

> 
> BTW: Have you read the CUDA Programming Guide yet?

Started, yes.

Thanks

Stéphane



Re: [LAD] Linux-audio-dev Digest, Vol 44, Issue 6

2010-10-12 Thread Jens M Andreasen

On Wed, 2010-10-13 at 07:09 +0200, Stéphane Letz wrote:

> > Examples of what? 
> 
> Example of CUDA used for audio. 

If we for now ignore getting data in and out - which I understand you
are reading up on now - there are two main use cases:

a) Vertical signal flow: The signal flows from the top of each thread,
starting at a location in shared memory chosen by indexing, and ends up
in another, fixed and known location, also in shared memory. Shared
memory works like a switchboard with 4096 plugs, where any output can be
the input of any other. In this case the code is identical to what
you'll find in textbooks or floating around in DSP forums, including
this one. The difference is that you'll get 128 instances of each
functionality rather than just one.
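
(A minimal sketch of case (a) - invented names and a textbook one-pole filter,
so only the shape of the code matters: the routing is nothing more than an
index into shared memory, and the same few lines of DSP run as 128 instances:)

  // nominal figures: 16 KB of shared memory seen as 4096 float "plugs"
  // (in practice slightly fewer fit, since kernel arguments share that space)
  #define NPLUGS   4096
  #define OUT_BASE 2048        // fixed, known output plugs for this code path

  __global__ void channel_strips(const int *in_plug,  // routing: input plug per instance
                                 const float *coeff,  // per-instance filter coefficient
                                 float *state)        // per-instance filter state
  {
      __shared__ float plugs[NPLUGS];
      int i = threadIdx.x;                 // one of 128 instances

      // in a real kernel, earlier code paths would have filled 'plugs' by now;
      // the barrier just marks the hand-over
      __syncthreads();

      // textbook one-pole lowpass, per instance
      float x = plugs[in_plug[i]];         // input: any plug, chosen by indexing
      float y = state[i] + coeff[i] * (x - state[i]);
      state[i] = y;

      plugs[OUT_BASE + i] = y;             // output: fixed, known plug
      __syncthreads();                     // downstream code paths may now read it
  }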

b) Horizontal signal flow: The above approach does not work for things
like, say, delay lines (longer than a few samples) where you want each
instance to have its delay time individually parameterized. In this
case you will therefore first transform the outputs so that a collection
of vertical signals like:

  a A 1 $                        a b c d..
  b B 2 #                        A B C D..
  c C 3 !   becomes horizontal:  1 2 3 4..
  d D 4 ?                        $ # ! ?..
  : : : :

You will want to do this in chunks of 16 elements because this is how
the memory controller works. A code path driving 128 threads can then
address 8 unrelated global memory locations in parallel to store, as well
as to load, the right chunk(s) back into shared memory - after which the
individual threads can look left and right for fine-tuning, interpolation
and/or FIR filtering.
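
(And a sketch of case (b), again with invented names: 128 threads handle
16-sample chunks, so 8 groups of 16 threads each hit one instance's delay line
with a coalesced 16-element store and load. It assumes the write pointer
advances in multiples of the chunk size and delay times stay below the line
length:)

  #define NTHREADS  128        // instances handled by one code path
  #define CHUNK      16        // the memory controller's coalescing unit
  #define MAX_DELAY 4096       // per-instance delay line length (samples)

  // one delay line per instance, in global memory
  __device__ float delay_lines[NTHREADS][MAX_DELAY];

  // 'in' and 'out' stand in for wherever the vertical code paths keep their
  // signals; 'wp' is the common write pointer, advanced by CHUNK per call
  __global__ void delay_chunk(const float *in, float *out,
                              const int *delay, int wp)
  {
      // vertical layout, as produced by the synthesis code: vsig[time][instance]
      __shared__ float vsig[CHUNK][NTHREADS + 1];   // +1 pad avoids bank conflicts

      int i     = threadIdx.x;
      int lane  = i % CHUNK;               // position inside a 16-sample chunk
      int group = i / CHUNK;               // 8 groups of 16 threads

      for (int t = 0; t < CHUNK; ++t)      // fill the vertical block
          vsig[t][i] = in[t * NTHREADS + i];
      __syncthreads();

      // store: each group writes one instance's 16 consecutive samples, so the
      // block addresses 8 unrelated global locations, each access coalesced
      for (int j = group; j < NTHREADS; j += NTHREADS / CHUNK)
          delay_lines[j][(wp + lane) % MAX_DELAY] = vsig[lane][j];
      __syncthreads();

      // load back: same pattern, but from each instance's own delay time
      for (int j = group; j < NTHREADS; j += NTHREADS / CHUNK) {
          int rp = (wp + lane - delay[j] + MAX_DELAY) % MAX_DELAY;
          vsig[lane][j] = delay_lines[j][rp];
      }
      __syncthreads();

      // back in vertical form, each thread can look left and right in vsig
      // for interpolation or FIR filtering before writing its output
      for (int t = 0; t < CHUNK; ++t)
          out[t * NTHREADS + i] = vsig[t][i];
  }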


-- 
one, two, three ... tekno tekno??

http://www.youtube.com/watch?v=ZEgbW1FxR78



Re: [LAD] Linux-audio-dev Digest, Vol 44, Issue 6

2010-10-14 Thread Niels Mayer
On Tue, Oct 12, 2010 at 10:09 PM, Stéphane Letz  wrote:
> Example of CUDA used for audio.

From ( 
http://old.nabble.com/sound-processing-in-GPU-w--Nvidia-CUDA---(was-Re:-fm-synthesis-software-)-p28142820.html
):
GPU processing for sound via CUDA has already been done a little bit
in the Windows/Mac world:
http://www.acusticaudio.net/modules.php?name=Products&file=nebula3
http://www.kvraudio.com/forum/viewtopic.php?t=222978
http://www.kvraudio.com/forum/viewtopic.php?t=240824

http://www.nvidia.com/content/GTC/posters/2010/C01-Exploring-Recognition-Network-Representations-for-Efficient-Speech-Inference-on-the-GPU.pdf

C01 - Exploring Recognition Network Representations for Efficient
Speech Inference on the GPU
We explore two contending recognition network representations for
speech inference engines: the linear lexical model (LLM) and the
weighted finite state transducer (WFST) on NVIDIA GTX285 and GTX480
GPUs. We demonstrate that while an inference engine using the simpler
LLM representation evaluates 22x more transitions per second than the
advanced WFST representation, the simple structure of the LLM
representation allows 4.7-6.4x faster evaluation and 53-65x faster
operands gathering for each state transition. We illustrate that the
performance of a speech inference engine based on the LLM
representation is competitive with the WFST representation on highly
parallel GPUs.
Author: Jike Chong (Parasians, LLC)

http://www.nvidia.com/content/GTC/posters/2010/C02-Efficient-Automatic-Speech-Recognition-on-the-GPU.pdf

C02 - Efficient Automatic Speech Recognition on the GPU
Automatic speech recognition (ASR) technology is emerging as a
critical component in data analytics for a wealth of media data being
generated every day. ASR-based applications contain fine-grained
concurrency that has great potential to be exploited on the GPU.
However, the state-of-the-art ASR algorithm involves a highly parallel
graph traversal on an irregular graph with millions of states and
arcs, making efficient parallel implementations highly challenging. We
present four generalizable techniques including: dynamic data-gather
buffer, find-unique, lock-free data structures using atomics, and
hybrid global/local task queues. When used together, these techniques
can effectively resolve ASR implementation challenges on an NVIDIA
GPU.
Author: Jike Chong (Parasians, LLC)

-- Niels
http://nielsmayer.com
___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/listinfo/linux-audio-dev