Re: GPGPU progess

2017-06-03 Thread David Nadlinger via Digitalmars-d

On Saturday, 3 June 2017 at 23:13:09 UTC, Nicholas Wilson wrote:

Good idea, name of global needs to be known in advance though.


Why, and how would that be a non-trivial problem?

 — David


Re: GPGPU progess

2017-06-03 Thread Nicholas Wilson via Digitalmars-d

On Saturday, 3 June 2017 at 14:12:28 UTC, David Nadlinger wrote:

On Saturday, 3 June 2017 at 12:13:41 UTC, Nicholas Wilson wrote:
Alas no. __attribute__((target(...)) works because it 
targeting different targets of the _same_ backend, this 
targets different backends.


But surely you could just take the buffer with the object code 
emitted by the compute target backend and emit it as constant 
global data into the host llvm::Module?


 — David


Good idea, name of global needs to be known in advance though.


Re: GPGPU progess

2017-06-03 Thread David Nadlinger via Digitalmars-d

On Saturday, 3 June 2017 at 12:13:41 UTC, Nicholas Wilson wrote:
Alas no. __attribute__((target(...)) works because it targeting 
different targets of the _same_ backend, this targets different 
backends.


But surely you could just take the buffer with the object code 
emitted by the compute target backend and emit it as constant 
global data into the host llvm::Module?


 — David


Re: GPGPU progess

2017-06-03 Thread Nicholas Wilson via Digitalmars-d

On Saturday, 3 June 2017 at 11:58:14 UTC, Manu wrote:
Is it possible to convince the compiler to emit code built for 
the backend target directly into the same object file as the 
host code?

I feel like this should be possible, along the lines of
__attribute__((target(...)) to convince the compiler to 
generate code for a few functions with different targets than 
the module?




Alas no. __attribute__((target(...)) works because it targeting 
different targets of the _same_ backend, this targets different 
backends.


Using solutions as you suggest above introduces dependent build 
sequencing into the build script. Different build systems might 
prove to be more or less difficult to integrate cleanly, and 
many people use build-script generators which might need to 
learn a few new tricks.




Hence why I made reading from disk (with a predictable output 
name) a viable option.
Any build script trickery would simply just feed in slightly 
later down the pipeline of.


read file from disk -> in memory -> compute API -> binary -> 
execute


Also thats how using other kernels not written in D would work, 
just plug somewhere into the pipeline.



'Expertise' is possibly not the word I'd suggest ;)
But I'll have some established software by that time that I'd 
love to attempt to port, we can work through rough edges 
together when you're available. No rush.


Great. Heh, still more experience than I have. Looking forward to 
it, earliest would probably be August.




Re: GPGPU progess

2017-06-03 Thread Manu via Digitalmars-d
On 30 May 2017 at 19:54, Nicholas Wilson via Digitalmars-d <
digitalmars-d@puremagic.com> wrote:

>
> as for embedding in the binary a post build step that does
>
> ubyte[] ptx_code = import("kernels_cuda620_64.ptx");
>
> should be doable as should invoking ptxas and doing the same.
> Then proving a consistent naming convention is used the code can do its
> magic.
> Or the files could just be read from disk.


Is it possible to convince the compiler to emit code built for the backend
target directly into the same object file as the host code?
I feel like this should be possible, along the lines of
__attribute__((target(...)) to convince the compiler to generate code for a
few functions with different targets than the module?

Using solutions as you suggest above introduces dependent build sequencing
into the build script. Different build systems might prove to be more or
less difficult to integrate cleanly, and many people use build-script
generators which might need to learn a few new tricks.

Any input with your expertise with CUDA will be much appriciated.
>

'Expertise' is possibly not the word I'd suggest ;)
But I'll have some established software by that time that I'd love to
attempt to port, we can work through rough edges together when you're
available. No rush.


Re: GPGPU progess

2017-05-30 Thread Nicholas Wilson via Digitalmars-d

On Tuesday, 30 May 2017 at 08:14:16 UTC, Manu wrote:
On 30 May 2017 at 17:33, Nicholas Wilson via Digitalmars-d < 
digitalmars-d@puremagic.com> wrote:



On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:

How far are we from integration into LDC without using forked 
compilers?




The future is now!

https://forum.dlang.org/thread/zcfqujlgnultnqfks...@forum.dlang.org

https://github.com/ldc-developers/ldc/commit/69ad69e872f53c1
4c101e2c029c4757c4073f487
is the final commit from the stuff I've done prior to dconf.



Awesome stuff! That was fast :)

You're right, I'm using kernel<<<...>>>, and it's very 
convenient.


Yep, thats (one of the reasons) why CUDA is more successful than 
OpenCL and therefore one of the more powerful draws for those 
poor sods using OpenCL.


I looked briefly and realised that I had a lot of work to get 
running (as
you describe), so I stuck with my current setup for the moment 
:(




I figured, I'll get you using it eventually.

Is a <<<...>>> equivalent going to be possible in D, with 
kernel object fragments built into the binary together with the 
CPU code?


As I explained in my dconf presentation: the idea is to have
Queue q = ... ; // the equivalent of a CUDA stream

q.enqueue!kernel(sizes)(kernel_arguments);

where q.enqueue returns a callable that you then call with the 
arguments. It was modelled directly after CUDAs <<<...>>>


as for embedding in the binary a post build step that does

ubyte[] ptx_code = import("kernels_cuda620_64.ptx");

should be doable as should invoking ptxas and doing the same.
Then proving a consistent naming convention is used the code can 
do its magic.

Or the files could just be read from disk.



I'm definitely looking forward to action in this space, and the 
wiki to come online :)


Yeah once my thesis is done thing should start moving. Any input 
with your expertise with CUDA will be much appriciated.




Re: GPGPU progess

2017-05-30 Thread Manu via Digitalmars-d
On 30 May 2017 at 17:33, Nicholas Wilson via Digitalmars-d <
digitalmars-d@puremagic.com> wrote:

> On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:
>
>> How far are we from integration into LDC without using forked compilers?
>>
>
> The future is now!
>
> https://forum.dlang.org/thread/zcfqujlgnultnqfks...@forum.dlang.org
>
> https://github.com/ldc-developers/ldc/commit/69ad69e872f53c1
> 4c101e2c029c4757c4073f487
> is the final commit from the stuff I've done prior to dconf.
>

Awesome stuff! That was fast :)

You're right, I'm using kernel<<<...>>>, and it's very convenient.
I looked briefly and realised that I had a lot of work to get running (as
you describe), so I stuck with my current setup for the moment :(

Is a <<<...>>> equivalent going to be possible in D, with kernel object
fragments built into the binary together with the CPU code?

I'm definitely looking forward to action in this space, and the wiki to
come online :)


Re: GPGPU progess

2017-05-30 Thread Nicholas Wilson via Digitalmars-d

On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:
How far are we from integration into LDC without using forked 
compilers?


The future is now!

https://forum.dlang.org/thread/zcfqujlgnultnqfks...@forum.dlang.org

https://github.com/ldc-developers/ldc/commit/69ad69e872f53c14c101e2c029c4757c4073f487
is the final commit from the stuff I've done prior to dconf.


Re: GPGPU progess

2017-05-18 Thread Nicholas Wilson via Digitalmars-d

On Thursday, 18 May 2017 at 09:07:38 UTC, Nicholas Wilson wrote:
When ldc runs you will get a kernels_cudaxxx_yy.ptx (where xxx 
is the CUDA compute capability specified on the command line 
and yy is 32 or 64 for 32 or 64bit) which should fit somewhere 
into your existing C++ pipeline.


Whoops, that assumes you have a CUDA driver API pipeline in your 
C++ code, which if you're asking I'm not sure that you have.
If you're using the `kernel<<<...>>>(args)` form to launch you 
kernels then you are going to have a lot more work to do in D 
because you'll need to use the driver API 
(http://docs.nvidia.com/cuda/cuda-driver-api/#axzz4hQLA0Zdm)

You'll need to:
*get a device
*create a context from it
*get a stream on that context
*load the ptx module (possibly linking it with other modules, to 
resolve missing symbols).

*compile it for the device
*then launch a kernel from that module on that device, by name 
passing the arguments in a void*[].


The sad thing is that its still nice than OpenCL because in 
OpenCL you have to pass the runtime args (with sizes) one by one 
to a function.


Hence why I want to automate as much of that shit as is possible.
 I hope to have that done ASAP, but I don't have hardware set up 
to test CUDA at the moment (I have one on my windows box but I 
don't have dev set up there) and I'll be working on OpenCL at the 
same time (and theres only so much horrible API I can take in a 
day).
 I'll be working on dcompute part-part-time next semester though 
so I should be able to get a fair bit done and quite a few others 
are interested so that'll speed thing up a bit.


Re: GPGPU progess

2017-05-18 Thread Nicholas Wilson via Digitalmars-d

On Thursday, 18 May 2017 at 05:39:52 UTC, Manu wrote:
I just caught up on the dconf material. It was a really 
interesting year, and I'm super bummed I missed it!


So I'm just starting out on some GPGPU work, and I love the 
direction we're going here.


Given this scenario;
I want to write CUDA kernels in D, and execute them in an 
existing C++ CUDA

app.


Can do already although not all the intrinsics are there yet, but 
they are very simple to add. See 
https://github.com/libmir/dcompute/blob/master/source/dcompute/std/cuda/index.d for how easy it is (PRs welcome!)


Can anyone point me at the getting started material, how do I 
achieve this

seemingly simple goal?


If you're only interested in CUDA I think any LLVM 3.9.1 or 4.0 
should work, If you want SPIRV get my LLVM from 
https://github.com/thewilsonator/llvm/tree/compute


Get the dcompute branch of LDC ( 
https://github.com/ldc-developers/ldc/tree/dcompute) and its 
submodules (i.e. druntime, I don't think it points to a branch 
anymore because Johan deleted that when I merged that into the 
regular ldc druntime but it will still work.) The dcompute branch 
of LDC is in the process of being merged into master so expect 
some breakage when updating to that.


Has anyone started on this content? Where are the toolchains, 
how to I

build a CUDA kernel, how do I link/load it in my C++ app?



There is some info on the dcompute wiki 
(https://github.com/libmir/dcompute/wiki) and I intend make this 
into "production quality" documentation (again PRs welcome or I 
can give you write access to the wiki), but I'm very busy at the 
moment.


When ldc runs you will get a kernels_cudaxxx_yy.ptx (where xxx is 
the CUDA compute capability specified on the command line and yy 
is 32 or 64 for 32 or 64bit) which should fit somewhere into your 
existing C++ pipeline.


If this doesn't exist, there needs to be a wiki with getting 
started guide covering various usage environments. I don't have 
a lot of time to spare on detail, so success will depend on my 
ability to hit the ground running, and that's really just a 
matter of simple clear reference material.


How far are we from integration into LDC without using forked 
compilers?


Soon™, it the first thing I'm going to do (if I don't get it done 
before) after I hand in my honours thesis at the start of July. 
I'm stupidly busy this semester so don't count on it done 
beforehand.


If you have any questions about anything probably the fastest way 
is the libmir or ldc gitter channels. 
(https://gitter.im/ldc-developers/main , 
https://gitter.im/libmir/public).


Good luck! I look forward to your experience report.
Nic



GPGPU progess

2017-05-17 Thread Manu via Digitalmars-d
I just caught up on the dconf material. It was a really interesting year,
and I'm super bummed I missed it!

So I'm just starting out on some GPGPU work, and I love the direction we're
going here.

Given this scenario;
I want to write CUDA kernels in D, and execute them in an existing C++ CUDA
app.
Can anyone point me at the getting started material, how do I achieve this
seemingly simple goal?
Has anyone started on this content? Where are the toolchains, how to I
build a CUDA kernel, how do I link/load it in my C++ app?

If this doesn't exist, there needs to be a wiki with getting started guide
covering various usage environments. I don't have a lot of time to spare on
detail, so success will depend on my ability to hit the ground running, and
that's really just a matter of simple clear reference material.

How far are we from integration into LDC without using forked compilers?