Re: listing template function overloads for use in compile time decisions

2014-04-28 Thread Atash via Digitalmars-d

On Tuesday, 29 April 2014 at 06:22:34 UTC, Atash wrote:
Let's say that we have a struct `A` that contains some template 
function named `fn` template-parameterized on its argument 
types:


struct A {
    ...
    void fn(T)(auto ref T a) { ... }
    ...
}

I cannot get a handle on `fn` as an alias when searching for overloads of the string "fn" in `A` via `__traits(getOverloads, A, "fn")`. This makes sense, obviously, because `fn` doesn't really 'exist' as a function; it's only a template.


But the compiler can, nevertheless, generate a proper implementation of `fn` depending on its argument(s), and it can even tell whether a call would resolve without creating the code, as with e.g. `__traits(compiles, A.init.fn(0))`. While it doesn't make sense to list overloads of a given name when they're templates, it does make sense that, given some argument types and qualifiers, all candidate functions could be enumerated. I can't find such a feature, however.


Moreover, I cannot figure out how one could acquire a handle on 
even *just* the best match for a given function name with some 
given argument types.


With this, library writers could perform their own overload resolution without enforcing the use of wrapper classes when trying to plug one library into another. For example: lib A has struct A with `fn`, lib B has struct B with `fn`, the two `fn`s accept each other's structs, and we want to choose the one that partially specializes on the other over the one that doesn't. It's basically the decision process behind the rewriting that occurs with a.opCmp(b) vs. b.opCmp(a), but fully emulated in the presence of templates, without extra client-code-side hints, and taking into account granularity finer than the four levels of overload resolution. It moves glue code (or glue behavior, like argument ordering to a library function) from the user to the library writer, and allows that glue code to be generic.


Is this a facility that is present in D, and I missed it? Are 
any of the above bulleted use-cases manageable with present-day 
D?


I'm kind of an obsessive metaprogramming fiend, so this little issue is strangely vexing. I've come up with an idea for a solution, and am attempting to implement it, but it's extraordinarily hackish. It assumes that the name `fn`, when called, can be entirely resolved from its arguments by looking at the arguments' types and their template parameters (if any) and implicit target conversions (and their template parameters [if any]). I'm seeing in my head a combinatorial blow-up from the possible orderings of template arguments in the template declaration of `fn`, so... yeah. I kinda would like a __traits thing that gets all possible resolutions of a symbol in a call expression.


Thanks for your time~!


Ignore the 'bulleted' bit. I edited that preceding section from a 
list to a paragraph.


listing template function overloads for use in compile time decisions

2014-04-28 Thread Atash via Digitalmars-d
Let's say that we have a struct `A` that contains some template 
function named `fn` template-parameterized on its argument types:


struct A {
    ...
    void fn(T)(auto ref T a) { ... }
    ...
}

I cannot get a handle on `fn` as an alias when searching for overloads of the string "fn" in `A` via `__traits(getOverloads, A, "fn")`. This makes sense, obviously, because `fn` doesn't really 'exist' as a function; it's only a template.


But the compiler can, nevertheless, generate a proper implementation of `fn` depending on its argument(s), and it can even tell whether a call would resolve without creating the code, as with e.g. `__traits(compiles, A.init.fn(0))`. While it doesn't make sense to list overloads of a given name when they're templates, it does make sense that, given some argument types and qualifiers, all candidate functions could be enumerated. I can't find such a feature, however.


Moreover, I cannot figure out how one could acquire a handle on 
even *just* the best match for a given function name with some 
given argument types.
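
To make the ask concrete, here's a minimal sketch of what present-day D does let me express (the helper name `fnAccepts` is mine, purely illustrative):

struct A {
    void fn(T)(auto ref T a) {}
}

// Probing whether one *specific* call resolves is easy enough:
enum bool fnAccepts(Arg) = __traits(compiles, A.init.fn(Arg.init));
static assert(fnAccepts!int);

// The result type of that specific call is also recoverable:
alias Result = typeof(A.init.fn(0));

// But there's no trait that, given A, "fn", and an argument list, hands back
// the set of candidate templates (or even just the best match) as aliases.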


With this, library writers could perform their own overload resolution without enforcing the use of wrapper classes when trying to plug one library into another. For example: lib A has struct A with `fn`, lib B has struct B with `fn`, the two `fn`s accept each other's structs, and we want to choose the one that partially specializes on the other over the one that doesn't. It's basically the decision process behind the rewriting that occurs with a.opCmp(b) vs. b.opCmp(a), but fully emulated in the presence of templates, without extra client-code-side hints, and taking into account granularity finer than the four levels of overload resolution. It moves glue code (or glue behavior, like argument ordering to a library function) from the user to the library writer, and allows that glue code to be generic.


Is this a facility that is present in D, and I missed it? Are any 
of the above bulleted use-cases manageable with present-day D?


I'm kind of an obsessive metaprogramming fiend, so this little issue is strangely vexing. I've come up with an idea for a solution, and am attempting to implement it, but it's extraordinarily hackish. It assumes that the name `fn`, when called, can be entirely resolved from its arguments by looking at the arguments' types and their template parameters (if any) and implicit target conversions (and their template parameters [if any]). I'm seeing in my head a combinatorial blow-up from the possible orderings of template arguments in the template declaration of `fn`, so... yeah. I kinda would like a __traits thing that gets all possible resolutions of a symbol in a call expression.
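
For reference, a couple of the building blocks that hack leans on do already exist in std.traits; here's a rough sketch of just the enumeration step (Wrapper is a made-up type, and this is nowhere near the full solution):

import std.traits : ImplicitConversionTargets;

struct Wrapper(T) {}

// Every type a Wrapper!int value implicitly converts to (possibly none here,
// since Wrapper has no alias this):
alias Targets = ImplicitConversionTargets!(Wrapper!int);

// Pulling a candidate type's template parameters back out needs no library
// support at all:
static if (is(Wrapper!int == Wrapper!Args, Args...))
    pragma(msg, Args.stringof); // Args is the extracted sequence, here (int)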


Thanks for your time~!


Re: s/type tuple/template pack/g please

2013-08-22 Thread Atash

On Wednesday, 21 August 2013 at 18:50:30 UTC, Ali Çehreli wrote:

On 08/21/2013 11:40 AM, Atash wrote:

> I don't see wording 'template pack' being problematic, assuming that
> there's really no other way to use them but through templates (which
> AFAIK they can't).

TypeTuple can represent function parameter lists and array 
literal lists as well:


Under the assumption that the following was meant to be a 
counterpoint...



import std.typetuple;

void func(int, int)
{}

struct S(int a, int b)
{}

void main()
{
alias tt = TypeTuple!(1, 2);


^ I'm not seeing the counterpoint here, given that TypeTuple is itself a template.

== What are we *actually* talking about?

I feel like this wasn't well defined, because now I'm feeling mildly confused about where the discussion has gone.


template Stuff(A...) {  } // A: a sequence/tuple of template parameters

alias TypeTuple!() B; // B: std.typetuple's TypeTuple
TypeTuple!() C; // C: a value tuple

I'm inclined to say that the debate is currently about A, but 
just to make sure... are we talking about A, B, C, or something 
else entirely?


== Annnddd more general stuff

Under the assumption that we are talking about A in the section 
above...


IMO, it's almost inevitable that whatever noun you guys decide on 
is going to be preceded by the word 'template' in conversation 
and in written communication if it isn't there already. While 
that sort of casual wording may be easily relegated to a synonym, 
it's still worth thinking about, methinks.


Re: [dox] C++ Interfaces

2013-08-20 Thread Atash

On Tuesday, 20 August 2013 at 14:55:49 UTC, Wyatt wrote:
For the time being, I'd make a pull request that changes the 
link on the interface page to the correct URI.  That's the bare 
minimum.


-Wyatt


H'okay, so, I've cloned DMD, Druntime, Phobos, and dlang.org from GitHub. Running `make -f posix.mak` in dlang.org results in numerous compilation errors for the D files after invocations of `dmd -d -c -o- -version [yaddayaddayadda]`.


I also happen to be using version 2.063 of dmd from my Ubuntu 
repository...


I realize this may sound like a silly question, but... were breaking changes introduced between 2.063 and 2.064, such that I should take the effort to oust my current system installation of DMD in favor of the one from GitHub?


Re: GPGPU and D

2013-08-20 Thread Atash

On Sunday, 18 August 2013 at 19:02:50 UTC, luminousone wrote:

On Sunday, 18 August 2013 at 08:40:33 UTC, Russel Winder wrote:

Luminousone, Atash, John,

Thanks for the email exchanges on this, there is a lot of good 
stuff in
there that needs to be extracted from the mail threads and 
turned into a
"manifesto" type document that can be used to drive getting a 
design and
realization together. The question is what infrastructure 
would work for
us to collaborate. Perhaps create a GitHub group and a 
repository to act

as a shared filestore?

I can certainly do that bit of admin and then try and start a 
document
summarizing the email threads so far, if that is a good way 
forward on

this.


GitHub seems fine to me. My coding skills are likely more limited than Atash's or John's; I am currently working as a student programmer at Utah's Weber State University while also attending as a part-time student. I am currently finishing the last couple credit hours of the associate degree in CS, and I would like to work towards a higher-level degree or even eventually a PhD in CS.


I will help in whatever way I can however.


You give me too much credit. :-P

I'm yet another student, technically in biomedical engineering 
but with a very computer-science-y mind. My experience with 
OpenCL has been limited to a few stints into some matrix 
operations and implementing that sort I linked earlier for the 
sake of a max-reduction operation found in a GPGPU implementation 
of support-vector machine classifiers. Frankly, I *hate* *hate* 
*hate* boilerplate, so I paradoxically spend all my time trying 
to get it out of the way so I never need to write it again. 
Decent for code-prettiness, horrid for deadlines. That said...


I'm hesitant to start anything new until I've cleared my plate of 
at least one of my current projects, so while I am very 
interested in jumping on this, I'm going to have to pass on doing 
anything serious with it for the next several weeks. -.-'


[dox] C++ Interfaces

2013-08-20 Thread Atash

This link: http://dlang.org/CPP-Interfaces

appears on this page: http://dlang.org/interface.html

And I *think* (not sure) that since whatever it was originally 
pointing to has disappeared, it should probably end up pointing 
here:


http://dlang.org/cpp_interface.html

Maybe add an anchor to about halfway down the page, so it'd look 
more like:


http://dlang.org/cpp_interface.html#extern-CPP-interfaces



That all said...

Is my assessment correct/proper?

If so, in the great words of my esteemed predecessors from the 
vast wilds of newbie-dom... "What do?"


Re: GPGPUs

2013-08-18 Thread Atash
I'm not sure if 'problem space' is the industry standard term (in 
fact I doubt it), but it's certainly a term I've used over the 
years by taking a leaf out of math books and whatever my 
professors had touted. :-D I wish I knew what the standard term 
was, but for now I'm latching onto that because it seems to 
describe at a high implementation-agnostic level what's up, and 
in my personal experience most people seem to 'get it' when I use 
the term - it has empirically had an accurate connotation.


That all said, I'd like to know what the actual term is, too. -.-'

On Sunday, 18 August 2013 at 08:21:18 UTC, luminousone wrote:
I chose the term aggregate, because it is the term used in the 
description of the foreach syntax.


foreach( value, key ; aggregate )

The aggregate being an array or range, the term seems to fit: even when the aggregate is an array, you still implicitly have a range "0 .. array.length", and the foreach creates a key or index position in addition to the value.


A wrapped function could very easily be similar to the intended 
initial outcome


void example( ref float a[], float b[], float c[] ) {
   foreach( v, k ; a ) {
      a[k] = b[k] + c[k];
   }
}

is functionally the same as

void example( aggregate ref float a[] ; k, float b[], float c[] ) {
   a[k] = b[k] + c[k];
}

Maybe : would make more sense than ; but I am not sure as to the best way to represent that index value.


Aye, that makes awesome sense, but I'm left wishing that there 
was something in that syntax to support access to local/shared 
memory between work-items. Or, better yet, some way of hinting at 
desired amounts of memory in the various levels of the non-global 
memory hierarchy and a way of accessing those requested 
allocations.


I mean, I haven't *seen* anyone do anything device-wise with more 
hierarchical levels than just global-shared-private, but it's 
always bothered me that in OpenCL we could only specify memory 
allocations on those three levels. What if someone dropped in 
another hierarchical level? Suddenly it'd open another door to 
optimization of code, and there'd be no way for OpenCL to access 
it. Or what if someone scrapped local memory altogether, for 
whatever reason? The industry may never add/remove such memory 
levels, but, still, it just feels... kinda wrong that OpenCL 
doesn't have an immediate way of saying, "A'ight, it's cool that 
you have this, Mr. XYZ-compute-device, I can deal with it," 
before proceeding to put on sunglasses and driving away in a 
Ferrari. Or something like that.


Re: GPGPUs

2013-08-18 Thread Atash

On Sunday, 18 August 2013 at 06:22:30 UTC, luminousone wrote:
The Xeon Phi is interesting insofar as it takes generic programming to a more parallel environment. However, it has some serious limitations that will heavily damage its potential performance.


AVX2 is completely the wrong path to go about improving performance in parallel computing. The SIMD nature of this instruction set means that scalar operations, or even just not being able to fill the giant 256/512-bit registers, waste huge chunks of this thing's peak theoretical performance, and if any rules apply to instruction pairing on this multi-issue pipeline you have yet more potential for wasted cycles.


I haven't seen anything about Intel's micro-thread scheduler, or how these chips handle the mass context switching natural to micro-threaded environments. These two items make a huge difference in performance; comparing Radeon VLIW5/4 to Radeon GCN is a good example, where most of the performance benefit of GCN comes from the ease of scheduling scalar pipelines over more complex pipes with instruction pairing rules etc.


Frankly, Intel has some cool stuff, but they have been caught with their pants down: they have depended on their large fab advantage to carry them over, and got lazy.


We likely are watching AMD64 all over again.


Well, I can't argue that one.

As a first step, simply a different way of approaching std.parallelism-like functionality, with an eye to GPGPU in the future when easy integration solutions pop up (such as HSA).


I can't argue with that either.

It would be best to wait for a more generic software platform, 
to find out how this is handled by the next generation of micro 
threading tools.


The way OpenCL/CUDA work reminds me too much of someone setting up Tomcat to have Java code generate PHP that runs on their Apache server, just because they can. I would rather have tighter integration with the core language than have a language within a language.


Fair point. I have my own share of idyllic wants, so I can't 
argue with those.


Low-level optimization is a wonderful thing, but I almost wonder if this will always be something where, in order to do the low-level optimization, you will be using the vendor's provided platform for doing it, as no generic tool will be able to match the custom one.


But OpenCL is by no means a 'custom tool'. CUDA, maybe, but 
OpenCL just doesn't fit the bill in my opinion. I can see it 
being possible in the future that it'd be considered 'low-level', 
but it's a fairly generic solution. A little hackneyed under your 
earlier metaphors, but still a generic, standard solution.


Most of my interaction with the GPU is via shader programs for OpenGL. I have only lightly used CUDA for some image processing software, so I am certainly not the one to give in-depth detail on optimization strategies.


There was a *lot* of stuff that opened up when vendors dumped 
GPGPU out of Pandora's box. If you want to get a feel for some 
optimization strategies and what they require, check this site 
out: http://www.bealto.com/gpu-sorting_intro.html (and I hope I'm 
not insulting your intelligence here, if I am, I truly apologize).



Sorry, on point 1 that was a typo; I meant

1. The range must be known prior to execution of a gpu code 
block.


as for

3. Code blocks can only receive a single range, it can however 
be multidimensional


int a[100] = [ ... ];
int b[100];

void example3( aggregate in range r ; k, in float a[], float b[] ) {
   b[k] = a[k];
}

example3( 0 .. 100 , a, b );

This function would be executed 100 times.

int a[10_000] = [ ... ];
int b[10_000];

void example3( aggregate in range r ; kx, aggregate in range r2 ; ky, in float a[], float b[] ) {
   b[kx+(ky*100)] = a[kx+(ky*100)];
}

example3( 0 .. 100 , 0 .. 100 , a, b );

This function would be executed 10,000 times, the two aggregate ranges being treated as a single two-dimensional range.


Maybe a better description of the rule would be that multiple 
ranges are multiplicative, and functionally operate as a single 
range.


OH.

I think I was totally misunderstanding you earlier. The 
'aggregate' is the range over the *problem space*, not the values 
being punched into the problem. Is this true or false?


(if true I'm about to feel incredibly sheepish)


Re: GPGPUs

2013-08-17 Thread Atash

On Sunday, 18 August 2013 at 03:55:58 UTC, luminousone wrote:
You do have limited Atomics, but you don't really have any sort 
of complex messages, or anything like that.


I said 'point 11', not 'point 10'. You also dodged points 1 and 
3...


Intel doesn't have a dog in this race, so there is no way to know what they plan on doing, if anything at all.


http://software.intel.com/en-us/vcsource/tools/opencl-sdk
http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html

Just based on those, I'm pretty certain they 'have a dog in this 
race'. The dog happens to be running with MPI and OpenCL across a 
bridge made of PCIe.


The reason to point out HSA is because it is really easy to add support for; it is not a giant task like OpenCL would be. A few changes to the front-end compiler are all that is needed; LLVM's backend does the rest.


H'okay. I can accept that.

OpenCL isn't just a library; it is a language extension that is run through a preprocessor that compiles the embedded __KERNEL and __DEVICE functions into usable code, and then outputs .c/.cpp files for the C compiler to deal with.


But all those extra bits are part of the computing *environment*. 
Is there something wrong with requiring the proper environment 
for an executable?


A more objective question: which devices are you trying to target 
here?


Those are all platform-specific; they change based on the whim and fancy of NVIDIA and AMD with each and every new chip released: the size and configuration of CUDA clusters, or compute clusters, or EUs, or whatever the hell chip maker X feels like using at the moment.


Long term this will all be managed by the underlying support 
software in the video drivers, and operating system kernel. 
Putting any effort into this is a waste of time.


Yes. And the only way to optimize around them is to *know them*, 
otherwise you're pinning the developer down the same way OpenMP 
does. Actually, even worse than the way OpenMP does - at least 
OpenMP lets you set some hints about how many threads you want.



void example( aggregate in float a[] ; key , in float b[], out float c[] ) {
   c[key] = a[key] + b[key];
}

example(a,b,c);

In the function declaration you can think of the aggregate as basically having the reverse order of the items in a foreach statement.


int a[100] = [ ... ];
int b[100];
foreach( v, k ; a ) { b = a[k]; }

int a[100] = [ ... ];
int b[100];

void example2( aggregate in float A[] ; k, out float B[] ) { B[k] = A[k]; }


example2(a,b);


Contextually solid. Read my response to the next bit.

I am pretty sure they are simply multiplying the index value by 
the unit size they desire to work on


int a[100] = [ ... ];
int b[100];
void example3( aggregate in range r ; k, in float a[], float b[] ) {
   b[k]   = a[k];
   b[k+1] = a[k+1];
}

example3( 0 .. 50 , a,b);

Then likely they are simply executing multiple __KERNEL 
functions in sequence, would be my guess.


I've already implemented this algorithm in OpenCL, and what you're saying so far doesn't rhyme with what's needed.


There are at least two ranges, one keeping track of partial summations, the other holding the partial sorts. Three separate kernels are run in cycles to reduce over and scatter the data. The way early exit is implemented isn't mentioned as part of the implementation details, but my implementation of the strategy requires a third range to act as a flag array to be reduced over and read in between kernel invocations.


It isn't just unit size multiplication - there's communication 
between work-items and *exquisitely* arranged local-group 
reductions and scans (so-called 'warpscans') that take advantage 
of the widely accepted concept of a local group of work-items (a 
parameter you explicitly disregarded) and their shared memory 
pool. The entire point of the paper is that it's possible to come 
up with a general algorithm that can be parameterized to fit 
individual GPU configurations if desired. This kind of algorithm 
provides opportunities for tuning... which seem to be lost, 
unnecessarily, in what I've read so far in your descriptions.
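
(For anyone following along: the 'scan' here is just a prefix sum computed cooperatively inside one local group. A plain-D sketch of the access pattern, serial and with an arbitrary group size, since the real thing lives in OpenCL with barriers between steps:)

// Hillis-Steele-style inclusive scan over one "local group" worth of data.
// On the GPU each iteration of the inner loop is one work-item, with a
// barrier between outer steps; here it's just ordinary serial D.
void localInclusiveScan(int[] group)
{
    for (size_t offset = 1; offset < group.length; offset *= 2)
    {
        auto prev = group.dup; // stands in for the double-buffered local memory
        foreach (i; offset .. group.length)
            group[i] = prev[i] + prev[i - offset];
    }
}

unittest
{
    auto g = [1, 2, 3, 4];
    localInclusiveScan(g);
    assert(g == [1, 3, 6, 10]);
}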


My point being, I don't like where this is going by treating 
coprocessors, which have so far been very *very* different from 
one another, as the same batch of *whatever*. I also don't like 
that it's ignoring NVidia, and ignoring Intel's push for 
general-purpose accelerators such as their Xeon Phi.


But, meh, if HSA is so easy, then it's low-hanging fruit, so 
whatever, go ahead and push for it.


=== REMINDER OF RELEVANT STUFF FURTHER UP IN THE POST:

"A more objective question: which devices are you trying to 
target here?"


=== AND SOMETHING ELSE:

I feel like we're just on different wavelengths. At what level do 
you imagine having this support, in terms of support for doing 
low-level things? Is this something like OpenMP, where threading 
and such are done at a really (really really really...) high 
level, or what?

Re: GPGPUs

2013-08-17 Thread Atash
A unified virtual address space I can accept, fine. Ignoring that it is, in fact, a totally different address space where memory latency is *entirely different* is what I'm far, *far* more iffy about.



We basically have to follow these rules,

1. The range must be none prior to execution of a gpu code block
2. The range can not be changed during execution of a gpu code 
block
3. Code blocks can only receive a single range, it can however 
be multidimensional

4. index keys used in a code block are immutable
5. Code blocks can only use a single key(the gpu executes many 
instances in parallel each with their own unique key)

6. indexes are always an unsigned integer type
7. openCL/CUDA have no access to global state
8. gpu code blocks can not allocate memory
9. gpu code blocks can not call cpu functions
10. atomics, though available on the gpu, are many times slower than on the cpu
11. separate running instances of the same code block on the 
gpu can not have any interdependency on each other.


Please explain point 1 (specifically the use of the word 'none'), and why you added point 3.


Additionally, point 11 doesn't make any sense to me. There is research out there showing how to use cooperative warp-scans, for example, to have multiple work-items cooperate over some local block of memory and perform sorting in blocks. There are even tutorials out there for OpenCL and CUDA that show how to do this, specifically to create better-performing code. This statement is in direct contradiction with what exists.


Now if we are talking about HSA, or other similar setup, then a 
few of those rules don't apply or become fuzzy.


HSA does have limited access to global state, HSA can call cpu functions that are pure, and of course, because in HSA the cpu and gpu share the same virtual address space, most of memory is open for access.


HSA also manages memory via the hMMU, and there is no need for gpu memory management functions, as that is managed by the operating system and video card drivers.


Good for HSA. Now why are we latching onto this particular 
construction that, as far as I can tell, is missing the support 
of at least two highly relevant giants (Intel and NVidia)?


Basically, D would either need to opt out of legacy APIs such as OpenCL, CUDA, etc. (these are mostly tied to C/C++ anyway, and generally have ugly-as-sin syntax), or D would have to go the route of a full and safe gpu subset of features.


Wrappers do a lot to change the appearance of a program. Raw 
OpenCL may look ugly, but so do BLAS and LAPACK routines. The use 
of wrappers and expression templates does a lot to clean up code 
(ex. look at the way Eigen 3 or any other linear algebra library 
does expression templates in C++; something D can do even better).
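
As a toy illustration of the sort of cleanup I mean (every name here is made up; this is not a proposal): operator overloading can record the arithmetic as a type instead of executing it, which is the same trick Eigen pulls in C++, just with less noise in D.

// Toy expression-template sketch: a + b builds a type describing the
// operation instead of computing a value.
struct Add(L, R)
{
    L lhs;
    R rhs;
    auto opBinary(string op, T)(T r) if (op == "+")
    {
        return Add!(typeof(this), T)(this, r);
    }
}

struct Buf
{
    string name; // stands in for a device buffer handle
    auto opBinary(string op, T)(T r) if (op == "+")
    {
        return Add!(Buf, T)(this, r);
    }
}

unittest
{
    auto a = Buf("a"), b = Buf("b"), c = Buf("c");
    auto expr = a + b + c;
    // The whole computation is now a type a library can walk at compile time
    // to emit, say, OpenCL source, rather than something already evaluated.
    static assert(is(typeof(expr) == Add!(Add!(Buf, Buf), Buf)));
}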


I don't think such a setup can be implemented as simply a 
library, as the GPU needs compiled source.


This doesn't make sense. Your claim is contingent on opting out 
of OpenCL or any other mechanism that provides for the 
application to carry abstract instructions which are then 
compiled on the fly. If you're okay with creating kernel code on 
the fly, this can be implemented as a library, beyond any 
reasonable doubt.
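
Concretely (and purely as a sketch; `elementwiseKernel` is a name I just made up), 'creating kernel code on the fly' can be as dumb as assembling OpenCL C source with std.format and handing it to clCreateProgramWithSource/clBuildProgram through whichever binding one happens to use:

import std.format : format;

// Build OpenCL C source for `c[i] = a[i] OP b[i]` at runtime.
string elementwiseKernel(string name, string op)
{
    return format(q{
        __kernel void %s(__global const float* a,
                         __global const float* b,
                         __global float* c)
        {
            size_t i = get_global_id(0);
            c[i] = a[i] %s b[i];
        }
    }, name, op);
}

// e.g. elementwiseKernel("vecAdd", "+") yields a kernel string the driver can compile.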


If D were to implement gpgpu features, I would actually suggest starting by simply adding a microthreading function syntax, for example...


void example( aggregate in float a[] ; key , in float b[], out float c[] ) {
   c[key] = a[key] + b[key];
}

By adding an aggregate keyword to the function, we can assume the range simply from the length of a[], without adding an extra set of brackets or something similar.


This would make access to the gpu more generic, and more importantly, because LLVM will support HSA, it removes the need for writing more complex support into dmd as OpenCL and CUDA would require; a few hints for the LLVM backend would be enough to generate the dual-bytecode ELF executables.


1) If you wanted to have that 'key' nonsense in there, I'm 
thinking you'd need to add several additional parameters: global 
size, group size, group count, and maybe group-local memory 
access (requires allowing multiple aggregates?). I mean, I get 
the gist of what you're saying, this isn't me pointing out a 
problem, just trying to get a clarification on it (maybe give 
'key' some additional structure, or something).


2) ... I kind of like this idea. I disagree with how you led up 
to it, but I like the idea.


3) How do you envision *calling* microthreaded code? Just the 
usual syntax?


4) How would this handle working on subranges?

ex. Let's say I'm coding up a radix sort using something like 
this:


https://sites.google.com/site/duanemerrill/PplGpuSortingPreprint.pdf?attredirects=0

What's the high-level program organization with this syntax if we 
can only use one range at a time? How many work-items get fired 
off? What's the gpu-code launch procedure?


Re: GPGPUs

2013-08-17 Thread Atash

On Saturday, 17 August 2013 at 15:37:58 UTC, deadalnix wrote:
Unified memory have too much benefice for it to be ignored. The 
open questions are about cache coherency, direct communication 
between chips, identical performances through the address 
space, and so on. But the unified memory will remains. Even 
when memory is physically separated, you'll see an unified 
memory model emerge, with disparate performances depending on 
address space.


I'm not saying 'ignore it', I'm saying that it's not the least 
common denominator among popular devices, and that in all 
likelihood it won't be the least common denominator among compute 
devices ever. AMD/ATi being 'right' doesn't mean that they'll 
dominate the whole market. Having two slots on your mobo is more 
limiting than having the ability to just chuck more computers in 
a line hidden behind some thin wrapper around some code built to 
deal with non-uniform memory access.


Additionally, in another post, I tried to demonstrate a way for 
it to target the least common denominator, and in my (obviously 
biased) opinion, it didn't look half bad.


Unlike uniform memory access, non-uniform memory access will 
*always* be a thing. Uniform memory access is cool n'all, but it 
isn't popular enough to be here now, and it isn't like 
non-uniform memory access which has a long history of being here 
and looks like it has a permanent stay in computing.


Pragmatism dictates to me here that any tool we want to be 
'awesome', eliciting 'wowzers' from all the folk of the land, 
should target the widest variety of devices while still being 
pleasant to work with. *That* tells me that it is paramount to 
*not* brush off non-uniform access, and that because non-uniform 
access is the least common denominator, that should be what is 
targeted.


On the other hand, if we want to start up some sort of thing 
where one lib handles the paradigm of uniform memory access in as 
convenient a way as possible, and another lib handles non-uniform 
memory access, that's fine too. Except that the first lib really 
would just be a specialization of the second alongside some more 
'convenience'-functions.


Re: GPGPUs

2013-08-16 Thread Atash

On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:

You can't mix cpu and gpu code, they must be separate.


H'okay, let's be clear here. When you say 'mix CPU and GPU code', 
you mean you can't mix them physically in the compiled executable 
for all currently extant cases. They aren't the same. I agree 
with that.


That said, this doesn't preclude having CUDA-like behavior where small functions could be written that don't violate the constraints of GPU code and simultaneously have semantics that could be executed on the CPU, and where such small functions are then allowed to be called from both CPU and GPU code.


However this still has the problem of the cpu having to generate CPU code from the contents of gpu{} code blocks, as the GPU is unable to allocate memory, so for example,


gpu{
auto resultGPU = dot(c, cGPU);
}

likely either won't work, or generates an array allocation in cpu code before the gpu block is otherwise run.


I'm fine with an array allocation. I'd 'prolly have to do it 
anyway.


Also, how does that dot product function know the correct index range to run on? Are we assuming it knows based on the length of a? While the syntax

c[] = a[] * b[];

is safe for this sort of call, a function is less safe to do this with; with function calls the range needs to be told to the function, and you would call this function without the gpu{} block as the function itself is marked.


auto resultGPU = dot$(0 .. 
returnLesser(cGPU.length,dGPU.length))(cGPU, dGPU);


'Dat's a point.

Remember, with GPUs you don't send instructions, you send whole programs, and the whole program must finish before you can move on to the next cpu instruction.


I disagree with the assumption that the CPU must wait for the GPU 
while the GPU is executing. Perhaps by default the behavior could 
be helpful for sequencing global memory in the GPU with CPU 
operations, but it's not a *necessary* behavior.


Well, I disagree with the assumption assuming said assumption is 
being made and I'm not just misreading that bit. :-P


=== Another thing...

I'm with luminousone's suggestion for some manner of function 
attribute, to the tune of several metric tonnes of chimes. Wind 
chimes. I'm supporting this suggestion with at least a metric 
tonne of wind chimes.


I'd prefer this (and some small number of helpers) rather than straight-up dumping a new keyword and block type into the language. I really don't think D *needs* to have this any lower-level than a library-based solution, because it already has the tools to make it ridiculously more convenient than C/C++ (not necessarily as much as CUDA's totally separate program, nvcc, does, but a huge amount).


ex.


@kernel auto myFun(BufferT)(BufferT glbmem)
{
  // brings in the kernel keywords and whatnot depending on __FUNCTION__
  // (because mixins eval where they're mixed in)
  mixin KernelDefs;
  // ^ and that's just about all the syntactic noise, the rest uses mixed-in
  //   keywords and the glbmem object to define several expressions that
  //   effectively record the operations to be performed into the return type

  // assignment into global memory recovers the expression type in the glbmem.
  glbmem[glbid] += 4;

  // This assigns the *expression* glbmem[glbid] to val.
  auto val = glbmem[glbid];

  // Ignoring that this has a data race, this exemplifies recapturing the
  // expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
  glbmem[glbid+1] = val;

  return glbmem; ///< I lied about the syntactic noise. This is the last bit.
}


Now if you want to, you can at runtime create an OpenCL-code 
string (for example) by passing a heavily metaprogrammed type in 
as BufferT. The call ends up looking like this:



auto promisedFutureResult = Gpu.call!myFun(buffer);


The kernel compilation (assuming OpenCL) is memoized, and the 
promisedFutureResult is some asynchronous object that implements 
concurrent programming's future (or something to that extent). 
For convenience, let's say that it blocks on any read other than 
some special poll/checking mechanism.


The constraints imposed on the kernel functions are generalizable enough to even execute the code on the CPU, as the launching call ( Gpu.call!myFun(buffer) ) can, instead of using an expression-buffer, just pass a normal array in and have the proper result pop out, given some interaction between the identifiers mixed in by KernelDefs and the launching caller (ex. using a loop).
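
To spell out the 'using a loop' part with a throwaway sketch (CpuLauncher and this particular calling convention are invented on the spot, not part of anything real):

// Hypothetical CPU-side launcher: run the kernel body once per work-item,
// handing it a plain array plus the global id that KernelDefs would otherwise
// mix in.
struct CpuLauncher
{
    static void call(alias kern, T)(T[] buffer, size_t globalSize)
    {
        foreach (glbid; 0 .. globalSize)
            kern(buffer, glbid);
    }
}

// A kernel-shaped function written against that convention.
void myFunOnCpu(T)(T[] glbmem, size_t glbid)
{
    glbmem[glbid] += 4;
}

unittest
{
    auto data = new int[](8);
    CpuLauncher.call!myFunOnCpu(data, data.length);
    assert(data[3] == 4); // every element got the same treatment
}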


As an alternative to returning the captured expressions, the argument glbmem could have been passed by ref, and the same sort of expression capturing could occur. Heck, more arguments could've been passed, too; this doesn't require there to be one single argument representing global memory.


With CTFE, this method *I think* can also generate the code at 
compile time given the proper kind of 
expression-type-recording-BufferT.


Again, though, all this requires a significant amount of metaprogramming, heavy abuse of auto, and... did I mention a significant amount of metaprogramming?

Re: GPGPUs

2013-08-16 Thread Atash

On Saturday, 17 August 2013 at 00:53:39 UTC, luminousone wrote:

You can't mix cpu and gpu code, they must be separate.


H'okay, let's be clear here. When you say 'mix CPU and GPU code', 
you mean you can't mix them physically in the compiled executable 
for all currently extant cases. They aren't the same. I agree 
with that. That said, this doesn't preclude having CUDA-like 
behavior where small functions could be written that don't 
violate the constraints of GPU code and simultaneously have
semantics that could be executed on the CPU, and where such small 
functions are then allowed to be called from both CPU and GPU 
code.


However this still has the problem of the cpu having to generate CPU code from the contents of gpu{} code blocks, as the GPU is unable to allocate memory, so for example,


gpu{
auto resultGPU = dot(c, cGPU);
}

likely either won't work, or generates an array allocation in cpu code before the gpu block is otherwise run.


I wouldn't be so negative with the 'won't work' bit, 'cuz frankly 
the 'or' you wrote there is semantically like what OpenCL and 
CUDA do anyway.


Also, how does that dot product function know the correct index range to run on? Are we assuming it knows based on the length of a? While the syntax

c[] = a[] * b[];

is safe for this sort of call, a function is less safe to do this with; with function calls the range needs to be told to the function, and you would call this function without the gpu{} block as the function itself is marked.


auto resultGPU = dot$(0 .. 
returnLesser(cGPU.length,dGPU.length))(cGPU, dGPU);


I think it was mentioned earlier that there should be, much like 
in OpenCL or CUDA, builtins or otherwise available symbols for 
getting the global identifier of each work-item, the work-group 
size, global size, etc.


Remember, with GPUs you don't send instructions, you send whole programs, and the whole program must finish before you can move on to the next cpu instruction.


I disagree with the assumption that the CPU must wait for the GPU 
while the GPU is executing. Perhaps by default the behavior could 
be helpful for sequencing global memory in the GPU with CPU 
operations, but it's not a necessary behavior (see OpenCL and its, in my opinion, really nice queuing mechanism).


=== Another thing...

I'm with luminousone's suggestion for some manner of function 
attribute, to the tune of several metric tonnes of chimes. Wind 
chimes. I'm supporting this suggestion with at least a metric 
tonne of wind chimes.


*This* (and some small number of helpers), rather than straight-up dumping a new keyword and block type into the language. I really don't think D *needs* to have this any lower-level than a library-based solution, because it already has the tools to make it ridiculously more convenient than C/C++ (not necessarily as much as CUDA's totally separate program, nvcc, does, but a huge amount).


ex.


@kernel auto myFun(BufferT)(BufferT glbmem)
{
  // brings in the kernel keywords and whatnot depending on __FUNCTION__
  // (because mixins eval where they're mixed in)
  mixin KernelDefs;
  // ^ and that's just about all the syntactic noise, the rest uses mixed-in
  //   keywords and the glbmem object to define several expressions that
  //   effectively record the operations to be performed into the return type

  // assignment into global memory recovers the expression type in the glbmem.
  glbmem[glbid] += 4;

  // This assigns the *expression* glbmem[glbid] to val.
  auto val = glbmem[glbid];

  // Ignoring that this has a data race, this exemplifies recapturing the
  // expression 'val' (glbmem[glbid]) in glbmem[glbid+1].
  glbmem[glbid+1] = val;

  return glbmem; ///< I lied about the syntactic noise. This is the last bit.
}


Now if you want to, you can at runtime create an OpenCL-code 
string (for example) by passing a heavily metaprogrammed type in 
as BufferT. The call ends up looking like this:



auto promisedFutureResult = Gpu.call!myFun(buffer);


The kernel compilation (assuming OpenCL) is memoized, and the 
promisedFutureResult is some asynchronous object that implements 
concurrent programming's future (or something to that extent). 
For convenience, let's say that it blocks on any read other than 
some special poll/checking mechanism.


The constraints imposed on the kernel functions are generalizable enough to even execute the code on the CPU, as the launching call ( Gpu.call!myFun(buffer) ) can, instead of using an expression-buffer, just pass a normal array in and have the proper result pop out, given some interaction between the identifiers mixed in by KernelDefs and the launching caller (ex. using a loop).


With CTFE, this method *I think* can also generate the code at 
compile time given the proper kind of 
expression-type-recording-BufferT.


Again, though, this requires a significant amount of metaprogramming, heavy abuse of auto, and... did I mention a significant amount of metaprogramming?

Re: GPGPUs

2013-08-16 Thread Atash

On Friday, 16 August 2013 at 19:55:56 UTC, luminousone wrote:
The core (!) point here is that processor chips are rapidly 
becoming a
collection of heterogeneous cores. Any programming language 
that assumes

a single CPU or a collection of homogeneous CPUs has built-in
obsolescence.

So the question I am interested in is whether D is the 
language that can
allow me to express in a single codebase a program in which 
parts will
be executed on one or more GPGPUs and parts on multiple CPUs. 
D has

support for the latter, std.parallelism and std.concurrency.

I guess my question is whether people are interested in 
std.gpgpu (or

some more sane name).


CUDA works as a preprocessor pass that generates C files from .cu extension files.


In effect, to create a sensible environment for microthreaded 
programming, they extend the language.


a basic CUDA function looking something like...

__global__ void add( float * a, float * b, float * c) {
   int i = threadIdx.x;
   c[i] = a[i] + b[i];
}

add<<< 1, 10 >>>( ptrA, ptrB, ptrC );

There are built-in variables to handle the index location, threadIdx.x in the above example; this is something generated by the thread scheduler in the video card/apu device.


Generally calls to this setup have a very high latency, so using this for a small handful of items as in the above example makes no sense. In the above example that would end up using a single execution cluster, and leave you prey to the latency of the PCIe bus, execution time, and latency costs of the video memory.


It doesn't become effective until you are working with large data sets that can take advantage of a massive number of threads, where the latency problems would be secondary to the sheer calculations done.


As far as D goes, we really only have one built-in microthreading-capable language construct: foreach.


However I don't think a library extension similar to std.parallelism would work for gpu-based microthreading.


foreach would need to have something to tell the compiler to 
generate gpu bytecode for the code block it uses, and would 
need instructions on when to use said code block based on 
dataset size.


While it is completely possible to have very little change to functions (just add a new property @microthreaded and the built-in variables for the index position(s)), the calling syntax would need changes to support a work range or multidimensional range of some sort.


perhaps looking something like

add$(1 .. 10)(ptrA,ptrB,ptrC);

a templated function looking similar

add!(float)$(1 .. 10)(ptrA,ptrB,ptrC);


Regarding functionality, @microthreaded is sounding a lot like 
the __kernel or __global__ keywords in OpenCL and CUDA. Is this 
intentional?


The more metaphors that can be drawn between extant tools and whatever is come up with, the better, methinks.


Re: GPGPUs

2013-08-16 Thread Atash

On Friday, 16 August 2013 at 12:18:49 UTC, Russel Winder wrote:

On Fri, 2013-08-16 at 12:41 +0200, Paul Jurczak wrote:
[…]
Today you have to download the kernel to the attached GPGPU 
over the
bus. In the near future the GPGPU will exist in a single memory 
address
space shared with all the CPUs. At this point separately 
downloadable
kernels become a thing of the past, it becomes a 
compiler/loader issue

to get things right.


I'm iffy on the assumption that the future holds unified memory 
for heterogeneous devices. Even relatively recent products such 
as the Intel Xeon Phi have totally separate memory. I'm not aware 
of any general-computation-oriented products that don't have 
separate memory.


I'm also of the opinion that as long as people want to have 
devices that can scale in size, there will be modular devices. 
Because they're modular, there's some sort of a spacing between 
them and the machine, ex. PCIe (and, somewhat importantly, a 
physical distance between the added device and the CPU-stuff). 
Because of that, they're likely to have their own memory. 
Therefore, I'm personally not willing to bank on anything short 
of targeting the least common denominator here (non-uniform 
access memory) specifically because it looks like a necessity for 
scaling a physical collection of heterogeneous devices up in 
size, which in turn I *think* is a necessity for people trying to 
deal with growing data sets in the real world.


And because heterogeneous compute devices aren't *just* GPUs (ex. Intel Xeon Phi), I'd strongly suggest picking a more general name, like 'accelerators' or 'apu' (except AMD totally ran away with that acronym in marketing and I sort of hate them for it), or something else entirely.


That said, I'm no expert, so go ahead and rip 'mah opinions 
apart. :-D


Re: GPGPUs

2013-08-15 Thread Atash

On Tuesday, 13 August 2013 at 16:27:46 UTC, Russel Winder wrote:
The era of GPGPUs for Bitcoin mining is now over; they moved to ASICs.
The new market for GPGPUs is likely the banks, and other "Big 
Data"
folk. True many of the banks are already doing some GPGPU 
usage, but it

is not big as yet. But it is coming.

Most of the banks are either reinforcing their JVM commitment, 
via
Scala, or are re-architecting to C++ and Python. True there is 
some
C#/F# but it is all for terminals not for strategic computing, 
and it is
diminishing (despite what you might hear from .NET oriented 
training

companies).

Currently GPGPU tooling means C. OpenCL and CUDA (if you have 
to) are C
API for C coding. There are some C++ bindings. There are 
interesting
moves afoot with the JVM to enable access to GPGPU from Java, 
Scala,
Groovy, etc. but this is years away, which is a longer 
timescale than

the opportunity.

Python's offerings, PyOpenCL and PyCUDA are basically ways of 
managing C
coded kernels which rather misses the point. I may get involved 
in
trying to write an expression language in Python to go with 
PyOpenCL so
that kernels can be written in Python – a more ambitious 
version aimed

at Groovy is also mooted.

However, D has the opportunity of gaining a bridgehead if a 
combination
of D, PyD, QtD and C++ gets to be seen as a viable solid 
platform for
development.  The analogue here is the way Java is giving way 
to Scala
and Groovy, but in an evolutionary way as things all interwork. 
The
opportunity is for D to be seen as the analogue of Scala on the 
JVM for
the native code world: a language that interworks well with all 
the

other players on the platform but provides more.

The entry point would be if D had a way of creating GPGPU 
kernels that

is better than the current C/C++ + tooling.

This email is not a direct proposal to do work, just really an 
enquiry

to see if there is any interest in this area.


Clarifying question:

At what level is this interest aimed? Is it at the level of assembly/IL and other scary stuff, or is it at creating cleaner bindings and providing more convenient tools?


'Cuz I'm seeing a lot of potential in D for a library-based 
solution, handling kernel code similar to how CUDA does (I think 
the word is 'conveniently'), 'cept with OpenCL or whatever 
lower-level-standard-that-exists-on-more-than-just-one-company's-hardware 
and abuse of the import() expression coupled with a heavy dose of 
metaprogramming magic.
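
(For the curious, the import() bit refers to D's string imports: with `dmd -J<dir>` the kernel source gets baked into the binary at compile time. A trivial sketch, where the file name and the compileKernel hook are both made up:)

// saxpy.cl sits in the -J include path and is pulled in at compile time.
enum kernelSource = import("saxpy.cl");

// Hand the embedded source to whatever OpenCL binding is in use; the
// delegate here just stands in for that binding's build call.
void setupKernels(void delegate(string openclSource) compileKernel)
{
    compileKernel(kernelSource);
}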