On Fri, 20 May 2022, Raul Miller wrote:

Was that from intel?

I guess (after reading below) that that depends on what you mean by "from".

I want to know what _you_ mean by it.

there's going to be code which the compiler does not treat well.

Like what? Repeatedly changing the nameclass of some global name will be rough, but who does that? Concurrent access is an issue I have been thinking about, but if you access the same name repeatedly from multiple threads--at least one of the accesses being a write--you will have to fight over it anyway.

Polymorphic call sites are a known failure mode for inline caching, but 1) they can be mitigated with inlining, and 2) I expect that arrays are large enough that the compilation overhead will nearly always be worthwhile (and that in the cases where the operation is very cheap, it will be possible to detect that and fall back to an interpreter).
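
For reference, the mechanism itself is tiny. A minimal sketch in C of an inline-cached call site, where every name (array, type_of, specialise_plus) is hypothetical and purely illustrative, not taken from any real J implementation:

    typedef struct array array;                  /* opaque array value      */
    typedef array *(*impl_fn)(array *, array *); /* compiled specialisation */

    extern int     type_of(array *);             /* assumed type query      */
    extern impl_fn specialise_plus(int type);    /* assumed (re)compiler    */

    typedef struct {
        int     cached_type;  /* type tag seen on the previous call    */
        impl_fn cached_impl;  /* specialisation compiled for that type */
    } inline_cache;

    array *call_plus(inline_cache *ic, array *x, array *y) {
        if (type_of(x) == ic->cached_type)
            return ic->cached_impl(x, y);        /* fast, monomorphic path */
        /* Miss: re-specialise (or fall back to the interpreter), then
         * update the cache so the next call with this type is fast. */
        impl_fn f = specialise_plus(type_of(x));
        ic->cached_type = type_of(x);
        ic->cached_impl = f;
        return f(x, y);
    }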


That said, my reading of that paper suggests results not all that different in character from J's "special code" (though different in implementation).

'Sort of'. J has an awful failure mode; consider what happens when I write 1+2*x, for some very large array x. This will load each element of x from memory, multiply it by 2, and write the result back out to a temporary; then load each element of that temporary _again_, add 1 to it, and write it back out _again_. Half the memory accesses here are pure overhead, and the whole thing will likely perform only half as well as it should. Cache-blocking will help, but, in the limit, not enough. If I am a compiler, I can fuse these loops and generate really tight, beautiful code.
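
To see concretely what fusion buys, here is a rough sketch in C (assuming x is a flat array of 64-bit integers; the function names are just illustrative) of the interpreter's pass-per-primitive evaluation next to the single loop a compiler could emit:

    #include <stddef.h>
    #include <stdint.h>

    /* What the interpreter effectively does: two passes over memory,
     * with a full-sized temporary in between. */
    void one_plus_twice_unfused(int64_t *out, const int64_t *x,
                                int64_t *tmp, size_t n) {
        for (size_t i = 0; i < n; i++) tmp[i] = 2 * x[i];   /* 2*x */
        for (size_t i = 0; i < n; i++) out[i] = 1 + tmp[i]; /* 1+  */
    }

    /* What a fusing compiler can emit: one pass, no temporary,
     * half the memory traffic. */
    void one_plus_twice_fused(int64_t *out, const int64_t *x, size_t n) {
        for (size_t i = 0; i < n; i++) out[i] = 1 + 2 * x[i];
    }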

Now, 'superinstructions' can be used in tandem with special combinations to effect something a _bit_ similar, but as far as I know, no one has done this yet.

Add to which that a proper compiler would be much easier to use; for instance, you would be able to write explicit code and still take advantage of special combinations; x([-.-.)y would look the same as x-.x-.y, to a compiler.


When I think about compiling J, I envision something of a hierarchy, with singly typed rank 0 operations out at the leaves, rank 1 operations somewhere in the middle and the full J language somewhere behind all that.

When dealing with large data structures, it's fine to defer decisions about those singly typed rank 0 operations to the language primitives at execution time. But the stuff you're talking about really wants those decisions moved to earlier in the process and not repeated every time the operation is performed.

When I think about compiling J, I imagine a dataflow graph, one with explicit representations of rank (both function rank and verb rank, which turn out to be sort of unifiable), and potentially shape, as well as a cost model. The parts of the graph corresponding to inputs will have requirements for those inputs: datatype, rank, maybe shape. If an input does not satisfy the requirements, the original code will be re-specialised for the new ones and then added to the inline cache. A collection of heuristics will decide what to fuse, what to pipeline, and which computational work to assign to which compute nodes. Empirical results are that 1) predictable branches are essentially free, and 2) most call sites are monomorphic.
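
To give a flavour of the bookkeeping I have in mind, here is a much-simplified sketch in C of what a node in such a graph might carry; every name in it is hypothetical, not code from any existing implementation:

    #include <stddef.h>

    typedef enum { DT_BOOL, DT_INT, DT_FLOAT, DT_BOX } dtype;

    /* What a node demands of one of its inputs. */
    typedef struct {
        dtype   type;      /* required datatype                        */
        int     rank;      /* required rank                            */
        size_t *shape;     /* required shape, or NULL if unconstrained */
    } input_req;

    /* One operation in the dataflow graph. */
    typedef struct node {
        int           op;        /* which primitive or fused kernel    */
        int           n_inputs;
        input_req    *reqs;      /* what the inputs must look like     */
        struct node **inputs;    /* producers feeding this node        */
        double        est_cost;  /* cost-model estimate; drives fusion */
        void         *compiled;  /* specialised code, once generated   */
    } node;

On entry, the actual arguments get checked against reqs; a mismatch triggers re-specialisation, and the new variant goes into the inline cache, exactly as in the sketch above.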


And, I would argue that the caching architectures of modern machines benefit from a similar organization.

*snip*

Both CPU microarchitecture and data locality are very complex but broadly well understood topics; you are giving a somewhat simplistic view of them here.

Access to L1d incurs 4-5 cycles of _latency_. Meaning that, if some operation requires as input a value that lives in memory and is currently present in L1, that operation will not be able to begin until 4-5 cycles after the value has been requested.

In particular, if you have something else you can do while waiting for the value to get back from the cache, then the access is effectively free. That is what superscalar, out-of-order execution buys you. Ensuring that there is something else to do is accomplished in the large by the programmer, but in the small (over time, less and less small...) by the instruction scheduler and the instruction reorderer--duals, one of which is part of the compiler and one of which is part of the CPU.

Memory accesses are not the only sort of pipelined operation whose throughput is much better than its latency. Ordinary operations like addition have _exactly_ the same behaviour, and reducing the depth of dependency chains there has exactly the same (positive) effect on performance.
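
To make that concrete, here is the classic illustration, sketched in C: a reduction written as one dependency chain against the same reduction split across four independent accumulators. The exact latency numbers vary by machine; the shape of the effect does not.

    #include <stddef.h>

    /* One long dependency chain: each add waits on the previous one, so
     * the loop runs at the adder's latency (several cycles per element). */
    double sum_chained(const double *x, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += x[i];
        return s;
    }

    /* Four independent chains: the adds overlap in the pipeline, and the
     * loop can approach the adder's throughput instead of its latency. */
    double sum_split(const double *x, size_t n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        for (; i < n; i++) s0 += x[i];
        return (s0 + s1) + (s2 + s3);
    }

(With floating point the two versions can round differently, which is precisely why a compiler will not normally make this transformation for you.)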

The question of whether a memory access goes out to main memory is not an obvious one, nor is the cost of the various sorts of misses. For instance, some research presented at PLDI a year or two ago by OCaml developers found that iterating over a singly-linked list was free, if only you prefetched two nodes ahead. GC technology is very advanced these days and can automatically arrange to improve the locality of running programs.
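
I do not have that paper to hand, so the following is only a sketch of the general technique in C, using GCC/Clang's __builtin_prefetch; the exact arrangement in that work may differ, and whether the prefetch pays off depends on how much work there is per node and on how the allocator laid the list out.

    #include <stdint.h>

    struct node { struct node *next; int64_t value; };

    int64_t sum_list(const struct node *p) {
        int64_t s = 0;
        while (p) {
            /* Start pulling in the node two links ahead; by the time the
             * loop reaches it, it should already be in cache, so the
             * pointer chase no longer stalls on a main-memory miss. */
            if (p->next)
                __builtin_prefetch(p->next->next);
            s += p->value;
            p = p->next;
        }
        return s;
    }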

There are important qualitative differences between the data-parallel, GPU-oriented model and the pointer-chasing, CPU-oriented model. Regardless of whether long dependency chains are fundamentally slow or not, some problems fundamentally _require_ long dependency chains to solve, and pointer-chasers are simply better at dealing with that sort of situation. And while parallelism benefits all around, pointer-chasers and CPUs are better at more heterogeneous problems. (Though I do think forks in J are a Good Idea, and that the fork (f (g >) h t.'') is an important one.)

I am not sure exactly what you mean by:

Compared to 1 nanosecond for L1 cache, this factor of 80 can easily be larger than O(log n).

I assume it is an exhortation to consider the value of constant factors. But I don't see what it signifies.

 -E