On Fri, 20 May 2022, Raul Miller wrote:
Was that from intel?
I guess (after reading below) that that depends on what you mean by "from".
I want to know what _you_ mean by it.
there's going to be code which the compiler does not treat well.
Like what? Repeatedly changing the nameclass of some global name will be
rough, but who does that? Concurrent access is an issue I have been thinking
about, but if you access the same name repeatedly from multiple threads--at
least one of those accesses being a write--you will have to fight over it
anyway.
Polymorphic call sites are a known failure mode for inline caching, but 1)
they can be mitigated with inlining, and 2) I expect that arrays are large
enough that the compilation overhead will nearly always be worthwhile (and
that in the cases where the operation is very cheap, it will be possible to
detect that and fall back to an interpreter).
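To be concrete about what I mean, here is a rough sketch in C (all the
names--dtype_of, specialise, interpret, the element-count threshold--are made
up for illustration): a cached call site is just a guard in front of a direct
call, with the interpreter as a safety net for cheap cases.

    typedef struct Noun Noun;                /* an array value */
    typedef Noun *(*Compiled)(Noun *);       /* a specialised code fragment */

    typedef struct {
        int      cached_dtype;               /* what we specialised for last time */
        Compiled fast_path;                  /* NULL until first specialisation */
    } CallSite;

    int       dtype_of(Noun *);              /* assumed helpers */
    long      elements_of(Noun *);
    Noun     *interpret(Noun *);             /* the ordinary interpreter */
    Compiled  specialise(int dtype);         /* compile a version for this dtype */

    Noun *call(CallSite *s, Noun *x) {
        int d = dtype_of(x);
        if (s->fast_path && d == s->cached_dtype)
            return s->fast_path(x);          /* guard hit: the common case */
        if (elements_of(x) < 1000)           /* too little work to be worth compiling */
            return interpret(x);
        s->cached_dtype = d;                 /* guard miss: (re)specialise and cache */
        s->fast_path = specialise(d);
        return s->fast_path(x);
    }

A polymorphic site just thrashes that one-entry cache; inlining helps because
each inlined copy gets its own cache.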
That said, my reading of that paper suggests results not all that different
in character from J's "special code" (though different in implementation).
'Sort of'. J has an awful failure mode; consider what happens when I write
1+2*x, for some very large array x. This will load each element of x from
memory, multiply it by 2, and write the result back out; then load each
element _again_, add 1 to it, and write it back out _again_. Half the memory
accesses here are pure overhead, and this will likely perform only half as
well as it should. Cache-blocking will help, but, in the limit, not enough.
If I am a compiler, I can fuse these loops and generate really tight,
beautiful code.
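In C terms (purely illustrative), that is the difference between two passes
over memory and one:

    #include <stddef.h>

    /* What 1+2*x amounts to today: two full passes and a temporary. */
    void unfused(const double *x, double *tmp, double *out, size_t n) {
        for (size_t i = 0; i < n; i++) tmp[i] = 2 * x[i];    /* 2*x */
        for (size_t i = 0; i < n; i++) out[i] = 1 + tmp[i];  /* 1+  */
    }

    /* What a fusing compiler can emit: one pass, no temporary. */
    void fused(const double *x, double *out, size_t n) {
        for (size_t i = 0; i < n; i++) out[i] = 1 + 2 * x[i];
    }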
Now, 'superinstructions' can be used in tandem with special combinations to
effect something a _bit_ similar, but as far as I know, no one has done this
yet.
Add to that: a proper compiler would be much easier to use; for instance,
you would be able to write explicit code and still take advantage of special
combinations; x([-.-.)y would look the same as x-.x-.y, to a compiler.
When I think about compiling J, I envision something of a hierarchy, with
singly typed rank 0 operations out at the leaves, rank 1 operations
somewhere in the middle and the full J language somewhere behind all that.
When dealing with large data structures, it's fine to defer decisions about
those singly typed rank 0 operations to the language primitives at execution
time. But the stuff you're talking about really wants those decisions moved
to earlier in the process and not repeated every time the operation is
performed.
When I think about compiling J, I imagine a dataflow graph, which has explicit
representations of rank (both function rank and verb rank, which turn out to
be sort of unifiable), and potentially shape, as well as a cost model.
The parts of the graph corresponding to inputs will have requirements for
those inputs: datatype, rank, maybe shape. If an input does not satisfy the
requirements, the original code will be re-specialised for those, and then
added to the inline cache. A collection of heuristics will decide what to
fuse, what to pipeline, and which computational work to assign to which
compute nodes. Empirical results are that 1) well-predicted branches are free,
and 2) most call sites are monomorphic.
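In the same made-up C as the sketch above, the cache at such an input node
might be nothing more than a handful of (requirements, code) pairs--here just
datatype and rank--with a miss triggering re-specialisation:

    typedef struct Noun Noun;
    typedef Noun *(*Compiled)(Noun *);

    typedef struct { int dtype, rank; Compiled code; } Specialisation;
    typedef struct { Specialisation entries[4]; int n; } SpecCache;

    int      dtype_of(Noun *);                  /* assumed helpers */
    int      rank_of(Noun *);
    Compiled respecialise(int dtype, int rank); /* recompile for new requirements */

    Noun *dispatch(SpecCache *c, Noun *x) {
        int d = dtype_of(x), r = rank_of(x);
        for (int i = 0; i < c->n; i++)          /* monomorphic sites hit entry 0 */
            if (c->entries[i].dtype == d && c->entries[i].rank == r)
                return c->entries[i].code(x);
        Compiled f = respecialise(d, r);        /* miss: specialise and remember */
        if (c->n < 4)
            c->entries[c->n++] = (Specialisation){d, r, f};
        return f(x);
    }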
And, I would argue that the caching architectures of modern machines benefit
from a similar organization.
*snip*
Both CPU microarchitecture and data locality are very complex but broadly well
understood topics; you are giving a somewhat simplistic view of them here.
Access to L1d incurs 4-5 cycles of _latency_. Meaning that, if some operation
requires as input a value which is resident in L1, that operation will not be
able to begin until 4-5 cycles after the value has been requested.
In particular, if you have something else you can do while waiting for the
value to get back from the cache, then the access is free. That is what
superscalar, out-of-order hardware buys you. Ensuring that there is something
else to do is
accomplished in the large by the programmer, but in the small (over time, less
and less small...) by the instruction scheduler and the instruction
reorderer--duals, one of which is part of the compiler and one of which is
part of the CPU.
Memory accesses are not the only sort of pipelined operation which has greater
throughput than latency. Ordinary operations like addition have _exactly_ the
same behaviour, and reducing the depth of dependency chains there has exactly
the same (positive) effect on performance.
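For example (C again, illustrative only): a sum written as one long chain of
additions is limited by the latency of the adder rather than its throughput;
splitting it into a few independent accumulators lets the pipelined units
overlap.

    #include <stddef.h>

    /* One dependency chain: each add has to wait for the previous one. */
    double sum_chained(const double *a, size_t n) {
        double s = 0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }

    /* Four independent chains: the adds overlap in the pipeline. */
    double sum_split(const double *a, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
        }
        for (; i < n; i++) s0 += a[i];          /* leftovers */
        return (s0 + s1) + (s2 + s3);
    }

(The two are not bit-identical for floating point, since the additions get
reassociated; that is the price of breaking the chain.)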
The question of whether a memory access goes out to main memory is not an
obvious one, nor is the cost of various sorts of misses. For instance, some
research presented at PLDI a year or two ago by OCaml developers found that
iterating a singly-linked list was free, if only you prefetched two nodes
ahead. GC technology is very advanced these days and can automatically
arrange to improve the locality of running programs.
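The shape of the prefetching trick, roughly (C again; __builtin_prefetch is
the gcc/clang builtin, and the rest--node layout, per-node work--is made up;
the real result presumably also depends on allocation order and suchlike
details):

    struct node { long payload; struct node *next; };

    long sum_list(struct node *p) {
        long total = 0;
        while (p) {
            if (p->next)
                /* start fetching the node two hops ahead; by the time we walk
                   there it should already be in cache (prefetching a null
                   pointer is harmless) */
                __builtin_prefetch(p->next->next);
            total += p->payload;                /* stand-in for the real work */
            p = p->next;
        }
        return total;
    }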
There are important qualitative differences between the data-parallel
GPU-oriented model and the pointer-chasing CPU-oriented model. Regardless of
whether long dependency chains are fundamentally slow or not, some problems
fundamentally _require_ long dependency chains to solve, and pointer-chasers
are simply better at dealing with this sort of situation. And while
parallelism benefits all around, pointer-chasers and CPUs are better at more
heterogeneous problems. (Though I do think forks in J are a Good Idea, and
that the fork (f (g >) h t.'') is an important one.)
I am not sure exactly what you mean by:
Compared to 1 nanosecond for L1 cache, this factor of 80 can easily be
larger than O(log n).
I assume it is an exhortation to consider the value of constant factors. But
I don't see what it signifies.
-E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm