On Sat, May 21, 2022 at 1:30 AM Elijah Stone <elro...@elronnd.net> wrote:
> On Fri, 20 May 2022, Raul Miller wrote:
> >> Was that from intel?
> >
> > I guess (after reading below) that that depends on what you mean by "from".
>
> I want to know what _you_ mean by it.

I meant that, as a general rule, I see it used in Intel
documentation more than anywhere else, by an order of magnitude.

This might reflect my choice of search terms, but I can't very well
cite documentation which I haven't found.

> > there's going to be code which the compiler does not treat well.
>
> Like what?  Repeatedly changing the nameclass of some global name will be
> rough, but who does that?  Concurrent access is an issue I have been thinking
> about, but if you try to access the same name repeatedly from multiple
> threads--one of which accesses is a write--you will have to fight over it
> anyway.

It's largely about mapping language constructs onto underlying type systems.

For example, hardware has a variety of fixed-size constraints, and in
some contexts you get significant benefits from using constructs which
fit those constraints.
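
A small J illustration of those constraints (3!:0 reports a noun's
internal datatype; 4 is 64-bit integer, 8 is 64-bit float, 64 is
extended precision):

   3!:0 i. 5
4
   3!:0 0.5 + i. 5
8
   3!:0 2 ^ 80x
64

Integer and float arrays map directly onto fixed-size hardware types;
extended precision has no such mapping, and code which stays within
the fixed-size types is the code which sees those benefits.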

> Polymorphic call sites are a known failure mode for inline caching, but 1)
> they can be mitigated with inlining, and 2) I expect that arrays are large
> enough that the compilation overhead will nearly always be worthwhile (and
> that in the cases where the operation is very cheap, it will be possible to
> detect that and fall back to an interpreter).

Also, code analysis which identifies constraints on that polymorphism
is a well-known compilation technique.

But if pointer-chasing performance is what characterizes your context,
that implies you're going to be dealing with at least some
small arrays which are both type-constrained and performance-critical.
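
As a rough sketch of why the small arrays are the painful case in
current J (x 6!:2 y averages x timed runs of sentence y; the numbers
vary by machine, so I'm not quoting any):

   small =: i. 10
   big =: i. 1e7
   1000 6!:2 '1 + 2 * small'   NB. nearly all dispatch and type-checking overhead
   10 6!:2 '1 + 2 * big'       NB. nearly all memory traffic

Per element, the small array costs far more, and that per-sentence
overhead is exactly what type-constrained compilation could remove.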

> > That said, my reading of that paper suggests results not all that different
> > in character from J's "special code" (though different in implementation).
>
> 'Sort of'.  J has an awful failure mode; consider what happens when I write
> 1+2*x, for some very large array x.  This will load each element of x from
> memory, multiply it by 2, and write it back out; then load each element
> _again_, add 1 to it, and then write it back out _again_. Half the memory
> accesses here are pure overhead, and this will likely perform only half as
> well as it should.  Cache-blocking will help, but, in the limit, not enough.
> If I am a compiler, I can fuse these loops and generate really tight,
> beautiful code.
>
> Now, 'superinstructions' can be used in tandem with special combinations to
> effect something a _bit_ similar, but as far as I know, no one has done this
> yet.

Yes.
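
For anyone who wants to see that failure mode in current J, here's a
sketch (run it yourself; timings vary by machine):

   x =: ? 1e7 # 0      NB. ten million random floats
   6!:2 '2 * x'        NB. one pass over x
   6!:2 '1 + 2 * x'    NB. two passes: 2*x is materialized, then 1+ re-reads it

The second sentence moves roughly twice the memory for nearly the same
arithmetic; a fusing compiler would do it in one pass.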

> Add to which that a proper compiler would be much easier to use; for instance,
> you would be able to write explicit code and still take advantage of special
> combinations; x([-.-.)y would look the same as x-.x-.y, to a compiler.

Right, but consider 1+2*x vs a+b*x. The former is much easier for a
compiler to optimize, because the type, shape, and values of 1 and 2
are fixed, while a and b are unconstrained (and, in the general
case, might not even be nouns).
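
A concrete illustration of that last point -- the class of a name
isn't fixed, so a+b*x can mean different things depending on what b is
bound to when the sentence runs:

   a =: 1
   b =: 2
   a + b * 3 4 5     NB. b is a noun: 1 + 2 * y
7 9 11
   b =: +:           NB. rebind b to a verb (double)
   a + b * 3 4 5     NB. now parses as 1 + +: * 3 4 5
3 3 3

A compiler has to either prove b's nameclass or guard against it
changing; the constants 1 and 2 need no such treatment.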

That said, if I remember right, code which accesses memory in a serial
pattern gives the CPU enough information that it can speculatively
fetch the next block of memory before it's needed. (I believe the
jargon for this mechanism is "hardware prefetching", via a stream
prefetcher, though I haven't found good reference material discussing
it.)
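
A sketch of that prefetching effect in J (again, no numbers quoted --
run it and compare):

   x =: ? 1e7 # 0
   p =: ?~ # x            NB. a random permutation of the indices
   6!:2 '(i. # x) { x'    NB. gather in serial order: prefetch-friendly
   6!:2 'p { x'           NB. gather in random order: defeats the prefetcher

The same amount of data moves either way; the serial version lets the
hardware start fetching ahead of demand.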

> When I think about compiling j, I imagine a dataflow graph.  Which graph has
> explicit representations of rank (both noun rank and verb rank, which turn
> out to be sort of unifiable), and potentially shape, as well as a cost model.
> The parts of the graph corresponding to inputs will have requirements for
> those inputs: datatype, rank, maybe shape.  If an input does not satisfy the
> requirements, the original code will be re-specialised for those, and then
> added to the inline cache.  A collection of heuristics will decide what to
> fuse, what to pipeline, and which computational work to assign to which
> compute nodes.  An empirical result is that 1) branches are free; 2) most call
> sites are monomorphic.

I expect that the cost model gets tricky, and might need to be
simplified (or oversimplified), with consequences which will require
some informed judgements.

That said, "most call sites are monomorphic" is a good example of what
I meant when I was talking about compiler constraints. In practice
that winds up being an issue which coders need to be aware of and
manage.

(However, I'm not sure that "branches are free" is a good model.
Branches have a cost; it's negligible in some contexts -- a correctly
predicted branch is nearly free, but a mispredicted one is not.)

> Both CPU microarchitecture and data locality are very complex but broadly well
> understood topics; you are giving a somewhat simplistic view of them here.

They are broadly understood, but finding the specifics is difficult --
there's some significant variation between different machines.

> Access to L1d incurs 4-5 cycles of _latency_.  Meaning that, if some operation
> requires as input a value stored in memory present in l1, that operation will
> not be able to begin until 4-5 cycles after the value has been requested.

Yes.

> In particular, if you have something else you can do while waiting for the
> value to get back from the cache, then the access is free.  That is what
> 'superscalar' means.  Ensuring that there is something else to do is
> accomplished in the large by the programmer, but in the small (over time, less
> and less small...) by the instruction scheduler and the instruction
> reorderer--duals, one of which is part of the compiler and one of which is
> part of the cpu.

This depends on the algorithm.

And, while something like a binary tree search might be phrasable in a
way which fills the few cycles spent waiting on L1, an L3 fetch --
tens of cycles -- is a different matter.

> Memory accesses are not the only sort of pipelined operation which has greater
> throughput than latency.  Ordinary operations like addition have _exactly_ the
> same behaviour, and reducing the depth of dependency chains there has exactly
> the same (positive) effect on performance.

I'm not sure what you're saying here.

I mean, sure -- if you are adding two values which need to come from
memory, memory access latency is going to be an issue.

But if you meant something else, it went over my head.

> The question of whether a memory access goes out to main memory is not an
> obvious one, nor is the cost of various sorts of misses.  For instance, some
> research presented at PLDI a year or two ago by ocaml developers found that
> iteration of a singly-linked list was free, if only you prefetched two nodes
> ahead.  GC technology is very advanced these days and can automatically
> arrange to improve the locality of running programs.

That's an interesting insight, though it raises the question:
ahead of what?

> There are important qualitative differences between the data-parallel
> gpu-oriented model and the pointer-chasing cpu-oriented model. Regardless of
> whether long dependency chains are fundamentally slow or not, some problems
> fundamentally _require_ long dependency chains to solve, and pointer-chasers
> are simply better at dealing with this sort of situation.  And while
> parallelism benefits all around, pointer-chasers and cpus are better at more
> heterogeneous problems. (Though I do think forks in j are a Good Idea, and
> that the fork (f (g >) h t.'') is an important one.)

Certainly.

But, of course, the details matter.

> I am not sure exactly what you mean by:
>
> > Compared to 1 nanosecond for L1 cache, this factor of 80 can easily be
> > larger than O(log n).

80 nanoseconds to fetch from main memory vs. 1 nanosecond to fetch
from L1 cache is a factor of 80.

> I assume it is an exhortation to consider the value of constant factors.  But
> I don't see what it signifies.

I thought I pointed that out.

I'll try stating it differently: when comparing algorithms, 80 is
bigger than log2(n) for any value of n that matters in a performance
comparison.

   2^80x
1208925819614629174706176

Nobody works with data sets which are that large.
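
Going the other direction (2 ^. n is J's base-2 logarithm):

   2 ^. 1e12
39.8631

Even a trillion-element data set gives a log factor of about 40, still
only half of the 80.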

-- 
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
