On Fri, 20 May 2022, Raul Miller wrote:
Was that from intel?
I guess (after reading below) that that depends on what you mean by "from".
I want to know what _you_ mean by it.
there's going to be code which the compiler does not treat well.
Like what? Repeatedly changing the nameclass of some global name will be
rough, but who does that? Concurrent access is an issue I have been thinking
about, but if you access the same name repeatedly from multiple threads--at
least one of those accesses being a write--you will have to fight over it
anyway.
Polymorphic call sites are a known failure mode for inline caching, but 1)
they can be mitigated with inlining, and 2) I expect that arrays are large
enough that the compilation overhead will nearly always be worthwhile (and
that in the cases where the operation is very cheap, it will be possible to
detect that and fall back to an interpreter).
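To be concrete about what I mean, here is a rough sketch in C (all the
names--dtype_of, specialise, interpret, the element-count threshold--are made
up for illustration): a cached call site is just a guard in front of a direct
call, with the interpreter as a safety net for cheap cases.

    typedef struct Noun Noun;                /* an array value */
    typedef Noun *(*Compiled)(Noun *);       /* a specialised code fragment */

    typedef struct {
        int      cached_dtype;               /* what we specialised for last time */
        Compiled fast_path;                  /* NULL until first specialisation */
    } CallSite;

    int       dtype_of(Noun *);              /* assumed helpers */
    long      elements_of(Noun *);
    Noun     *interpret(Noun *);             /* the ordinary interpreter */
    Compiled  specialise(int dtype);         /* compile a version for this dtype */

    Noun *call(CallSite *s, Noun *x) {
        int d = dtype_of(x);
        if (s->fast_path && d == s->cached_dtype)
            return s->fast_path(x);          /* guard hit: the common case */
        if (elements_of(x) < 1000)           /* too little work to be worth compiling */
            return interpret(x);
        s->cached_dtype = d;                 /* guard miss: (re)specialise and cache */
        s->fast_path = specialise(d);
        return s->fast_path(x);
    }

A polymorphic site just thrashes that one-entry cache; inlining helps because
each inlined copy gets its own cache.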
That said, my reading of that paper suggests results not all that different
in character from J's "special code" (though different in implementation).
'Sort of'. J has an awful failure mode; consider what happens when I write
1+2*x, for some very large array x. This will load each element of x from
memory, multiply it by 2, and write the result back out; then load each
element _again_, add 1 to it, and write it back out _again_. Half the memory
accesses here are pure overhead, and this will likely perform only half as
well as it should. Cache-blocking will help, but, in the limit, not enough.
If I am a compiler, I can fuse these loops and generate really tight,
beautiful code.
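In C terms (purely illustrative), that is the difference between two passes
over memory and one:

    #include <stddef.h>

    /* What 1+2*x amounts to today: two full passes and a temporary. */
    void unfused(const double *x, double *tmp, double *out, size_t n) {
        for (size_t i = 0; i < n; i++) tmp[i] = 2 * x[i];    /* 2*x */
        for (size_t i = 0; i < n; i++) out[i] = 1 + tmp[i];  /* 1+  */
    }

    /* What a fusing compiler can emit: one pass, no temporary. */
    void fused(const double *x, double *out, size_t n) {
        for (size_t i = 0; i < n; i++) out[i] = 1 + 2 * x[i];
    }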
Now, 'superinstructions' can be used in tandem with special combinations to
effect something a _bit_ similar, but as far as I know, no one has done this
yet.
Add to that: a proper compiler would be much easier to use; for instance,
you would be able to write explicit code and still take advantage of special
combinations; x([-.-.)y would look the same as x-.x-.y, to a compiler.
When I think about compiling J, I envision something of a hierarchy, with
singly typed rank 0 operations out at the leaves, rank 1 operations
somewhere in the middle and the full J language somewhere behind all that.
When dealing with large data structures, it's fine to defer decisions about
those singly typed rank 0 operations to the language primitives at execution
time. But the stuff you're talking about really wants those decisions moved
to earlier in the process and not repeated every time the operation is
performed.
When I think about compiling J, I imagine a dataflow graph, which has explicit
representations of rank (both function rank and verb rank, which turn out to
be sort of unifiable), and potentially shape, as well as a cost model.
The parts of the graph corresponding to inputs will have requirements for
those inputs: datatype, rank, maybe shape. If an input does not satisfy the
requirements, the original code will be re-specialised for those, and then
added to the inline cache. A collection of heuristics will decide what to
fuse, what to pipeline, and which computational work to assign to which
compute nodes. Empirical results are that 1) well-predicted branches are free,
and 2) most call sites are monomorphic.
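In the same made-up C as the sketch above, the cache at such an input node
might be nothing more than a handful of (requirements, code) pairs--here just
datatype and rank--with a miss triggering re-specialisation:

    typedef struct Noun Noun;
    typedef Noun *(*Compiled)(Noun *);

    typedef struct { int dtype, rank; Compiled code; } Specialisation;
    typedef struct { Specialisation entries[4]; int n; } SpecCache;

    int      dtype_of(Noun *);                  /* assumed helpers */
    int      rank_of(Noun *);
    Compiled respecialise(int dtype, int rank); /* recompile for new requirements */

    Noun *dispatch(SpecCache *c, Noun *x) {
        int d = dtype_of(x), r = rank_of(x);
        for (int i = 0; i < c->n; i++)          /* monomorphic sites hit entry 0 */
            if (c->entries[i].dtype == d && c->entries[i].rank == r)
                return c->entries[i].code(x);
        Compiled f = respecialise(d, r);        /* miss: specialise and remember */
        if (c->n < 4)
            c->entries[c->n++] = (Specialisation){d, r, f};
        return f(x);
    }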
And, I would argue that the caching architectures of modern machines benefit
from a similar organization.
*snip*
Both CPU microarchitecture and data locality are very complex but broadly well
understood topics; you are giving a somewhat simplistic view of them here.
Access to L1d incurs 4-5 cycles of _latency_. Meaning that, if some operation
requires as input a value which is resident in L1, that operation will not be
able to begin until 4-5 cycles after the value has been requested.
In particular, if you have something else you can do while waiting for the
value to get back from the cache, then the access is free. That is what
superscalar, out-of-order hardware buys you. Ensuring that there is something
else to do is
accomplished in the large by the programmer, but in the small (over time, less
and less small...) by the instruction scheduler and the instruction
reorderer--duals, one of which is part of the compiler and one of which is
part of the CPU.
Memory accesses are not the only sort of pipelined operation which has greater
throughput than latency. Ordinary operations like addition have _exactly_ the
same behaviour, and reducing the depth of dependency chains there has exactly
the same (positive) effect on performance.
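For example (C again, illustrative only): a sum written as one long chain of
additions is limited by the latency of the adder rather than its throughput;
splitting it into a few independent accumulators lets the pipelined units
overlap.

    #include <stddef.h>

    /* One dependency chain: each add has to wait for the previous one. */
    double sum_chained(const double *a, size_t n) {
        double s = 0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }

    /* Four independent chains: the adds overlap in the pipeline. */
    double sum_split(const double *a, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
        }
        for (; i < n; i++) s0 += a[i];          /* leftovers */
        return (s0 + s1) + (s2 + s3);
    }

(The two are not bit-identical for floating point, since the additions get
reassociated; that is the price of breaking the chain.)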
The question of whether a memory access goes out to main memory is not an
obvious one, nor is the cost of various sorts of misses. For instance, some
research presented at PLDI a year or two ago by OCaml developers found that
iterating a singly-linked list was free, if only you prefetched two nodes
ahead. GC technology is very advanced these days and can automatically
arrange to improve the locality of running programs.
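The shape of the prefetching trick, roughly (C again; __builtin_prefetch is
the gcc/clang builtin, and the rest--node layout, per-node work--is made up;
the real result presumably also depends on allocation order and suchlike
details):

    struct node { long payload; struct node *next; };

    long sum_list(struct node *p) {
        long total = 0;
        while (p) {
            if (p->next)
                /* start fetching the node two hops ahead; by the time we walk
                   there it should already be in cache (prefetching a null
                   pointer is harmless) */
                __builtin_prefetch(p->next->next);
            total += p->payload;                /* stand-in for the real work */
            p = p->next;
        }
        return total;
    }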
There are important qualitative differences between the data-parallel
GPU-oriented model and the pointer-chasing CPU-oriented model. Regardless of
whether long dependency chains are fundamentally slow or not, some problems
fundamentally _require_ long dependency chains to solve, and pointer-chasers
are simply better at dealing with this sort of situation. And while
parallelism benefits all around, pointer-chasers and CPUs are better at more
heterogeneous problems. (Though I do think forks in J are a Good Idea, and
that the fork (f (g >) h t.'') is an important one.)
I am not sure exactly what you mean by:
Compared to 1 nanosecond for L1 cache, this factor of 80 can easily be
larger than O(log n).
I assume it is an exhortation to consider the value of constant factors. But
I don't see what it signifies.
-E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm