This thing is turning into fascinating reading, thanks.

There is nevertheless a small tidbit of information I would like to add
concerning the creation of numbered locales.

As pointed out:
   locs =: cocreate@''@> i.8
I would add that special code is accepted as well:
   locs =: cocreate&'' i.8
The first line will work for any verb; the second is specific to cocreate
and, as far as I can tell, not documented.
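
For anyone trying this at the keyboard, here is a minimal sketch of how the
resulting locales get used (the locale numbers you get back depend on what the
session has already created, and I am taking the second spelling above on
faith):

   locs =: cocreate&'' i.8     NB. eight fresh numbered locales, as boxed names
   loc  =: {. locs             NB. pick one of them
   X__loc =: 42                NB. indirect locative: create X inside that locale
   X__loc                      NB. reads back 42
   coerase locs                NB. erase them all when done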

On Sat, May 21, 2022 at 02:37, Raul Miller <rauldmil...@gmail.com> wrote:

> On Sat, May 21, 2022 at 1:30 AM Elijah Stone <elro...@elronnd.net> wrote:
> > On Fri, 20 May 2022, Raul Miller wrote:
> > >> Was that from intel?
> > >
> > I guess (after reading below) that that depends on what you mean by
> > "from".
> >
> > I want to know what _you_ mean by it.
>
> I meant that as a general rule, I see it being used in intel
> documentation more than anywhere else, by an order of magnitude.
>
> This might reflect on my choice of search terms, but I'm not sure that
> I should be referring to documentation which I do not find.
>
> > > there's going to be code which the compiler does not treat well.
> >
> > Like what?  Repeatedly changing the nameclass of some global name will be
> > rough, but who does that?  Concurrent access is an issue I have been
> > thinking about, but if you try to access the same name repeatedly from
> > multiple threads--one of which accesses is a write--you will have to
> > fight over it anyway.
>
> It's largely about mapping language constructs onto underlying type
> systems.
>
> For example, hardware has a variety of fixed size constraints and in
> some contexts you get significant benefits using constructs which fit
> those constraints.
>
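As a concrete instance of those fixed-size constraints, J already picks
internal types that match them; a small illustration (on a 64-bit build):

   datatype 0 1 0 1        NB. stored as booleans, one byte per atom
boolean
   datatype 1 + 0 1 0 1    NB. widened to 64-bit integers once the values demand it
integer
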
> > Polymorphic call sites are a known failure mode for inline caching,
> > but 1) they can be mitigated with inlining, and 2) I expect that
> > arrays are large enough that the compilation overhead will nearly
> > always be worthwhile (and that in the cases where the operation is
> > very cheap, it will be possible to detect that and fall back to an
> > interpreter).
>
> And, code analysis which identifies constraints on that polymorphism
> is a well known compilation technique.
>
> But, also, if pointer chasing performance adequately describes your
> context, that implies you're going to be dealing with at least some
> small arrays which are both type constrained and performance critical.
>
> > That said, my reading of that paper suggests results not all that
> > different in character from J's "special code" (though different in
> > implementation).
> >
> > 'Sort of'.  J has an awful failure mode; consider what happens when I
> > write 1+2*x, for some very large array x.  This will load each element
> > of x into memory, multiply it by 2, and write it back out; then load
> > each element _again_, add 1 to it, and then write it back out _again_.
> > Half the memory accesses here are pure overhead, and this will likely
> > perform only half as well as it should.  Cache-blocking will help,
> > but, in the limit, not enough.  If I am a compiler, I can fuse these
> > loops and generate really tight, beautiful code.
> >
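To make the extra pass concrete in today's J, one can compare the two-primitive
spelling with 1 2 p. x, which computes the same polynomial; whether p. really
avoids the intermediate array is an implementation detail I have not checked,
so treat this only as a sketch (figures are machine dependent):

   x =: ? 1e7 # 0            NB. ten million random floats
   timespacex '1 + 2 * x'    NB. * makes one pass, + makes a second pass over the intermediate result
   timespacex '1 2 p. x'     NB. same values; possibly a single pass, depending on the implementation
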
> > Now, 'superinstructions' can be used in tandem with special
> > combinations to effect something a _bit_ similar, but as far as I
> > know, no one has done this yet.
>
> Yes.
>
> > Add to which that a proper compiler would be much easier to use; for
> > instance, you would be able to write explicit code and still take
> > advantage of special combinations; x([-.-.)y would look the same as
> > x-.x-.y, to a compiler.
>
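To make that concrete, here is the intersection idiom in both spellings; as I
understand it, only the tacit form is eligible for the special code, but the
results are identical:

   x =: 1 2 3 4 5
   y =: 3 4 5 6 7
   x ([ -. -.) y       NB. tacit spelling
3 4 5
   x -. x -. y         NB. explicit spelling, same result
3 4 5
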
> Right, but consider 1+2*x vs a+b*x. The former is much easier for a
> compiler to optimize, because the type, shape and values of 1 and 2
> are constrained while a and b are not constrained (and, in the general
> case, might not even be nouns).
>
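A small illustration of why a and b are harder for a compiler: nothing stops
those names from being reassigned later, even to verbs, so their types cannot
be pinned down the way the literals 1 and 2 can:

   a =: 1
   b =: 2
   a + b * i.5
1 3 5 7 9
   a =: +/    NB. the same name may later denote a verb, and a + b * x then means something else entirely
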
> That said, if I remember right, code which accesses memory in a serial
> pattern gives the cpu enough information that it can speculatively
> fetch the next block of memory before it's needed. (However, I don't
> remember the jargon which was used to describe this mechanism, and
> don't know how to search for reference material discussing this
> issue.)
>
> > When I think about compiling j, I imagine a dataflow graph.  Which
> > graph has explicit representations of rank (both function rank and
> > verb rank, which turn out to be sort of unifiable), and potentially
> > shape, as well as a cost model.  The parts of the graph corresponding
> > to inputs will have requirements for those inputs: datatype, rank,
> > maybe shape.  If an input does not satisfy the requirements, the
> > original code will be re-specialised for those, and then added to the
> > inline cache.  A collection of heuristics will decide what to fuse,
> > what to pipeline, and which computational work to assign to which
> > compute nodes.  An empirical result is that 1) branches are free;
> > 2) most call sites are monomorphic.
>
> I expect that the cost model gets tricky, and might need to be
> simplified or oversimplified with consequences which will require some
> informed judgements.
>
> That said, "most call sites are monomorphic" is a good example of what
> I meant when I was talking about compiler constraints. In practice
> that winds up being an issue which coders need to be aware of and
> manage.
>
> (However.. I'm not sure that "branches are free" is a good model.
> Branches have a cost which is negligible in some contexts.)
>
> > Both CPU microarchitecture and data locality are very complex but
> > broadly well understood topics; you are giving a somewhat simplistic
> > view of them here.
>
> They are broadly understood, but finding the specifics is difficult --
> there's some significant variation between different machines.
>
> > Access to L1d incurs 4-5 cycles of _latency_.  Meaning that, if some
> > operation requires as input a value stored in memory present in l1,
> > that operation will not be able to begin until 4-5 cycles after the
> > value has been requested.
>
> Yes.
>
> > In particular, if you have something else you can do while waiting for
> > the value to get back from the cache, then the access is free.  That
> > is what 'superscalar' means.  Ensuring that there is something else to
> > do is accomplished in the large by the programmer, but in the small
> > (over time, less and less small...) by the instruction scheduler and
> > the instruction reorderer--duals, one of which is part of the compiler
> > and one of which is part of the cpu.
>
> This depends on the algorithm.
>
> And, while something like a binary tree search might be phrasable in a
> way which occupies the time waiting for L1, an L3 fetch is a different
> matter.
>
> > Memory accesses are not the only sort of pipelined operation which has
> > greater throughput than latency.  Ordinary operations like addition
> > have _exactly_ the same behaviour, and reducing the depth of
> > dependency chains there has exactly the same (positive) effect on
> > performance.
>
> I'm not sure what you're saying here.
>
> I mean, sure -- if you are adding to values which need to hit memory,
> memory access latency is going to be an issue.
>
> But if you meant something else, it went over my head.
>
> > The question of whether a memory access goes out to main memory is not
> > an obvious one, nor is the cost of various sorts of misses.  For
> > instance, some research presented at PLDI a year or two ago by ocaml
> > developers found that iteration of a singly-linked list was free, if
> > only you prefetched two nodes ahead.  GC technology is very advanced
> > these days and can automatically arrange to improve the locality of
> > running programs.
>
> That's an interesting insight, though I think it begs the question:
> ahead of what?
>
> > There are important qualitative differences between the data-parallel
> > gpu-oriented model and the pointer-chasing cpu-oriented model.
> > Regardless of whether long dependency chains are fundamentally slow or
> > not, some problems fundamentally _require_ long dependency chains to
> > solve, and pointer-chasers are simply better at dealing with this sort
> > of situation.  And while parallelism benefits all around,
> > pointer-chasers and cpus are better at more heterogeneous problems.
> > (Though I do think forks in j are a Good Idea, and that the fork
> > (f (g >) h t.'') is an important one.)
>
> Certainly.
>
> But, of course, the details matter.
>
> > I am not sure exactly what you mean by:
> >
> > > Compared to 1 nanosecond for L1 cache, this factor of 80 can easily be
> > > larger than O(log n).
>
> 80ns to fetch from main memory vs. 1 nanosecond to fetch from L1 cache
> is a factor of 80.
>
> > I assume it is an exhortation to consider the value of constant
> > factors.  But I don't see what it signifies.
>
> I thought I pointed that out.
>
> I'll try stating it differently: when comparing algorithms, 80 is
> bigger than log(n) for any value of n that matters in a performance
> comparison.
>
>    2^80x
> 1208925819614629174706176
>
> Nobody works with data sets which are that large.
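
To put numbers on that, even implausibly large data sets keep log2 n far
below 80:

   2 ^. 1e9     NB. log2 of a billion items
29.8974
   2 ^. 1e12    NB. log2 of a trillion items, still nowhere near 80
39.8631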
>
> --
> Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
