Re: Registers vs. Stack
On Thursday 21 August 2003 21:40, Brent Dax wrote:
> # we're already running with a faster opcode dispatch

Man, I wish I had the time to keep up with Parrot development. Though, as others have pointed out, the core architecture is fairly solidified by this point, I thought I'd put in my two cents' worth.

I completely agree that stack machines are for wimps ;) But I have a problem with some people's representation of stack machines. When was the last modern real CPU that actually performed push/pop operations for its stack? That entire argument is moot, in my opinion. Look at the SPARC chip as an example. You have a set of pre-defined, directly mappable registers which are appended to the stack; then you have your input parameters, your worst-case output parameters, and your local spill variables, all of which are pre-calculated at compile time, and a single number is computed from them. At the entry and exit of each function call, that number is added to and subtracted from the stack pointer. All subsequent "stack operations" are simply "ld/st [sp + offset], reg". If you were ballsy enough, you could do global variable allocation too, though if you're generating relocatable code, you might still have to add the address to your instruction pointer.

Thus, short of always having enough registers, you have to perform offset calculations, which is not much different from stack pushes/pops; only the paradigm is different. But there's another issue that I've seen brought up. By statically allocating spill/input/output variables at offsets from the stack pointer, you rid yourself of the issue of "where was that variable in the mix of pushes and pops?" You're guaranteed that a variable is at a specific address, albeit a relative one. There is no difference between performing

    add R1, 5      # R1 += 5

and

    add [SP+1], 5

especially if, at the opcode-executing level, R1 is defined as SP+R1_OFFSET.

Taking the register-spill analogy back to JITing.
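The R1-as-SP+R1_OFFSET equivalence can be sketched in a few lines of C (a toy illustration, not Parrot code; the offsets and helper are mine):

```c
#include <assert.h>

/* Hypothetical frame layout: each VM "register" is a fixed offset from
 * the stack pointer, precomputed at compile time. */
#define R1_OFFSET 0
#define R2_OFFSET 1

/* "add R1, 5" at the opcode-executing level: a load/modify/store at
 * sp[R1_OFFSET], indistinguishable from "add [SP+1], 5". */
static void add_imm(long *sp, int reg_offset, long imm) {
    sp[reg_offset] += imm;
}
```

Either way it compiles down to one base-plus-offset memory operation; only the naming differs.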
We don't know how big the CPU register set is at Parrot compile time, so we don't know what a good register-set size is. x86s are sadly still treated as accumulator machines (even with x86-64); there are just too many cool compiler techniques that don't work unless you have 32+ GPRs, so it's hardly worth the effort to test for possible optimizations with only 8. On the other hand, IA-64 with 100+ GPRs can unroll loops and map temporaries like there's no tomorrow.

The end result is that a dynamically sized register set is probably the ideal for a VM. If the compiler can assume that you have as many registers as you need, but is given the constraint of "please try not to use any more than you absolutely need" (a la generic Chaitin, or Chow's basic-block-based allocator), then in the rare case that an Itanium is in use, a full register mapping can occur. If we need to fall back to accumulator-style code, then we can utilize a raw vmStackFrame + offset, where vmStackFrame is itself kept in a register. It's also possible (albeit not as obvious) to have a hybrid that maps the first n variables to physical registers, for the common case of 32-register machines.

Now, in the case of Parrot, our stack (last I checked) is not homogeneous, so this simplistic scheme wouldn't work well. But there are two solutions that immediately occur to me.

Solution A) Treat the data type as trusted-opaque, with each slot large enough to hold the largest data type. E.g.:

    iadd    R1 <= R2, R3
    sconcat R4 <= R5, R6

We merely trust that the compiler won't mix and match data types in its offset assignments. We would still, of course, need to properly handle GC/DOD through the stack, so we couldn't be completely opaque. Input parameters to functions would have to either be statically sized, or there would have to be a special opcode to access dynamically sized input parameters of unknown types. A simple opcode regAlloc(numInputRegs, numLocalRegs) would shift the frame pointer such that the inputs become regs 1..numInputRegs, and the locals become numInputRegs+1 .. numInputRegs+numLocalRegs. This is somewhat similar to the Itanium register-allocating style.

Solution B) Have a multitude of homogeneous stacks. This is identical to Solution A, but trades complexity for performance. Namely, there would be:

    intStack
    fpStack
    strStack
    objStack

The register-allocation opcode would also require four pairs of sizes. Additionally, the compiler must maintain four separate input/output/local variable-to-register mappings. The advantages are:

* no problems with typecasting parameters
* GCing is more efficient (it's guaranteed that all non-null refs found in the str/obj stacks need DODing, so there's no need to test the stack-element type on each iteration)
* it maps more properly onto integer/floating-point register sets; the str/obj stacks need external referencing anyway

Well, again, just my $0.02. But I just felt the need to defend "practical" stack computing.
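The regAlloc frame-pointer shift can be sketched as a register-window slide, under the assumption that registers are slots in one growable homogeneous stack (all names here are mine, not Parrot's):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register-window stack: regs are frame-relative slots. */
typedef struct {
    long   slots[1024];
    size_t fp;           /* frame pointer: base of the current window */
} RegStack;

/* reg_alloc(numInputRegs, numLocalRegs): the caller has already placed
 * the inputs at the top of its window; slide fp so the callee sees the
 * inputs as regs 0..numInputRegs-1, with locals following them. */
static size_t reg_alloc(RegStack *s, size_t caller_top,
                        size_t num_inputs, size_t num_locals) {
    size_t old_fp = s->fp;
    s->fp = caller_top - num_inputs;  /* inputs overlap caller's outputs */
    (void)num_locals;                 /* locals live above the inputs */
    return old_fp;                    /* saved for the matching return */
}

static long *reg(RegStack *s, size_t n) { return &s->slots[s->fp + n]; }
```

As in the Itanium style, the caller's outgoing arguments become the callee's low-numbered registers with no copying.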
Re: Interpreter memory allocation
On Fri, 15 Feb 2002, Alex Gough wrote:
> On Thu, 14 Feb 2002, Dan Sugalski wrote:
>
> > To allocate memory that is GCable, call Parrot_allocate(interpreter,
> > size). Then stash the pointer and size in your buffer header, or
> > it'll go missing later. To resize a chunk of memory, call
> > mem_realloc(interpreter, from, oldsize, newsize).
>
> Having to pass in the oldsize makes it very tricky to use wrappers
> around mem_realloc which work when code isn't in parrot. Is it
> not possible to have the memory pools themselves be a bit more like
> malloc/realloc in tracking allocated sizes (I imagine they need to do
> this anyway, if GC is to free chunks appropriately)?

While this is certainly possible, I believe the intent was to allow the Parrot core to have as much control of, and access to, the data set as possible. Normally you'd have to use gc_mem_sizeof( ptr ) to get the size, which would encourage caching of the size value (possibly in the local structure). By having the size shared by both the user-defined structure and the GC, you avoid both the function call and the wasted memory. This is just speculation on my part, though.

> > If your buffer header can't be reached from the root set, you'll end
> > up having it reclaimed when a sweep is made.
>
> Which bits of a PMC count as being reachable from the root set?

Everything in any stack (including register stacks), plus PMCs shared between threads, and finally those memory chunks explicitly registered as foreign (e.g. a manually administered reference).
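For what it's worth, a size-tracking wrapper of the kind Alex wants is a small amount of code; here's a generic sketch over plain malloc/realloc (this is not Parrot's API, just an illustration of stashing the size in a header so callers never pass oldsize):

```c
#include <assert.h>
#include <stdlib.h>

/* Generic size-tracking wrapper: stash the allocated size in a small
 * header just before the user pointer. */
typedef struct { size_t size; } BufHeader;

static void *sized_alloc(size_t size) {
    BufHeader *h = malloc(sizeof *h + size);
    if (!h) return NULL;
    h->size = size;
    return h + 1;                    /* user sees memory after the header */
}

static size_t sized_sizeof(void *p) {
    return ((BufHeader *)p - 1)->size;
}

static void *sized_realloc(void *p, size_t newsize) {
    BufHeader *h = realloc((BufHeader *)p - 1, sizeof *h + newsize);
    if (!h) return NULL;
    h->size = newsize;
    return h + 1;
}
```

The trade-off is exactly the one described above: the header costs a word per allocation, which sharing the size field with the user structure would avoid.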
Re: Configger this.
On Fri, 7 Dec 2001, Andy Dougherty wrote:
> On Fri, 7 Dec 2001, Bryan C. Warnock wrote:
>
> > On Friday 07 December 2001 08:43 am, Andy Dougherty wrote:
> > > Funny you should mention that, because Perl's Configure does things in
> > > order determined by 'Dependency-ish rules, a la make'. Configure is
> > > indeed built in just the way you suggest.
> >
> > Except, of course, for being one big honking file.
>
> That's a mere implementation detail :-). (Though one that's admittedly
> quite intimidating!) It isn't one big file until the very very end step.
> There's no reason it couldn't be a file that simply called a series of
> tiny scripts
>
> #!/bin/sh
> . U/intro
> . U/find-shell
> . U/cmdline
> . U/find-hints
>
> etc., except that it would then be even slower than it already is.
> (Probably not a significant effect now, but it would have been when
> Configure was originally designed.)

I was thinking more along the lines of BSD booting:

    #!/bin/sh
    for x in `ls U/* | sort`; do
        $x
    done

Performance really shouldn't be that big of a deal, though ls/sort plus the whole shell-script nature are rather Unix-centric. A perl5-based configure could easily be:

    #!perl
    opendir DH, "U" or die "can't open U: $!";
    @files = sort { $a cmp $b } grep { /^\d/ } readdir DH;
    closedir DH;
    system("U/$_") for @files;

The reason for exec'ing, and not sourcing, was to allow local variables within each module (a la the BSD init.d files), but that's optional if performance is really an issue (I don't see why it should be). Now people just add conf-files and produce their own dependency list. Since we're comparing strings, not numbers, sub-versions work like the Library of Congress:

    1
    2   (dep on 1)
    3   (dep on 2)
    25  (comes before 3)
    26
    255 (comes before 26)

Though that might look a bit confusing; we can always use "%02i" notation.

-Michael
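The string-vs-numeric ordering is easy to verify directly (a trivial demonstration of mine, using strcmp as the stand-in for the sort comparison):

```c
#include <assert.h>
#include <string.h>

/* Library-of-Congress-style ordering: filenames compare as strings,
 * character by character, so "25" sorts before "3" and "255" before "26". */
static int str_before(const char *a, const char *b) {
    return strcmp(a, b) < 0;
}
```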
Re: [PATCH classes/perlnum.pmc] Use C please, not C++.
While your point is taken, it's hardly considered "C++" anymore; many C compilers have adopted such useful features (C99 standardized // comments, in fact).

On Wed, 28 Nov 2001, Andy Dougherty wrote:
> diff -r -u parrot-current/classes/perlnum.pmc parrot-andy/classes/perlnum.pmc
>      void set_integer (PMC* value) {
> -//SELF->vtable = &(Parrot_base_vtables[enum_class_PerlInt]);
> +        /* SELF->vtable = &(Parrot_base_vtables[enum_class_PerlInt]); */
>          SELF->cache.num_val = (FLOATVAL)value->vtable->get_integer(INTERP, value);
>      }

-Michael
mem manager direction
I've done a bunch of reading, and though I'm not finished, I'm starting to lean towards the following overall algorithm, based on the assumptions specified below. I'm not _necessarily_ looking for comments at this point, because I haven't finished evaluating the specifics of several algorithms, but feel free to comment.

Foreword: I'm looking at a hybrid of several algorithms, since different regions have different advantages and disadvantages. The typical downside of hybridization is extra overhead in the inner loop; here, the cost is a greater number of branches in the inner loop. You can skip to the posted algorithm pseudo-code for a quick overview. Since we're inlining the data size, I'm selecting which sub-algorithm to use based on the size. The data type (be it a raw character buffer like a string, or a nested data structure such as an Array) will have to be known, and thus interleaved within the GC for now. What I'd like to do is define vtable-ish ops for each data type, and have the perl make-script generate the core routine by concatenating the different types. I don't want the overhead of a vtable function call, but I don't want to expose the garbage collector to extension writers either. More to come on this.

Small data-types (size <= 256 bytes): Ideally there's zero header space, due to the relative space cost, and since we're already using the size, this should be possible. For now, I'm looking towards a generic copying collector. Generational schemes can require additional header space (either for marking or for boundary references). The only advantage of making this generational would be to minimize the scan depth (most generational algorithms don't descend through data structures living in aged heaps), but that requires a write barrier, and I'm currently opposed to one (it defeats much of the advantage of abstracted garbage collection over ref-counting). I'm still considering it for terminal (non-descending) data-types.
I figure small data-types will be the most common by far, and so the less work done per object the better. Further, since each object will probably be visited on each reclamation stage, we don't get to take advantage of anti-swap-thrashing techniques without imposing relatively high out-of-band overhead. The associated heap is grown (doubled in size) whenever less than 33% of the heap is available after a reclamation. Unfortunately we don't know that we want to grow the heap until we've finished reclaiming, and a regrowth involves another copying phase, so growth is deferred until the next collection. I found that this isn't so bad if we keep the max data size much less than 33% of the heap. I'm looking at 64K as a starting heap size, and 256B is much less than 21K.

Medium data-types (256 < size <= page-size): Here we can afford header prefixes, while copies are more expensive (relative to other operations). Allocations in this region can be pretty erratic and thus hard to coalesce, so I'm looking at using a generational scheme. Unfortunately many such schemes "age" the wrong data, or require a lot of extra copying. Some of the nicer schemes only require comparison to a water-mark (below which an object is considered aged, and above which the object is considered young and thus likely to expire soon). Such watermark schemes allow adaptive modification of the water-mark(s). In other words, when we find that most data is not dying, we want to raise the water-mark (to avoid unneeded copying); when we find that a lot of aged data is dying, we want to lower the water-mark (to alleviate garbage in the aged region). I'm currently fiddling with a compacting collector here. It requires two stages. The first stage applies a "version" tag prefixed to each object (a 4-byte integer, where possible). On each execution of gc_reclaim, the version number is incremented.
In the Dead-Object-Detection stage (which, by comparison, is also where small objects are copied out to their semi-space), the versions of all live objects are updated. Additionally, the allocated sizes of both the young and old regions are calculated. Unlike the copying collector, we have the option of growing the mid-sized heap before performing the copying/compaction, providing more up-to-date and thus accurate feedback. If we don't copy, then we only compact the "young" region. Thus the advantage of aging objects is to avoid copying them (at the expense of leaving fragments). Alternatively, if we determine that there is too much free space in the aged section, we can temporarily reset the water-mark and compact that region too. The big advantage of compaction is that age is directly related to how far up the heap we are; objects near the top are guaranteed to be younger than those at lower levels. The water-mark is therefore an accurate representation of age. Since this heap is efficiently growable, I'm looking to start it at 16K (or some fraction of the semi-space), since each thread requires its own local
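A skeletal sketch of the size-based dispatch and growth policy described above (thresholds come from the post; the function names and the page-size constant are placeholders of my own):

```c
#include <assert.h>
#include <stddef.h>

/* Size classes from the proposal: small objects go to the copying
 * collector's heap, medium ones to the compacting generational heap,
 * and anything larger to low-level allocation (sbrk/mmap). */
#define SMALL_MAX 256
#define PAGE_SIZE 4096

enum heap_kind { HEAP_SMALL, HEAP_MEDIUM, HEAP_LARGE };

static enum heap_kind pick_heap(size_t size) {
    if (size <= SMALL_MAX) return HEAP_SMALL;
    if (size <= PAGE_SIZE) return HEAP_MEDIUM;
    return HEAP_LARGE;
}

/* Growth policy for the small heap: double whenever less than a third
 * of the heap is free after a reclamation pass. */
static size_t maybe_grow(size_t heap_size, size_t free_bytes) {
    return (free_bytes * 3 < heap_size) ? heap_size * 2 : heap_size;
}
```

The branch cost of this dispatch is exactly the "greater number of branches in the inner loop" trade-off mentioned in the foreword.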
Re: Calling conventions -- easier way?
On Fri, 19 Oct 2001, Gregor N. Purdy wrote:
> Dan --
> FWIW, I'd rather not dedicate registers to special uses at the Parrot
> level. Jako reserves [INPS]0 for temporaries, but that is at its
> discretion, not dictated by Parrot (and I like it that way :). I was
> wishing for separate data and address stacks when I was writing the
> Jako calling convention stuff, but I can live with the rotate solution.
> If the choices come down to reserved registers or separate stacks,
> though, I'm squarely in the separate stacks camp.

Not that it matters at this stage, but if we rely on compilers for temporaries, then what happens when we integrate different compiled binaries? MIPS (and I believe Alpha) was good at reserved registers; aside from "macro'd ops", most reserved registers are only reserved at subroutine-call time.

Eventually all the compilers are going to have to deal with register renaming (we're going to run out of registers for a lot of code, and I don't think pushing and popping the register stack is going to be very helpful). I'd be curious to read up on register-renaming compiler strategies (anybody have any online references?). I guess the basic algorithm is to assume an infinite number of registers, then perform data-dependency checks and reduce the tree.

In my upcoming post, I suggest a "peek R, i|ic" opcode which allows you to store fundamental datatypes in subroutine-local stack memory (like hardware CPUs do). If all variables are mapped within the stack frame, then register renaming should merely consist of pushing unused values to their associated frame offset, reusing that register, and extracting its replacement value from another address.

-Michael
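That spill/reload idea can be illustrated with a toy model, assuming each virtual register owns a fixed home slot in the frame (all names and the two-register limit are mine):

```c
#include <assert.h>

/* Toy spill model: NPHYS physical registers; each virtual register has
 * a fixed frame slot it can be spilled to and reloaded from. */
#define NPHYS 2

typedef struct {
    long frame[16];      /* home slots, indexed by virtual register */
    long phys[NPHYS];    /* physical register file */
    int  vreg_in[NPHYS]; /* which virtual register occupies each phys reg */
} Machine;

/* Evict whatever lives in physical register p, then load vreg v:
 * spill the old value to its home slot, reload the new occupant. */
static void remap(Machine *m, int p, int v) {
    m->frame[m->vreg_in[p]] = m->phys[p];
    m->phys[p] = m->frame[v];
    m->vreg_in[p] = v;
}
```

Because every value has a known frame address, eviction never needs a search: it's one store and one load.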
Re: Calling conventions -- easier way?
On Fri, 19 Oct 2001, Dan Sugalski wrote:
> At 01:24 PM 10/19/2001 -0400, Gregor N. Purdy wrote:
> >James --
> >
> >Should we have bsr(i|ic, i|ic), that jumps to $1, with the return
> >address below the $2 arguments? Similarly, should we have ret(i|ic),
> >that rotates the return address out from under $1 arguments and then
> >returns to it?
>
> Nope. For a number of reasons, the biggest being that we're not returning
> our return values on the stack. :)
>
> > > OTOH, we could keep our current ABI, and pop the return address into an I
> > > register, and then push it and ret, or jmp to it.
> >
> >The return address is a pointer, which doesn't necessarily fit in an IV,
> >I believe. I played with doing this first, but didn't like having to use
> >a register for it. If we wanted registers to be part of the calling
> >convention, we should just say so, but again we run into the question of
> >which type of register to use.
>
> I'm currently leaning either towards returning values in registers
> [PSIN][0-4] with the total number of each type in register I0 somehow, or
> an arg stack that's separate from the call stack. The latter is probably
> the better way to go, but it's got more overhead in some ways. It'd make
> interfacing with RPN languages easier, though.

I was in the middle of writing up something on this topic. I'm still working on the details, so I'll summarize: the more out-of-band data the better. It's faster (no munging an all-in-one stack), more efficient (we don't have to abstract its type), and we're not limited the way some hardware is (e.g. to a single stack segment). Most of my bias is based on a paper on stack-based machines presented by someone earlier on this mailing list. It was a good paper, though somewhat dated. The only problem with having out-of-band return addresses is syncing up with the stack (unrolling it for exceptions or frame analysis).
I was working on a concept that used segregated stack frames for the parameter stack, with the stack-frame header referencing the top of the return-address stack. Alternatively, the return-address stack could contain a reference into the parameter stack(s). The latter would be easier for unrolling, but makes the pushed value a little more complex. I'm not particular on the details, though I'm still working on them, but I definitely think out-of-line return addresses are a "good thing"(tm).

-Michael
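The second variant (return-address entries carrying a reference into the parameter stack) might look roughly like this; a sketch under my own naming, not a proposed Parrot API:

```c
#include <assert.h>
#include <stddef.h>

/* Out-of-band return-address stack: each entry records the return
 * target plus the parameter-stack depth at call time, so unwinding for
 * an exception can resync both stacks in one step. */
typedef struct {
    size_t return_pc;    /* where to resume in the bytecode */
    size_t param_depth;  /* parameter-stack top at the moment of the call */
} CallRecord;

typedef struct {
    CallRecord calls[64];
    size_t     ncalls;
    long       params[256];
    size_t     nparams;
} Stacks;

static void vm_call(Stacks *s, size_t return_pc) {
    s->calls[s->ncalls].return_pc   = return_pc;
    s->calls[s->ncalls].param_depth = s->nparams;
    s->ncalls++;
}

/* Pop one frame: restore the parameter stack and hand back the PC. */
static size_t vm_return(Stacks *s) {
    CallRecord *c = &s->calls[--s->ncalls];
    s->nparams = c->param_depth;
    return c->return_pc;
}
```

This is the "pushed value is a little more complex" trade-off: two words per call instead of one, in exchange for trivial unrolling.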
Re: A warning on threads...
> I'm about to start welding in support for multiple interpreters, running
> both serially and simultaneously, into Parrot. (With provisions for
> starting and coordinating interpreters from other interpreters)
>
> This is just a heads-up, since it probably means platforms without POSIX
> thread support will need to provide some workarounds. (I'll be putting
> together a generic low-level thread interface with stub routines/#defines
> to make things easier) FWIW, I do *not* plan on supporting POSIX d4
> threads--final draft or nothin'. (Or, rather, final draft or someone else
> writes the wrappers...)
>
> Whether threads of some sort will be required for Parrot is up in the air--I
> want to wait and see what sort of performance impact there is before making
> that decision.
>
> Dan

Well, the memory allocator is most definitely affected by having threading enabled at compile time. The default GNU (glibc) memory allocator assumes threading, as I'm sure the Solaris one does, so traditional builds (a la Red Hat) are not going to be affected (assuming there is a "use-the-OS-malloc" configuration flag like in perl5). The simple allocators (like perl5's) just use one big lock, which adds a fixed overhead for both threaded and non-threaded use (though it hurts scalability). But glibc, and the Solaris magazine architecture (proposed in the earlier email), require a redesign which adds a tremendous amount of overhead for single threading (though they scale nicely).

I've been avidly reading up on memory allocators and writing my own version of a magazine allocator. I deviate substantially from the Sun paper, given that they suggest just using sbrk and mmap for anything below the slab (since they trust that the vmem architecture in the kernel is fast enough). This clearly isn't acceptable on non-Solaris machines, which may be horribly slow at mmap.
Additionally, I'm taking advantage of the per-thread interpreter, which avoids having to do ANY locking for the simplest case of small allocs/frees. More complex cases, such as larger allocs which are too sensitive to unused memory, trade locking contention for free space. The largest allocs (as with most memory schemes) are handled with simple low-level access (sbrk/mmap). My current incarnation is very modular, which means that there are lots of function calls in the worst case. Once I debug it fully, I'll look into consolidating everything into one large function call, which should avoid some redundant locking and further speed things up.

The bad part about Sun's paper is that their benchmarks are hogwash. One of them shows performance when you use alloc/dealloc pairs of a fixed size. This is the optimal case for their allocation scheme, since it only involves a handful of assembly instructions in highly CPU-local cached memory. My benchmark takes a rather large array and randomly chooses cells within it. If the cell value is null, it makes a randomly sized allocation; if it's filled, it frees it. This at least simulates multi-sized, multi-lifetimed operations. My multi-threaded simulator simply utilizes n thread-local arrays, each of size max_array_size/n.

Assuming that we utilize one interpreter per thread, there will be a significantly reduced need for locking (and for CPU-cache sharing on multi-CPU architectures). However, when we "require" multithreading, there is the temptation of handing off functionality like garbage collection to its own thread. MT is almost always slower (more total CPU time) and more memory-hogging than ST (though obviously less real time can pass, in the rare case of multiple CPUs). Additionally, I'd argue that most apps are single-threaded in design, and wouldn't value the performance lost in trade for threaded functionality.
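The random-cell benchmark described above might look roughly like this (a single-threaded sketch with parameter choices of my own, not the actual harness):

```c
#include <assert.h>
#include <stdlib.h>

/* Random-lifetime allocator exercise: pick a random cell; if empty,
 * allocate a random-sized block into it, otherwise free it. This mixes
 * sizes and lifetimes, unlike fixed-size alloc/free-pair benchmarks. */
#define CELLS 4096
#define MAX_ALLOC 512

static void *cells[CELLS];

static unsigned long churn(unsigned long iterations, unsigned int seed) {
    unsigned long allocs = 0;
    srand(seed);
    for (unsigned long i = 0; i < iterations; i++) {
        int c = rand() % CELLS;
        if (cells[c] == NULL) {
            cells[c] = malloc(1 + (size_t)(rand() % MAX_ALLOC));
            allocs++;
        } else {
            free(cells[c]);
            cells[c] = NULL;
        }
    }
    return allocs;
}
```

The multi-threaded variant is the same loop run over n disjoint thread-local arrays of size CELLS/n, which is what removes the locking from the common path.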
Perl5 runs measurably slower when built for threading (even with only a single thread), and I believe that's mostly due to the extra pthread function calls. Granted, Solaris has a nice and fast proprietary MT library, but that doesn't do the general (platform-independent) case any good. I think it's fair to assume no additional logic/locking occurs within the op-codes (instead, locks are relegated to specialized API functions that an op-code might eventually call). So depending on how the GC (and memory allocation in general) handles MT compilation, we shouldn't see too bad a slow-down. But I'd still like the option of a non-MT compiled core for "hell-bent execution", which I've regularly found to be useful (operating on multi-meg data-sets which can take half an hour or more to process).

-Michael
Re: Fetching the PC?
On Fri, 12 Oct 2001, Dan Sugalski wrote:
> At 01:30 PM 10/12/2001 -0400, Michael Maraist wrote:
> >I'm of the opinion that you shouldn't just be able to jump into
> >another code-segment.
>
> [Snip]
>
> >And thus be prevented from changing context; That would be relegated
> >to a subroutine invocation (especially since such subroutines can be
> >dynamically modified).
>
> Ummm... what do you think we're using to call into subroutines? Or just do
> global control-flow changes?

True, even a supposed callCV or gotoCV would have to return something valid. My current impression of the direction was to utilize callCV in a manner similar to "exception handling", which the last time I heard would look something like this:

    ne I1, I2, good
    setException ...
    end
    good:

It stood to reason that callCV could do the same thing. Assuming for a moment that the direction isn't completely different, the way I see this is:

    // usage:
    //   callCV P5
    //   end
    //   fooOP    // executed on return
    AUTO_OP callCV_p {
        PMC cv = PMC_REG(P1);
        if ( !isCV(cv) ) {
            EXCEPTION(.)
            RETURN  // auto-next-op, which is end (handles exception)
        }
        // excluded in gotoCV
        push( interp->return_stack, code + CUR_OP_SIZE + END_OP_SIZE );
        if ( isCVInCurContext(cv, interp) ) {
            // set up local return value
            RETURN(getPCFromCV(cv));  // jump
        } else {
            interp->next_pc = getPCFromCV(cv);
            // further setup of the return value
            RETURN  // auto-next-op, which is end (does an inter-module jump)
        }
    }

The advantage is that any context/scope-cached values in the do-loop can be flushed on a context switch. To minimize the overhead (of a C call) we optimize for intra-context calls, whatever that might entail; I'm guessing intra-module calls. Well, right off the bat I admit that this method is paradoxical, since PMC->op comes from a vtable, separate from the op-codes. Such a vtable function handler can't just call into the op-code table, so things get more complex anyway. Nonetheless, this is the basic idea I had in mind.

-Michael
Re: Fetching the PC?
On Fri, 12 Oct 2001, Ritz Daniel wrote:
> > >i fixed that. but ther's only a jump_i, no jump_ic...
> > >
> > >"jump Ix" now jumps to the absolute address held in a register Ix.
> > >Absolute means start of the bytecode + Ix.
> >
> > It can't mean that. This:
> >
> >     set I0, 0
> >     jump I0
> >
> > should coredump.
>
> i think parrot is a virtual cpu so the word absolute should be seen in context
> of the vm, not the real cpu we are running on. within the vm address 0 should
> be address 0 of the bytecode, not the real cpu. but it would be nice to have a
> null pointer. so what about the first instruction in bytecode is at vm address
> 1? so a
>     set I0, 1
>     jump I0
> would be a program restart and 0 is the null pointer for which we can check
> with a conditional branch... and a jump to address 0 results in a crash
> (it does so on my windoze).
>
> but the best solution in my opinion would be:
> first instruction in bytecode is an 'end' (at address 0), then the program itself
> starts at address 1. a jump to address 1 is a program restart, we can check
> for null pointer with a conditional branch, and a jump to address 0 would be an
> immediate program end (it doesn't look nice if an interpreter core dumps). or
> better than an 'end' is an opcode that throws some error messages and ends the
> program but does not core dump.

I started out with an annoying question, but then managed to answer it myself. Still, I think I can clear up some confusion as to the practicals. What you are suggesting is basically:

    unsigned int PC = interp->pc;
    int* code = interp->cur_code_segment;
    optable_t* optable = interp->optable;
    while( code[PC] ) {
        PC = optable[ code[PC] ]( code, PC, interp );
    }

A bounded version of the op-loop would simply be:

    while( PC < max_code_size && code[PC] ) DO_OP ...

since we're already zero-bounded below (due to the unsigned int). This requires an extra level of indirection in most op-codes as well as in the main loop (not to mention the marginal overhead of an extra parameter).
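A minimal runnable model of that bounded loop, with a toy opcode set of my own (0 = end, 1 = increment an accumulator, 2 = absolute jump) rather than Parrot's DO_OP:

```c
#include <assert.h>
#include <stddef.h>

/* Bounded dispatch loop: terminates both on the 0/end opcode and on
 * running off the end of the code segment. */
static long run_bounded(const int *code, size_t max_code_size) {
    long acc = 0;
    size_t pc = 0;
    while (pc < max_code_size && code[pc]) {
        switch (code[pc]) {
        case 1: acc++; pc += 1; break;
        case 2: pc = (size_t)code[pc + 1]; break;  /* absolute VM-address jump */
        default: return -1;                        /* unknown opcode */
        }
    }
    return acc;
}
```

A "jump 0" here simply lands on the end opcode and halts, which is the null-pointer-as-terminator behavior Daniel describes.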
The assembler would have to do:

    s/ P(\d) / code[PC + $1] /x;

instead of just:

    s/ P(\d) / code[ $1 ] /x;

I'm of the opinion that you shouldn't just be able to jump into another code segment; the interpreter core should be manipulated instead (to change the context, such as bounds). In either the code[PC] or the *code scheme, the jump is still:

    AUTO_OP jump_i {
        return( INT_REG(P1) );
    }

And thus it would be prevented from changing context; that would be relegated to a subroutine invocation (especially since such subroutines can be dynamically modified). I'll have to see some real subroutine implementations before I can support this method, though. I'm curious to see whether gcc -O2 can alleviate the overhead of *(code + PC + offset) for the parameters. For constant offsets, the x86 does a good job (at least at the assembly level; I know the micro-ops still require an extra add, but they may just be using a 3-way add).

In any case, I think we're all in agreement about not remapping the physical C addresses, but for completeness I'll give the reasoning. Given that moduleA will have interp->code range from, say, 28M to 28.4M, and moduleB will have interp->code range from 41.4M to 42.2M, etc. (wherever mmap assigns the address), it would be almost impossible to map PC to a linear physical address. Obviously PC can't be a contiguous zero-based address, since it's not code[PC]; it's PC = cur_code + rel_branch; DO_OP( PC ). Physically indexed jumps, therefore, are meaningless. "jump 500", where 500 is a compile-time constant, is really trying to say jump interp->code_base[ 500 ], not PC = 500; DO_OP( PC ), since that would be a core dump (C-memory address 500 is off limits). But "jump interp->code_base[ 500 ]" is physically no more beneficial than "branch PC - label". Jump-register can still be useful for certain optimizations (i.e. switch).
Note that my original take was that we should have either while(code) DO_OP or while(code && *code) DO_OP as the fast do-loop, since:

    set I1, 0
    jump I1

would just act like exit. Then I thought about:

    set I1, 1
    jump I1

which would most definitely core-dump. Thus we're not really much safer checking code's address than any other special value. The benefits of while(code[PC]) are potentially outweighed by the overhead of code[PC + arg_offset]. Thus I'm mostly inclined to always support the current "safe" method ("while(code > start && code < end)").

> -daniel
>
> > >the following code will result in a simple program restart,
> > >no core dump:
> > >     set I1, 0
> > >     jump I1
> > >
> > >the fixed jump breaks the tests: basic5, call.pasm, jump.pasm
> > >but i wonder why nobody realized that jump's broken, the doc says it jumps
> > >_to_ the address, not forward or backward
> >
> > That was brought up a while ago, but I don't think anyone's had time to put
> > a patch in. I'm working on stack and jsr support, so I'll fix it then.
> >
> > Dan

-Michael
Re: rand/srand patch
> At 08:48 PM 10/11/2001 -0400, Ryan O'Neil wrote:
> >I was playing with Parrot and wanted a basic random numbers
> >implementation. Just in case anyone else wants it too, here are the
> >appropriate diffs and test file. It seemed logical for rand to return a
> >real number between 0 and 1 instead of having any reliance on
> >RAND_MAX. Any other ideas?
>
> I'm not 100% sure I want random number generation to be a fundamental part
> of the interpreter. (Though it *is* really appealing...) This sort of thing
> might be better put in the standard library rather than built in as a core
> opcode.
>
> Nifty, though. :)

My take on this follows from Larry's attempt to "Huffman encode" the language: what do we use most often, and how efficient should that stuff be? It seems that a log-string opcode could be very useful to module developers; if they can't get system modules to load properly, then a guaranteed log opcode would be invaluable. Alternatively, things like time and rand are quick little functions that are very useful in inner loops; profiling and optimized loops would really love to have them. The only negative I see is in the compiler complexity (the CISC vs. RISC argument). I don't know how much time a system-library call will take (meaning this point might be moot if it's fast enough), plus time and rand are probably good candidates for overloading, which would require higher-than-opcode status. Oh well.

-Michael
Re: [PATCH] Big patch to have DO_OP as optional switch() statment
> void
> runloop(int *code) {
>     static void* array[] = { &&op_end, &&op1, &&op2 };
>
> start:
>     goto *array[*code];
>
> op1:
>     foo; goto start;
>
> op2:
>     foo; goto start;
>
> op_end:
>     return;
> } // end code
>
> w/ code=1000, loop_cnt = 5,000,000
> gcc -O3 speeds.c
> --
> switcher time delta=195
> caller time delta=268
> jumper time delta=185
> code compile time delta=0
> fast jumper time delta=154

I saw another post on the improvements provided by the simple modification of:

    goto *array[*code];
    lbl1:
        # do stuff
        goto *array[*code];
    lbl2:
        # do stuff
        goto *array[*code];
    ...
    end:
        return;

And they were right. I received MASSIVE performance boosts:

    gcc -O3 w/ 1000 ops and 5,000,000 loops
    jumper code = 31/32 seconds
    fast jumper = 32/31 seconds

    gcc -O3 w/ 50,000,000 ops and 10 loops
    jumper code = 14 seconds
    fast jumper = 13 seconds

This is a FULL 500% faster than the fastest code above. Note also that the "fast jumper" is now only on par with the portable version. (The fast jumper modified the source code to insert physical addresses instead of using indexes into the array.)

There are two possible reasons for this. One is cache-related issues, but I highly doubt it, because even the 50-meg (non-cacheable) code was insanely fast (thanks mostly to predictable pre-caching). The removal of a branch is the more likely reason: previously, only about two instructions ran before each branch, so the pipeline couldn't be utilized. Note that more complex instructions wouldn't help; they'd just be more efficient. What strikes me as odd, though, is that I had used a constant jump (goto start), which the Pentium Pro line is supposed to handle efficiently.

The next thing to test is having 100 core op-codes. Then it'll be interesting to see how horrible adding the conditionals is. Theoretically we don't need any conditionals: "change_state" op-codes can simply return NULL, just like end does, thereby breaking out of the loop automatically.
I never understood the reasoning behind while(*code) instead of just while(code) (which is properly handled here). Obviously there's value in while( code > start && code < end ), but that belongs in the non-hell-bent loop.

Oh, by the way: 500M ops in 15 seconds is 1.02 Billion parrot ops / second. This isn't too shabby on a mere 466MHz machine. That's a scant 2.18 parrot ops per cycle, getting pretty close to the theoretical max that architecture can perform. It ran even faster with the non-cached memory (200Meg sys-ram) than with short cachable bursts, but that's mostly because the prefetcher was always correct. Note that this was only because there were just 3 assembly instructions per op-code, and all the op-codes were of the same form (inc, get-next-op, jump). Pure-mathematical real parrot-ops should get pretty close to this, but most parrot-ops will have evil conditionals that thwart such performance.

-Michael
Re: [PATCH] Big patch to have DO_OP as optional switch() statement
On Tue, 9 Oct 2001, Dan Sugalski wrote:
> At 01:20 PM 10/9/2001 +0200, Paolo Molaro wrote:
> >On 10/07/01 Bryan C. Warnock wrote:
> > > while (*pc) {
> > >     switch (*pc) {
> > >     }
> > > }
> If anyone wants an official ruling...
>
> DO_OP can contain any code you like as long as it:
>
> *) Some version ultimately compiles everywhere
> *) Allows opcodes above a certain point (probably 256 or 512) to be
> overridden at runtime
>
> Computed gotos, switches, function table dispatches, generated machine
> code, or some other form of coding madness are all OK. I don't care which
> way a particular platform goes as long as it doesn't constrain another
> platform's ability to go another way.

Hehe. Ok, I delved into the particulars of gcc and found:

void runloop(int**code) {
    static void* array[] = { &&op_end, &&op1, &&op2 };
start:
    goto *array[*code];
op1:
    foo; goto start;
op2:
    foo; goto start;
op_end:
    return;
} // end code

The key was &&lbl, which returns the physical address of a jumpable label. This is highly gcc-specific, but given that we can use #ifdef's I don't see a problem with it. :)

I even wrote an ugly version that avoided that array indirection. What it did was "convert" the code-base, changing the op-codes into physical addresses. This obviously has some negative ramifications, but is definitely doable. By removing that extra indirection I got an extra 12% improvement. I can't imagine you'll get much faster than that (except with an inlining JIT or a perl6->C compiler). The only problem was that the &&lbl's didn't want to be non-local, so I had to perform some evil magic.

The unpolished code is attached, with a cheesy instruction set that was just designed to prevent the optimizer from optimizing away my op-codes.
Here are the benchmarks on a 466MHz dual-proc Celeron:

w/ code=500,000,000 and loop_cnt = 10 (cache-bound)

gcc speeds.c:
    switcher time delta=45
    caller time delta=47
    jumper time delta=43
    code compile time delta=4
    fast jumper time delta=34

gcc -O speeds.c:
    switcher time delta=30
    caller time delta=41
    jumper time delta=28
    code compile time delta=3
    fast jumper time delta=26

gcc -O2 speeds.c:
    switcher time delta=30
    caller time delta=41
    jumper time delta=28
    code compile time delta=3
    fast jumper time delta=25

gcc -O3 speeds.c:
    switcher time delta=29
    caller time delta=41
    jumper time delta=28
    code compile time delta=2
    fast jumper time delta=26

w/ code=1000, loop_cnt = 5,000,000

gcc -O3 speeds.c:
    switcher time delta=195
    caller time delta=268
    jumper time delta=185
    code compile time delta=0
    fast jumper time delta=154

# Note a marginal improvement from switch -> jump, yet an even bigger
# improvement from jump to fast-jump.

It's unlikely we'd have a 50M-op (x4B+) code-base. But it's quite possible to have an inner loop with 500,000 iterations.
-Michael

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

//#define SIZE 5000
#define SIZE 1000

int codep[SIZE];

void dump_mappings() {
    char strbuf[1024];
    int pid = getpid();
    sprintf( strbuf, "cat /proc/%i/maps", pid );
    system( strbuf );
    sprintf( strbuf, "ls -l /proc/%i/fd", pid );
    system( strbuf );
    printf("press enter to continue");
    scanf("%c",strbuf);
}

int x,y,z;

void switcher( int*code) {
    while(1)
        switch(*code) {
            case 0: goto DONE; break;
            case 1: x++;       code += 1; break;
            case 2: y++;       code += 1; break;
            case 3: z++;       code += 1; break;
            case 4: x = y + z; code += 1; break;
        }
    DONE: ;
} // end switcher

int* end(int*code) { return 0; }
int* o1(int*code)  { x++;       return code+1; }
int* o2(int*code)  { y++;       return code+1; }
int* o3(int*code)  { z++;       return code+1; }
int* o4(int*code)  { x = y + z; return code+1; }

typedef int* (*op_handler_t)(int*code);
op_handler_t op_codes[] = { end, o1, o2, o3, o4 };

void caller( int*code) {
    while(*code)
        code = op_codes[ *code ](code);
} // end caller

void gotoer( int*code) {
    static void* array[] = { &&lend, &&lo1, &&lo2, &&lo3, &&lo4 };
start:
    /*
    if ( !code )     { puts("null code");   exit(-1); }
    if ( *code > 4 ) { puts("Invalid idx"); exit(-1); }
    */
    goto *array[ *code ];
lo1: x++;       code += 1; goto start;
lo2: y++;       code += 1; goto start;
lo3: z++;       code += 1; goto start;
lo4: x = y + z; code += 1; goto start;
lend:
    return;
} // end gotoer

int* convert( int size, int*code, int*map ) {
    int* retval, *base;
    int idx = 0;
    //dump_mappings();
    retval = base = (int*)malloc(size * sizeof(int));
    if ( !retval ) { puts("out of memory"); exit(-1); }
    //dump_mappings();
    //printf("retval=%p, codep=%p\n", retval, codep );
    /* if (!retval) { puts("couldn't allo
Re: [PATCH] content preserving register pushes
> Currently, instead of pushing the contents of fixed registers onto the
> stack, the registers themselves float on top of the stack. That doesn't
> preserve register contents on a push_x, as the registers are now in a new
> memory location. After toying with fixed registers per interpreter, I
> decided that it probably wasn't worth it in the long haul. (Since it's
> doubtful, given the size of our register sets, that this would need to be
> done often.) Instead, I'm proposing a parallel set of data-preserving
> push_x ops, save_x. Patch after .sig.

The biggest problem that you have with save_x is how to retrieve values when you do a pop_x. Most any nested block / subroutine call needs to return at least one value. One possibility is a consolidate op-code, which is just like pop but accepts a single register to preserve:

AUTO_OP preserve_i {
    int tmp = INT_REG(P1);
    Parrot_pop_i(interp);
    INT_REG(P1) = tmp;
}

But that only works when you have exactly 1 preserved reg, which is not always the case - especially with nested blocks of code which might want to modify several of their enclosing scope's variables. My point is that by itself save_i is useless except for possible side-effects (such as outputting).

I've suggested before a SPARC-like register window. The relevant part here is that you can do a partial push / pop of half the register set. This can be thought of as dividing the register set into input / output registers. A given scope uses all registers, but when it comes time to enter a new scope it has to make sure that all output parameters are in the upper half, then perform a half-push. When leaving a scope, it has to make sure that all returnable parameters are in the lower half, then perform a half-pop. One advantage is that this is faster than a save_i instruction, since only half the regs are saved. Additionally we're not duplicating ALL the registers, just those that we're not using in the nested scope.
This saves (num_regs/2)*sizeof(reg) memory on each new scope (including function calls). The only complex change this would require in the existing code-base is that register sets couldn't just be linked-lists. They'd have to exist as arenas of multiple contiguous reg-sets. The current reg-set would consist of an arena pointer and an offset into that arena (cached by the reg-set pointer). If you're at the end of a memory arena, for example, a half-push would require allocating a new arena of num_regs*num_frames_in_arena, then copying num_regs/2 registers from the top of the old arena to the bottom of the new arena, since a register set has to be contiguous. Full pushes shouldn't be affected; if there is only a half-frame left in the current arena, then it can just skip it and allocate a new arena.

Half-pushes for subroutines are great since you can optimize small subroutines to work within the limited free "out-regs". So the following code:

sub foo($x : int, $y : int ) : int   # equiv to C "int foo(int,int)"
{
    return $x + $y
}
$::z = foo($::x, $::y);

can be:

foo:
    add I16, I16, I17
    return
main:
    ld I16, [x]
    ld I17, [y]
    call foo
    st [z], I16

Additionally:

sub bar($x : int ) : int
{
    return foo( $x, 5 );
}
$::z = bar($::x)

becomes:

bar:
    half_push_i      # I16 is now I1, I17 is now I2
    mov I16, I1      # I1 -> I16
    set I17, 5
    call foo
    mov I1, I16      # save output of foo to our output reg
    return
main:
    ld I16, [x]
    call bar
    st [z], I16

Note that in this case we could have additionally optimized bar to:

bar:
    set I17, 5       # append to parameter stack
    call foo
    return

Note this only works when the parameter numbers are known. We'd still have to resort to something else for dynamic parameters. I'm partial to an external general-purpose stack. For anonymous invocation:

$dispatcher{ foo } = \&foo;
$val = <>;
@params = <>;
$dispatcher{ $val }->( @params );

You'd have to have a generic wrapper function that performed bounds-checking for the prototype.
foo_generic:
    GetStackFrameSize I16       # SP-FP -> I16
    ne I16, 2, foo_bad_num      # invalid num of parameters
    POPStack I16                # throws exception if not type int
    POPStack I17
    call foo
    # hopefully storing on a separate return stack, since it's not
    # useful as a register???  Else we need to save the reg-stack
    PushStack I16
    return
foo_bad_num:
    exception "invalid num params"
    end
    return
main:
    push_p
    push_s
    PushStackFrame
    # $dispatcher{ foo } = \&foo
    newHash P1
    ld P2, ["foo_generic"]      # CV
    SetHash P1, "foo", P2
    # $val = <>;
    readln S1, 0                # read stdin
    # @params = <>;
    newArray P3
    readArray P3, 0             # gulp rest into @P3 (@params)
    HashFetch P4, P1, S1        # P4 is presumably a CV
    PushStackFlatten P3         # takes each el from P3 and puts it on the stack
    call foo_generic            # assumes X16..X32 are volatile
    PopStackFrame

I had originally wanted a complete emulation of the SPARC reg-stack, since it was very elegant. You'd have 2 separate register stacks. You'd what I've des
Re: vtable.h
On Sat, 6 Oct 2001, Michael Maraist wrote:
> My question at this point is if the PMC's are polymorphic like Perl5
> or if there is an explicit type. Polymorphism can make for very
> large vtable sub-arrays (int, int_float, int_float_string,
> int_string, etc).
>
> If PMC-types are bit-masked (for easy upgrading) such as:
>
>   O   O   O   O
>   ^   ^   ^
>   |   |   |   ...
>  INT FLOAT STR
>
> we could apply a macro that extracts the desired type,
> such as GET_PMC_TYPE_INT(Px), which returns int if it is of type int,
> else float, else string.
>
> #define GET_PMC_TYPE_INT(type) (type & PMC_INT) ? PMC_INT : (type &
> PMC_FLOAT) ? PMC_FLOAT : (type & PMC_STRING) ? PMC_STRING : type
>
> Likewise GET_PMC_TYPE_FLOAT would return first float, then int, then string.
>
> It's not as fast, because we're not avoiding the nested if-statements,
> but it's easy enough to read.
>
> P2->vtable->add[ GET_PMC_TYPE_INT(P3->type) ]->(...)

Oops! Stupid me. I forgot that at the add_p_p_p level we don't know that it's an INT / FLOAT, etc. The only way that we could use bit-masked types is if we wrote a complex if statement:

( P2 & PMC_INT ) ?
    ( P3 & PMC_INT ? PMC_INT : ( P3 & PMC_FLOAT ? PMC_FLOAT : PMC_STR )) :
( P2 & PMC_FLOAT ) ?
    ( P3 & PMC_FLOAT ? PMC_FLOAT : ( P3 & PMC_INT ) ? PMC_INT : PMC_STR ) :
( P2 & PMC_STR ) ?
    ( P3 & PMC_STR ? PMC_STR : ...

This is beyond ugly, not to mention not upgradable if/when we add new types. So unless we're using an enum-style type-compaction, doubly indirected vtables won't be feasible.

> Ideally, the bit-pattern for the pmc-type is numerically small (for
> small sub-arrays).
>
> enum PMC_TYPES { PMC_INT, PMC_FLOAT, PMC_STR, PMC_INT_FLOAT,
>                  PMC_INT_STR, PMC_INT_FLOAT_STR, ... };
>
> In this way we simply map everything that has INT in it to the same
> handler. No conditionals at all (but lots and lots of vtable space).
> Thankfully this is constant and could be assigned globally such that
> there is no initialization overhead.

-Michael
Re: vtable.h
On Sat, 6 Oct 2001, Simon Cozens wrote:
> On Sat, Oct 06, 2001 at 09:01:34AM -0500, Gibbs Tanton - tgibbs wrote:
> > Also, how will adds of different types be handled. In the above if pmc2 is
> > an int and pmc3 is a float we're going to have to know that and do a switch
> > or something to convert to/create the right type.
>
> There'll actually (and I need to change my vtable code to reflect this) be
> several versions of each vtable function, depending on the relative type of
> each PMC. Basically, there'll be two easily optimizable versions (i.e. types
> are the same, or types can be easily converted with a cast or simple function)
> and a non-optimized version, which would actually be the naive implementation
> in many cases. ("These types are way out of my depth - call ->get_integer on
> each one, and add the result.")

So would it be something like (ultimately put into a macro):

AUTO_OP add_p_p_p {
    if (!P1) CREATE_PMC(P1);
    if (!P2 || !P3) throw exception;  // however this is done in Parrot
    P2->vtable->add[P3->type]->(interp, P1, P2, P3);  // in macro
}

In this way each vtable operation is really an array of handlers for each possible type of input. This avoids any comparisons. Invalid combinations all share a common function (which throws an "invalid data-intermingling" exception).

int_pmc_vtable = {
    ...,
    {
        pmc_vtable_add_int_int,
        pmc_vtable_add_int_float,
        pmc_vtable_add_int_string,
        pmc_vtable_add_int_iconst,
        pmc_vtable_add_int_fconst,
        ...
    },
    ...
};

// maps to "RET_DT pmc_vtable_add_int_int(interp_t*, PMC*, PMC*, PMC*)"

AUTO_VOP add_int_int {
    UPGRADE(P1,PMC_INT);
    P1->ival = P2->ival + P3->ival;
}
AUTO_VOP add_int_float {
    UPGRADE(P1,PMC_INT);
    P1->ival = P2->ival + P3->fval;
}
AUTO_VOP add_int_iconst {
    UPGRADE(P1,PMC_INT);
    P1->ival = P2->ival + P3;
}
AUTO_VOP add_int_fconst {
    UPGRADE(P1,PMC_INT);
    P1->ival = P2->ival + Parrot_float_constants[P3];
}
AUTO_VOP add_int_string {
    UPGRADE(P3,PMC_INT);
    UPGRADE(P1,PMC_INT);
    P1->ival = P2->ival + P3->ival;
}

Alternatively, if a PMC can't be both a string AND an int:

AUTO_VOP add_int_string {
    int p3ival = PMC_STR_TO_INT(P3);
    UPGRADE(P1,PMC_INT);
    P1->ival = P2->ival + p3ival;
}

This assumes that a = b op c will be the same as a = b.op( c ), which I think is fair. Thus add_float_int produces a float while add_int_float produces an int. The compiler can worry about the order of the parameters. I don't think there's much value in writing a separate a op= b, since you could just do:

P1->vtable->add[P1->type]->(interp, P1, P1, P2);

with hardly any additional overhead. The "optimized" code might have been:

AUTO_VOP inc_int_int {
    P1->ival += P2->ival;  // avoids casting P1
}

But now you have LOTS more vtable ops.

My question at this point is if the PMC's are polymorphic like Perl5 or if there is an explicit type. Polymorphism can make for very large vtable sub-arrays (int, int_float, int_float_string, int_string, etc).

If PMC-types are bit-masked (for easy upgrading) such as:

  O   O   O   O
  ^   ^   ^
  |   |   |   ...
 INT FLOAT STR

we could apply a macro that extracts the desired type, such as GET_PMC_TYPE_INT(Px), which returns int if it is of type int, else float, else string.

#define GET_PMC_TYPE_INT(type) (type & PMC_INT) ? PMC_INT : (type & PMC_FLOAT) ? PMC_FLOAT : (type & PMC_STRING) ? PMC_STRING : type

Likewise GET_PMC_TYPE_FLOAT would return first float, then int, then string.

It's not as fast, because we're not avoiding the nested if-statements, but it's easy enough to read.
P2->vtable->add[ GET_PMC_TYPE_INT(P3->type) ]->(...)

Ideally, the bit-pattern for the pmc-type is numerically small (for small sub-arrays).

enum PMC_TYPES { PMC_INT, PMC_FLOAT, PMC_STR, PMC_INT_FLOAT,
                 PMC_INT_STR, PMC_INT_FLOAT_STR, ... };

In this way we simply map everything that has INT in it to the same handler. No conditionals at all (but lots and lots of vtable space). Thankfully this is constant and could be assigned globally such that there is no initialization overhead.

-Michael
Re: Okay, hardware gurus....
> > Linux/Athlon/gcc.
> >
> > Why does changing this: (DO_OP loop partially inlined)
> >
> > while (pc >= code_start && pc < code_end && *pc) {
> >     do {
> >         x = z->opcode_funcs; \
> >         y = x[*w]; \
> >         w = (y)(w,z); \
> >     } while (0);
> > }
> >
> > to
> >
> > x = z->opcode_funcs;
> >
> > while (pc >= code_start && pc < code_end && *pc) {
> >     do {
> >         y = x[*w]; \
> >         w = (y)(w,z); \
> >     } while (0);
> > }
> >
> > slow it down by 6%?
>
> Perhaps x is no longer considered a register. Are you compiling with -O2?
> Compile to assembly (-S on gcc) and look for that inner loop. See what is
> physically different. (I'm away from my test-environment at the moment)

Oh, better yet: given all the cache-work I've done in the past two days, it's entirely possible that a cache line is violated due to the new assembly ordering. You'd have to check the assembly source to know. One check you can perform is to put a temporary variable within the loop and assign it from the outer variable. This avoids the indirection but keeps the cache / register set populated. If you did use -O2 it shouldn't matter, though, because the code would be reordered either way.

-Michael
Re: Okay, hardware gurus....
> Linux/Athlon/gcc.
>
> Why does changing this: (DO_OP loop partially inlined)
>
> while (pc >= code_start && pc < code_end && *pc) {
>     do {
>         x = z->opcode_funcs; \
>         y = x[*w]; \
>         w = (y)(w,z); \
>     } while (0);
> }
>
> to
>
> x = z->opcode_funcs;
>
> while (pc >= code_start && pc < code_end && *pc) {
>     do {
>         y = x[*w]; \
>         w = (y)(w,z); \
>     } while (0);
> }
>
> slow it down by 6%?

Perhaps x is no longer considered a register. Are you compiling with -O2? Compile to assembly (-S on gcc) and look for that inner loop. See what is physically different. (I'm away from my test-environment at the moment)

-Michael
RE: thread vs signal
On Sun, 30 Sep 2001, Hong Zhang wrote:
> > >How does python handle MT?
> >
> > Honestly? Really, really badly, at least from a performance point of view.
> > There's a single global lock and anything that might affect shared state
> > anywhere grabs it.
>
> One way to reduce sync overhead is to make more operations atomic instead
> of synchronized. For example, read() and write() are atomic. There is no need to
> sync the stream. Array get/put are atomic in Java, so we don't need sync
> either. The high level library or the app itself will be responsible for the
> sync.

Now how do you go about performing an atomic operation in MT? I understand the desire for reentrance via the exclusive use of local variables, but I'm not quite sure how you can enforce this when many operations are on shared data (manipulating elements of the interpreter / global variables). I definitely agree that locking should be at a high level (let them core-dump if they don't obey the design). I liked the perl5 idea that any scalar / array / hash could be a mutex. It prevents you from having to carry around lots of extra mutex-values. We can achieve the exact same "synchronization" policy as Java, or one that's more finely tuned for performance.

-Michael
Re: thread vs signal
> Dan Sugalski wrote:
> > >How does python handle MT?
> >
> > Honestly? Really, really badly, at least from a performance point of view.
> > There's a single global lock and anything that might affect shared state
> > anywhere grabs it.
>
> i.e. not so much 'threaded' as 'stitched up'.

Well, python never advertised speed. Threading is just a clean way of handling certain types of problems. Sounds to me like they simply valued simplicity and stability above other things, so the word "bad" is obviously subjective. I've definitely heard people praise their extension code. Recall, in fact, that in the reviews of the great perl6 conference, several people walked out saying that they were just going to get into Python. Speed can't be the top factor here. If you make extension code easy enough to write, then people can profile and write "fast" code when necessary.

And more to the current point, MT programming isn't necessarily faster code (especially if you don't have multiple CPUs). The classic example of MT code, where you spawn off a new thread to do some CPU-intensive work in the background, is less common than simple IO threads (which include server connection points and loading/saving files). More often, I see "forked" code which handles slow tasks. They're reliable, and have little overhead so long as the amount of work is great.

-Michael
RE: SV: Parrot multithreading?
> > or have entered a mutex,
>
> If they're holding a mutex over a function call without a
> _really_ good reason, it's their own fault.

General perl6 code is not going to be able to prevent someone from calling code that in turn calls XS-code. Heck, most of what you do in perl involves some sort of function call (such as stringifying). Whatever solution is found will probably have to deal with exceptions / events within a mutex. That said, there's no reason why we can't have _all_ signal handler code be:

void sig_handler(int sig) { interp->signal = sig; }

This would just require some special handling within XS, I suspect.

-Michael
Re: thread vs signal
> > I generally divide signals into two groups:
>
> *) Messages from outside (i.e. SIGHUP)
> *) Indicators of Horrific Failure (i.e. SIGBUS)
>
> Generally speaking, parrot should probably just up and die for the first
> type, and turn the second into events.

I don't know. SIGHUP is useful to capture. Not to mention, how else would you maintain compatibility with Perl5's $SIG{HUP}? If you're a daemon, SIGHUP is good to restart off of. I'm seriously doubting that you're going to get a fully portable threading model. One size doesn't fit all. Usually that's something that the JIT would take care of (hence my preoccupation with fake-threads). How does python handle MT?

-Michael
Re: [PATCH] print_s_v op (was: RE: variable number of arguments)
> All --
>
> > I've created a varargs-ish example by making a new op, print_s_v.
> > This is pretty rough, and I haven't updated the assembler, but it
> > seems to work.
>
> Um.. I *have* updated the assembler. Its the *dis*assembler I haven't
> updated. This is what happens:
>
> * *_v ops list their number of args as 1 + the number of fixed
>   args. They list the types of the fixed args, plus a last arg of 'v'.
>
> * The assembler sees all this and splices in an arg count for the
>   'v' arg place, and then hangs however many args remained on the
>   line after the count.
>
> * The op decides how many args it is expecting. For print_s_v,
>   that is via the format string (but really should take care to
>   not read more than the "arg count" arg says are present).
>
> * Since the op is in control of returning the next address to
>   execute, it is trivial for the op to arrange for execution to
>   continue after the last vararg (thanks, Dan!)
>
> * Since the assembler splices in the arg count, though, the
>   disassembler won't have to think very hard to find the beginning
>   of the next op when doing its job.
>
> * Of course, all this assumes that all args have sizeof() ==
>   sizeof(IV), which won't be true until NV constants are moved to
>   the constants area. SO DON'T USE THEM. :)
>
> Anyway, whether or not folks really like the idea of having ops with
> variable numbers of args, this proves its not too hard to do with a
> small amount of cooperation between the assembler and the op func.

With var-args, we could produce highly efficient SIMD instructions: printf obviously, multi-push/pop, createArrays / hashes. I toyed with the idea of a stack-based ALU to handle algebra such as:

( ( a + b*c ) / d + e ) / ( f + g + h )

// compute dest, "fmt", register-names
compute Rd, "++^*+/+", f, g, h, a, b, c, d, e

Unfortunately I couldn't find a way of guaranteeing that it's efficient. The best I've come up with so far is to use a mini-switch statement.
The only reason it might be faster than an optimized switch-statement over all op-codes is that this switch has fewer terms. The above case also gets complex if we don't use multiple registers for intermediate calculations. Nonetheless, the possibilities are cool.

-Michael
RE: Strings db
> and a call to the API would be:
> char *label = gettext( "This feels strange\n" );

Does your idea allow for:

int msgid = txtToMsgid( "This feels strange\n" );
char *label = msgidToRes( msgid );

in addition to the above, since this affords compile-time optimizations? I'm not following this thread with enthusiasm (I'm an English bigot myself), but I'm assuming it involves (among other things) displaying locale-based error messages. Such messages could be known after compilation. Granted, errors don't really need to be fail-fast.

-Michael
RE: variable number of arguments
> > >I have a minor issue with a proliferation of createArray. In perl5 we
> > >used the Stack for just about everything minus physically setting @x =
> > >(1,2,3). The creation of a dynamic array is a memory hog.
> >
> > Less of a hog in many ways than using a stack. Worth the times when it's not.

I don't see how. A stack is "supposed" to be pre-existing and maintain headroom for growth. The simplest stack (which I think is plenty peppy) is one that checks bounds and reallocs when it grows too much. If it grew that big once, then it would surely grow that big in the future, which alleviates memory fragmentation. There's already a stack structure in interpreter_t. I'm not sure how parameter passing is being implemented yet, but anything other than some type of stack is going to have problems (which can only be addressed via dynamic allocation of arrays... which are just slower parameter-stacks).

> > >ESPECIALLY with
> > >mark-and-sweep dynamic memory (which it sounds like perl6 is aiming for).
> >
> > I don't see the connection here.

The connection had to do with the allocation / deallocation of these intermediate dynamic arrays. It was just a call of attention to the ever-present point of performance degradation.

> > If the compiler generates code like that and I'm happy about it, someone
> > needs to smack me. *hard*. :)
> >
> > That ought to be either a series of print ops or, more likely, a set of
> > push ops and a final prints.

But pushing onto what? A PMC which points to a dynamic array? Pushing is arguably slower than pre-extending an array and statically assigning to an index, since you avoid multiple increments. Not to mention, if you push a million entries, you might have to regrow the array 20 or more times. I stated why I used the array for the print, but you could substitute print with "@y = map { .. } (1,2,@x)", which would require an intermediate array of memory. We just didn't have assembly available for that code.
You could set P1 to the new array and concat the three terms, but that ultimately involves iterating through the array @x and "pushing values". So you've not escaped the use of a stack. What's more, since this is a temporary stack, you have to deal with memory management. Once again, the alternative is a general-purpose stack, whose memory management is minimal.

> >In any case, what we've done here is dynamically create an array each time
> >through the loop.
>
> Yes, but we're doing this with a stack-based system anyway. It's just an
> anonymous pseudo-array (i.e. the stack top), but an array nonetheless.

I'm not convinced that we can't / shouldn't implement persistent general-purpose stacks, which include parameter passing. Obviously this would require frame support (since otherwise @_ would have to be a newly allocated array), and it seems that the direction is towards the exclusive use of registers. I just don't know that this doesn't impose even greater performance / memory penalties for non-static-parameter functions (such as print / map / grep or non-prototyped subroutines). The way I see such a stack being used:

interp:
    ..
    Stack stack;
    Stack frame;

AUTO_OP push_x {
    if ( stack->top >= stack->size ) { grow_stack(stack); }
    setPMC( stack->data[ stack->top++ ], P1 );
}
AUTO_OP get_stack_frame_size {
    REG_INT(P1) = stack->top - frame->data[ frame->top ];
}
AUTO_OP push_frame {
    // auto-extend frame-stack
    frame->data[ frame->top++ ] = stack->top;
}
AUTO_OP get_stack_X_i {
    REG_X(P1) = stack->data[ frame->data[ frame->top ] + P2 ];
}
// set_stack_X_i
// alloc_stackn
AUTO_OP pop_frame {
    stack->top = frame->data[ --frame->top ];
}

Internal C-code could even get pointers to the base frame so as to avoid all the indirections. So print, map, grep, etc. would be able to read quickly. You could additionally have efficient op-codes that copy PMC arrays onto the stack.
That said, if we're exclusively using registers for parameter passing, then there's currently the issue of efficiently passing and returning values. Currently the only way to pass parameters is to duplicate a given register set, then move parameters around so they'll be where the subroutine expects them. The called function would then push any additional register-segment stacks that weren't given parameters, so it wouldn't overwrite the caller's values. When complete, it would restore those reg-stacks that it pushed and would leave return values in P1, I1, etc. But now the caller has to pop its register-stacks to get back to its old values, and in doing so it'll lose the return value(s). There are two ways I can see this working. The first involves doing a half-push / half-pop, where you have a 16-register input window and a 16-register output window. If a function can accomplish its tasks with whatever is left over, then it doesn't need to do another half-push (it just can't touch P1..16). On return, the return-values are in P17..P32. The caller would have to copy the
Re: parrot malloc (was Re: Parrot coredumps on Solaris 8)
On Mon, 24 Sep 2001, Uri Guttman wrote:
> > "AB" == Alan Burlison <[EMAIL PROTECTED]> writes:
>
> AB> If you are interested, I'll try to get a copy of the paper to you.
> AB> It has some interesting ideas. By having 'arenas' for each object
> AB> type, you can remove a large part of the overhead of
> AB> splitting & recombining memory chunks. The perl5 allocator already
> AB> does something similar.
>
> sounds like a wheel worth looking at. are there working free C libs that
> implement it? how much work is it to write a new one that is
> parrot-centric? would we need all of its features that manage other
> things? i assume that the segmented stacks would also use arenas like
> this. we will need to define our set of arenas and sizes at some point.
>
> we haven't touched threads and malloc. does each thread want its own
> arena so we don't need locking?

Interesting idea. The main allocator could be locked, so long as x% of all allocations come from thread-specific arenas. The only problem is that this adds overhead to creating / deleting threads (setting up / freeing arena chunks). This would only be efficient if the main allocator only returned chunks on the order of a page-size (which would then be divided up into smaller-granuled arenas by the parallel-processing threads).

Another method would be to act like databases do: you finely tune your memory allocator so that you only perform regional locks. You also point threads to different starting / insertion points along the memory chains. This way you have a unified memory allocation scheme with only occasional contention locks. If all your data structures are in large chunks (on the order of a page), then this would do well. (Though it might require architecture-specific test-and-set operations for efficiency.)

-Michael
RE: [PATCH] assemble.pl registers go from 0-31
> Just curious, do we need a dedicated zero register and sink register?
> The zero register always reads zero, and can not be written. The sink
> register can not be read, and writes to it can be ignored.

I brain-stormed this idea a while ago, and here's what I came up with. We're not RISC, so we don't need to worry about mapping "mov %r1 -> %r2" to "or %r1, %r0 -> %r2", which is probably the most common use. Additionally, we'll always be able to use constants of size 32/64 bits. RISC could only use 8-22 bit constants, so often it was easiest to operate on a full 32-bit zero. We could always add a second constant parameter to an instruction if it was deemed useful enough.

As for a bit-bucket, I can't imagine why you'd use it. Side effects and no-ops are all I can remember them ever being useful for. But again, we're not RISC; we have a no-op, and we can add separate functions for side-effects.

Lastly, the main reason I say no:

AUTO_OP add_i_i_i {
    int p2_val = 0, p3_val = 0;
    if ( P2 != 0 ) { p2_val = REG_INT(P2); }
    if ( P3 != 0 ) { p3_val = REG_INT(P3); }
    if ( P1 == 0 ) {
        p2_val + p3_val;   // result discarded: the sink register
    } else {
        REG_INT(P1) = p2_val + p3_val;
    }
}

Any questions? You could have an optimizer convert all %i0 to constants and rearrange the op-code so it maps to a real one, but you could just have had the compiler produce proper code in the first place. I don't think we're looking to use assembly as one of the "supported" language layers, due to its propensity to change.

-Michael
RE: variable number of arguments
> > is it possible for the ops to handle a variable number of arguments, what I have > > in mind : > > > > print I1,",",N2,"\n" > > This should be done by create array opcode plus print array opcode. > > [1, 2, 3, 4, 5]

I have a minor issue with a proliferation of createArray. In perl5 we used the stack for just about everything minus physically setting @x = (1,2,3). The creation of a dynamic array is a memory hog, ESPECIALLY with mark-and-sweep dynamic memory (which it sounds like perl6 is aiming for). Let's say you wanted

    for(...) {
        x++
        y++
        z++
        print "x=$x, y=$y, z=$z"
    }

Then you'd possibly have

    LOOP:
        cond_eq Ix, DONE
        inc I1, 1
        inc I2, 1
        inc I3, 1
        createArray P1, 6
        setIndexOf P1, 0, "x="
        setIndexOf P1, 1, I1
        setIndexOf P1, 2, ", y="
        setIndexOf P1, 3, I2
        setIndexOf P1, 4, ", z="
        setIndexOf P1, 5, I3
        print P1
        branch LOOP
    DONE:

Obviously this could be written differently (say using 6 separate print statements... but we're saying that print is slow, so we'd want to minimize calls to it, possibly via performing an implicit join when printing an array). Additionally createArray could have happened on one line, but that's not an issue either (this avoids type-issues).

In any case, what we've done here is dynamically create an array each time through the loop. Theoretically we could reuse the same array, but what if instead of a loop this is part of a log function which happens to get called a zillion times in debug mode. For many types of gc-ers, pointing to something new leaves the unreferenced data stranded until we're either memory-starved or a timer goes off. In the case of a createArray in an inner loop, we could segment the entire memory map into chunks that fit this dynamic array before we starve for memory. Depending on the memory allocator, gcing that data will mean coalescing ALL of it (potentially hundreds of megs worth). Recently, there was an international programming contest for compressing XML, and I wrote a program using perl.
Worked great, except when I used input data on the order of a meg, I consumed 300-400 meg worth of user-memory. No biggie, but when the program was finished (physically output the result) it took just as long to deallocate everything (20 minutes). My guess is that the gnu memory allocator used by perl5 w/in red-hat 7.1 did some heavy coalescing. When I used perl5's shipped memory allocator w/ no coalescing, the program exited in a nominal amount of time (while saving some 100 meg of memory to boot). Let's just say that I developed a new fondness for the perl5 memory allocator. The lesson I learned was to not take dynamically allocated memory for granted.

A simple mark-and-sweep GC is going to have a major hiccup periodically with this sort of inner-loop allocation. In this case ref-counting would at least not leave a straggling allocation, and thus would not micro-fragment / hiccup. I'm sure you can write the mem-allocator to gc before fragmenting larger chunks, but it's still excess CPU overhead for this task. The point here is to avoid the use of dynamic allocation where possible, especially within the internals. This is a perfect use for the perl-stack: short and simple.

-Michael
RE: Strings db
> GNU does offer the gettext tools library for just such a purpose. I don't > know how it will translate to the various platforms however, and it likely > is a major overkill for what we are trying to do. > http://www.gnu.org/manual/gettext/html_mono/gettext.html#SEC2 - Purpose > It might make sense to implement something like this now, rather than create > our own system and find out it is insufficient down the road. > Just a thought, > Grant M. But wouldn't that make parrot GPL'd? -Michael
Re: SV: Parrot multithreading?
> >then what about a win/win? we could make the event checking style a > >compile time option. > > Odds are you'll get per-op event checking if you enable debugging, since > the debugging oploop will really be a generic "check event every op" loop > that happens to have the "pending debugging event" bit permanently set. > Dunno whether we want to force this at compile time or consider some way to > set it at runtime. I'd really like to be able to switch oploops > dynamically, but I can't think of a good way to do that efficiently.

long-jump!!!

    runops(bla bla) {
        setjmp(..);
        switch(flags) {
            fast_runops(bla bla);
            debug_runops(bla bla);
            trace_runops(bla bla);
            conservative_runops(bla bla);
            thread_safe_runops(bla bla);
        }
    }

    AUTO_OP sys_opcode_change_runops {
        bla bla
        set run-flags..
        longjmp(..)
    }

In C++ I'd say throw the appropriate exception, but this is close enough. This would work well for fake-threads too, since each thread might have a different desired main-loop. You'd have to do something like this if you transitioned between non-threaded and threaded anyway.

-Michael
Re: SV: Parrot multithreading?
> Odds are you'll get per-op event checking if you enable debugging, since > the debugging oploop will really be a generic "check event every op" loop > that happens to have the "pending debugging event" bit permanently set. > Dunno whether we want to force this at compile time or consider some way to > set it at runtime. I'd really like to be able to switch oploops > dynamically, but I can't think of a good way to do that efficiently.

If you're looking to dynamically "insert status checks every op", then that sounds like picking a different runops function. We've already got a trace variant. We could farm out a couple of these and have execution flags specify which one to use. If you wanted every 5th op to check flags, you could trivially do:

    while(code) {
        DO_OP(..)
        if(code) DO_OP(..)
        if(code) DO_OP(..)
        if(code) DO_OP(..)
        if(code) DO_OP(..)
        CHECK_EVENTS(interp)
    }

The inner loop is a little bigger, but aside from cache-issues, has no performance overhead. This would prevent having to interleave check-ops everywhere (more importantly, it would reduce the complexity of the compiler, which would otherwise have to guarantee the injection of check-events inside all code-paths, especially for complex flow-control like "last FOO"). You could use asynchronous timers to set various flags in the check-events section (such as gc every so often). Of course this requires using a more sophisticated "alarm/sleep" control system than the simple wrapper around alarm/sleep and $SIG{X}, etc. Other methods might be: whenever a dynamic variable reference is reassigned / derefed, an event flag is set to queue the gc, etc.

-Michael
Re: Curious about Parrot + Closures
> I'm just curious, is there a plan for how closures will work in Parrot? I > think that closures are one of the coolest Perl 5 features around (despite > their memory leak issues :-), and I'd hate to see them go away.

I doubt that there's any limitation. In Java, all they had to do was supply a reference to the enclosing object and compile a new class which used some local and some "friend member vars". We're obviously not fundamentally OO here, so this approach won't work. We still haven't seen how PMCs (or whatever they're called) will look. They contain a structure. That's probably perfect for the reproduction of perl5 CVs, which I believe contained (among other things) the "locals", which included references to closure values.

In short: fear not.

-Michael
Re: Parrot coredumps on Solaris 8
> DS> At 12:29 AM 9/21/2001 +0200, Bart Lateur wrote: > > >> >Horribly wasteful of memory, definitely, and the final allocation system > >> >will do things better, but this is OK to start. > >> > >> So to stop it wasting memory, subtract 1 first and add it again later. > > DS> Nah, it'll still waste memory. It can't not, because of the way it > DS> works. It has to ask for a big gob of memory and ignore bits to > DS> make sure it gets the alignment right. This particular > DS> implementation's not all that efficient I'm sure, but it'll get > DS> ripped out and replaced when we have a real memory management > DS> system. > > what about a binary type allocation system? > > we have multiple allocation structures in an array with each one > allocating the next size up binary size. so the lowest level does say 32 > bytes (or whatever our smallest granularity is). the highest level could > be in the multi-megabytes or larger. when a size is requested it goes to > that size queue and if there is a free block on its list, it gets it. if > no block exists, it requests from a larger size and will break it > down. now, if you allocate with malloc and it is not on the proper > boundary, the preceding ram is broken into smaller parts and put on those > queues. so no (or little) ram is wasted. all blocks are allocated on > proper boundaries and you can do binary coalescing during (incremental) > GC. > > a first request of a small block may allocate 32 (or some other 2**N) of > them in a large chunk. that large block is managed by just slicing off > the leading chunk and moving a pointer, so that is very fast. when the > block is gone and the queue is empty, a new larger block is allocated. > > once the queues get populated some, it will be very fast. > > lemme try to whip out something like this soon and see what happens. or > does anyone else want to hack it with or without me? > > uri

Why aren't we just using the perl5 memory manager? Cut-and-paste, right?
It used 2K page alignments for the most part. That should account for most of our alignment needs. Actually, I think it put the size as the first several bytes. But this could be rectified by "spilling into the predecessor". Namely, all allocations start at 2K but spill back 4 or so bytes. This means that free chains have to be <=4 above the grain. If you use pow2-based allocation instead of the physical size-of-data, then you can assume that it's 2^idx - 4 when allocating. It's more wasteful of memory, but with things like 2K-4B pages and clustered sub-2K allocations, it's very efficient. Until we get to that point, our current hack of a memory allocator is fine for any applications we're capable of writing with this boot-strap API.

-Michael
Re: Wow.
On Mon, 24 Sep 2001, Buggs wrote: > On Monday 24 September 2001 03:27, Dan Sugalski wrote: > > At 01:47 AM 9/24/2001 +0100, Simon Cozens wrote: > > >http://astray.com/mandlebrot.pasm > > > > > >Leon, you're a sick, sick man. > > > > Okay, I think that means we need to weld in bitmap handling opcodes into > > the interpreter. :)

Woohoo.. How many times have we seen code like this:

    sub log2($) {
        my $val = shift;
        my $idx = 0;
        $idx++ while $val >>= 1;
        return $idx;
    }

I'd love to see a full complement of bit-code like the 386 had (oh the days). :) That includes get_index_of_left_most_bit (alias log2_i_i), get_index_of_first_[one|zero] (which is useful for searching bitmaps), and count_ones. I'd have to go get my old [345]86 assembly book for the rest. This is useful stuff people. :) Heck, on x86 platforms we could use #ifdefs and assembly. :) Tell me this wouldn't be cool?

-Michael
Re: Suggestion: register liveness information
> I have a suggestion for allowing parrot implementations to execute > code more efficiently. Add an "instruction" or other annotation which > denotes what registers are "live" at some point in the code.

Does it have to be in the instruction stream to be useful? Why not just make it part of the constants / fixup segment? This way non-JIT execution won't require wasted op-execution (which we're devilishly trying to prematurely optimize :). You could profile things via perl6-code attributes which give clues to an optimizer or JIT (such as frequency of use, etc). In any case, I can't imagine that a JIT compiler is expected to quickly execute. What you have suggested is summary code, which can be determined at JIT compilation time. I'm not convinced that the waste of CPU for invocation of op-codes that aren't used in non-JIT form (which is the case when you know you're only going to run this once) is justified for such simplistic information.

However, a similar idea of "live registers" exists on newer CPU architectures: namely, mapping how many registers need to survive the subroutine call. The callee, in turn, specifies how many it will use within its block. The CPU then frees non-maintained registers and allocates just enough. The IA64 does something along these lines. Here, however, we're not starved for register space; we're utilizing multiple stack-streams. The beginning of a block simply allocates new stack space on whichever stacks it requires. The only optimization in this manner is to have the caller specify if it has used less than the full complement (of 32 or so), which would allow the potential stack-allocation to save a block of 8, 16 or 24 registers (useful when only 8 are used; this also rewards compiler optimizations which reuse registers, which has cache rewards as well).
Namely:

        general-op
        general-op
        keep_int_regs %16
        jsr _foo_func
        general-op

    _foo_func:
        push_int_regs   // only shifts reg-stack by 16 (properly handles overflow)
        push_num_regs   // shifts a full 32
        general-ops ...
        pop_num_regs    // shifts back 32
        pop_int_regs    // remembered to shift back only 16
        ret

Unfortunately this trades off memory savings for extra CPU overhead. You have to maintain a separate "shift-stack" which says how much to shift each time you do a push/pop. The current shift-value is set via the new op-code "keep_x_regs", which also introduces overhead. I personally would like to see smaller subroutines and more of them, but I don't know that we're really going to be wasting a whole hell of a lot of stack-space. At least not enough to be worth the CPU overhead. In general, if the hint-information is useful for non-JIT execution as well, then I'd recommend it, but I'm not seeing this particular point as being worth the cost.

-Michael
Re: Using int32_t instead of IV for code
> >>We're talking bytecode. That will indeed be a case of "huge arrays of > >>tightly packed integers". > > > >For bytecode, it's not a big problem, certainly not one I'm worried about. > >Machines that want 64-bit ints have, likely speaking, more than enough > >memory to handle the larger bytecode. > > I'm more worried about the cache. For 32 bit bytecodes, the same program > will be only half the size than for 64 bit. Or: you can fit a twice as > large program in the cache, or two of them (for multitasking). That will > mean a speed-up, and likely a vast one for programs with sizes close > enough the cache size.

If this is the case, then why are we using 32-bit register identifiers? Obviously it makes code easier to write. But at some point are we going to compress the byte-code? Along with a previous email that suggested that byte-code is only going to be valid on a given machine with a pre-compiled parrot / perl6 core, the bytecode won't need to worry about the number of registers, etc. Most VM architectures use 8 or 16 bits for the op-code (including the register map). Here are the pros and cons as I see them:

Cons:
 o 8-bit op-codes dramatically limit the number of "macro-ops" or advanced ops
 o 16-bit op-codes have potentially screwy alignment issues.
 o 4-bit register addresses have definitely screwy alignment issues (requires masking to extract values, which takes more cpu-time)
 o sub-32-bit ops might be slower on non-x86 architectures (since more and more are 64 or 32 bit only, and require special ops to munch sub-32-bit data; namely alpha)
 o If the constant-table index was chosen as 16 bits, we'd limit the number of entries to 64K per code-segment (but who would ever use more than this in p-code ;)
 o If the address was limited to 16 bits, we'd have a code-size limitation of some 4K ops per code-segment.
 o allowing p-code-size to be an issue dramatically increases the number of op-codes early on in development (write this as a bug-enhancement instead?).
Namely we have inc16_i8_ic8, inc16_i8_ic16, inc16_i8_ic32, add16_i8_i8_ic16, add16_i8_i8_ic32.
 o a larger number of op-codes means more c-code, which means a trade-off between D-cache (for byte-code) and I-cache (for the extra c-code).

Pros:
 o dramatically compressed op-code size (from 128 bits on average to 32, 48, 64). Saves both on disk space and cache-space - tighter inner-loops.
 o Potentially more highly optimized c-code; 16-bit adds are somewhat faster on some architectures - do what's needed when it's needed. 64-bit archs will upcast everything anyway.
 o If we eventually determined a max code-length (of like 64 bits after alignment), then we could just make all code that big. This would dramatically reduce c-code overhead (no offset = DEFAULT_SIZE; offset += rel_address; return offset; .. code = offset;). This would additionally reduce the risk of jumping into the middle of an op-code. Heck, we could do this now: simply profile all op-codes to determine the max code-size, then pad all op-codes to that size. Given that we're not into dynamic op-codes, and most everything is being pushed into the constants area, I don't see much danger in it.

Food for thought..

-Michael
Re: Using int32_t instead of IV for code
> At 05:32 PM 9/23/2001 +0200, Bart Lateur wrote: > >On Thu, 13 Sep 2001 06:27:27 +0300 [ooh I'm far behind on these lists], > >Jarkko Hietaniemi wrote: > > > > >I always see this claim ("why would you use 64 bits unless you really > > >need them big, they must be such a waste") being bandied around, without > > >much hard numbers to support the claims. > > > > >Unless you are thinking of huge and/or multidimensional arrays > > >of tightly packed integers, I don't think you should care. > > > >We're talking bytecode. That will indeed be a case of "huge arrays of > >tightly packed integers". > > For bytecode, it's not a big problem, certainly not one I'm worried about. > Machines that want 64-bit ints have, likely speaking, more than enough > memory to handle the larger bytecode.

That's not the problem. The problem is portability of the byte-code. I've heard various discussions about byte-code compiling. One of which was that it is not considered kosher to let a program compile code and write into a "/usr/local/scripts" directory or any such thing. Many companies want to be able to distribute "byte-compiled" code (such as with python). Though I don't know what the powers on high want: are we saying here and now that the byte-code is strictly non-portable? That parrot code is only as good as the machine it was compiled on? 64-bit compiled parrot-code will not work on 32-bit machines. If we're making that distinction, then I'm under the impression that a whole heck of a lot more optimizations can be made for the byte-code than simple word-size. Not least of which would be in-line assembly (from xs-code).

-Michael
Re: Using int32_t instead of IV for code
On Sun, 23 Sep 2001, Bart Lateur wrote: > On Thu, 13 Sep 2001 06:27:27 +0300 [ooh I'm far behind on these lists], > Jarkko Hietaniemi wrote: > > >I always see this claim ("why would you use 64 bits unless you really > >need them big, they must be such a waste") being bandied around, without > >much hard numbers to support the claims. > > >Unless you are thinking of huge and/or multidimensional arrays > >of tightly packed integers, I don't think you should care. > > We're talking bytecode. That will indeed be a case of "huge arrays of > tightly packed integers".

I was under the impression that the byte-code was all going to be integer-based (op-code-id, reg-id, 32-bit int constants, etc). Anything else should be in the constant pool. I'm not sure how damaging 32-bit pointers are in a 64-bit arch, but the byte-stream "addresses" are simply 32-bit offsets. I can't imagine a 4.1-gig opcode chain. Thus, as long as we allow long long int and [long] double in the constant block, I don't see a problem with compatibility between 32/64-bit archs. The constant-declaration should be able to specify the granularity.

-Michael
Re: Draft switch for DO_OP() :-)
> On Thu, Sep 20, 2001 at 11:11:42AM -0400, Dan Sugalski wrote: > > Actually the ops=>C conversion was conceived to do exactly what's being > > done now--to abstract out the body of the opcodes so that they could be > > turned into a switch, or turned into generated machine code, or TIL'd. If > > you're finding that this isn't working well it's a sign we need to change > > things some so they will. (Better now than in six months...) > > The problem is that the conversion currently done by process_opcodes.pl > translates the op definitions into functions, and leaves the remainder > of the file untouched. This is useful, because it allows the opcode > file to include headers, declare file-scope variables, and the like. > Unfortunately, when translating the ops into a switch statement in a > header file, there is no place to put this non-opcode code. I didn't have a problem with this. It just took some clever perl5 programming (namely outputting to separate streams and combining them at the very end). It's not fool-proof, but works for the important stuff (like includes / globals). > > There are a few approaches we can take. The simplest, I think, is to > ignore the problem when generating inline ops; given that these ops > are going to be compiled at Perl build time (they can never be > dynamically loaded for obvious reasons), we can manually put any > required #includes in interpreter.c. Files containing dynamically- > loaded ops can be generated in the same way that process_opcodes.pl > does now, preserving the file-scope code. > > Another approach would be to include a means of defining information > that must be included by the file implementing the ops. For example: > > HEADER { > #include > } > > This would then be placed into interp_guts.h. (Possibly surrounded > by a conditional guard (#ifdef PARROT_OP_IMPLEMENTATION), so no file > other than interpreter.h will pick up that code.) > > - Damien > - /= |_ |_| /= /= \./
switch-based interpreter
I don't remember if this has already been experimented with; I've lost track of what the previous sets of benchmarks were for. I was curious to see how much faster a single-function switch-based interpreter would go. I took process_opfunc.pl and created process_opfunc_switch.pl, which does the work. I added a few additional optimizations, such as copying many of the interpreter variables into local variables (to reduce the indirection by one). I then redefined the register access macros locally to account for this. I also munged the "RETURN" keyword to directly increment the code pointer. The only problem I encountered was that the existing parser wasn't designed to adapt well to the new "nested code", so the comments outside the functional blocks (of basic_opcodes.ops) are no longer near their respective code. (Also identified a bug in process_opfunc.pl, which I'll post patches for later.)

Using a 466MHz dual celeron and a 600M instruction loop (bench.jako with different limits):

    Stock-code      default compiler options     8.2M ops/s
    Stock-code      -O3 with the works          13.5M ops/s
    switch-based    default compiler options    11.5M ops/s
    switch-based    -O3 with the works          28.9M ops/s

    # for reference, a perl -e '...' was run with equivalent code
    perl5.6 (redhat built -O2)                   4.2M ops/s

I actually ran code with and without the bastardized local copies of INT_REG, NUM_REG. With default compiler options this made a difference of several Mops/s; with -O3 the time difference was negligible.

One additional optimization that I'm going to try and make: the *.ops code is not ordered with respect to the op-code-id, so the switch statement couldn't be optimized by gcc (it just created a local jump-table, which essentially is the same as vtable[*code] minus the minor overhead of subroutine stack operations (which occur in parallel)). I'm going to redesign the compiler such that the switch statement is numerically ordered. I believe this will cause a direct PC + *code jump statement.
Lastly, the highly tuned compiler decided to litter the code with branch statements. I'm guessing it was to maximize code-reuse (possible reduction in cache size??), though this thwarts much of the point of having code blocks close to each other. The end result is that the performance gain is not spectacular (at least with this minimally cached celeron). The test times were between 140s and 23s, with the lower ends having only 10s of seconds between them. There was a variance of 2 or more seconds for consecutive executions, so this isn't definitive. I'm away from my usual internet connection, so it's inconvenient to post patches and files. If anyone really wants to see them, I'll make a special effort.

-Michael
Re: A task independent of the freeze
> On Fri, Sep 21, 2001 at 02:24:43PM -0400, Dan Sugalski wrote: > > Doing this by hand with -O3, you can see a speedup of around a factor of 45 > > over an unoptimised runops loop, so it's definitely worth doing in some > > cases... > > Cool! Parrot JIT! > > But that tells us *just* how time-consuming our runops loop is. > :(

I think the slowest part is the function-call. But this 45x performance boost is probably measured via x86. SPARC has a very nice function-call optimization (no main-memory needed). Given that the inner loop probably maxes out at 5 function calls deep (do->op->api_func->c_func..), SPARC is optimal for parrot. Simply going to a switch architecture gets rid of the "op-code" function-call. The only remaining slowdown is the code-branching (which thwarts pipelining / instruction-level parallelism). Either way, branching can be reduced by clustering a sequence of regularly used ops together into macro-ops.

I'm not of the opinion that concatenating everything into one monolithic function is of use (converting jsr/branch to gotos), since it doesn't extend well to dynamic operation (redefining functions / dynamically loading code). A c-function call isn't that much more destructive than a non-local goto, so I'd recommend these "macro-ops" be comprised of 2-or-more op-codes grouped into a function (probably using a hash-coded function-name: F124x3ba for example). Using hash-tables at compile-time, it would be possible to reuse macro-ops throughout the code. The more code-reuse (through variable-complexity statistical code-analysis; e.g. -O2, -O3, etc), the better locality and cache-coherence, and the smaller the executable size. What's more, it might even be possible to reuse these macro-ops in eval / "use" statements if the compile-info is maintained until run-time.
The macro-ops can be used either as a free-lance collection of functions (which are compiled and linked in dynamically at run-time (portability issues?)), or as a monolithic switch statement (of ops + macro-ops). Note, there is little need to group the macro-ops together into a monolithic function, since the entirety of the main-code could become a single macro-op (which optionally inlines the other macro-ops). Ideally the ultimate jit would generate assembly and concatenate this into an existing executable memory segment (self-modifying code). (yeah right)

As a plan of attack, I can see the first stage producing a monolithic function (as with the original request), then later breaking out function-blocks (anything in brackets, including the eventual subroutine) into macro-ops, then just for fun (or when there's time) performing statistical analysis to maximize code-reuse.

As a further note, macro-op code-reuse would be most efficient if we had a manner of generalizing parameters. This isn't a problem in stack-based code (such as java), but we have to deal with it due to mappable registers. Two code segments may only differ in the specific registers utilized (based on compiler allocation), thereby thwarting consolidation. As a possible solution, the macro-op is provided the complete set of register-addresses and ignores those specified by the byte-code generator. Namely:

Original byte-code segment:

    set I1, {constantA}
    set I2, {constantB}
    add I3, I1, I2

The macro-op _could_ be:

    F12345(code, interp):
        REG_INT(1) = {constantA};
        REG_INT(2) = {constantB};
        REG_INT(3) = REG_INT(1) + REG_INT(2);

Note how we've hard-coded the reg-addrs. This function takes no parameters.
My proposal is:

    F12345(code, interp):
        REG_INT(P1) = P4;
        REG_INT(P2) = P5;
        REG_INT(P3) = REG_INT(P1) + REG_INT(P2);

The above can be interpreted as either a generic function with extended args which will be called with F12345(code, interp, P1, P2, P3, P4, P5) inside another macro-op, or via a custom-built p-code which acts in all ways identical to our original p-code except for the dramatic expansion of op-codes. What's interesting about this approach is that it can use the same interpreter for optimized code. The main functional difference is that we've reduced the size of the byte-code by several orders of magnitude. Inner loops can possibly fly (w/o any function-calls nor non-local branches).

The calling sequence could be as follows:

    edit foo.pl
    perl6 foo.pl
        compile foo.pl, generate pasm (output if -S option)
        assemble pasm into pbc (output if -o option)
        If (-Ox):
            generate macro-ops from pbc
            generate non-portable-pbc from macro-ops
            compile macro-op code to executable code (DLL/so where available)
                output if -b option, else write to temp-file
            if can-dynamically-load: load library; else: exec new-code
        run interpreter on pbc

Note: you can invoke perl6 with .pl, .pasm, .pbc, .npbc:.nplib (non-portable byte-code and its associated library)

Food for thought...
do-loop too restrictive?
Question. It seems to me that the current do-loop:

    while(code > x && code < y && *code) { DO_OP }

Is fail-fast and succeed-slow. I know it has a brother, "while(*code)", but this could lead to seg-faults, especially if we allow dynamically modified parsers / compilers. The first method has an even more significant problem, though: namely, that the interpreter can't possibly know how big the code-segment is, since there is the possibility of dynamic loads of p-code or dynamically generated "eval-code". This is only an issue if parrot assumes a late-bound compiler, but I'm under the impression that it's strictly compatible with perl5, which obviously is.

From this, an earlier bias I had is further reinforced. Since we don't know the upper and lower bounds of allowed code, and it would be nice to avoid seg-faults, I recommend:

    while(code) { DO_OP }

This is success-fast, and fail only requires execution of the "end" op-code (which only executes once for the whole program). It's redundant to check for the end-op as with while(code && *code){}, and it only hurts mainline performance (though I know that's not yet an issue). To maintain strict p-code checking, you'd have to dynamically load the extents of the code-segment (most likely from the interpreter), as with:

    while( code > INTERP->start_code && code < INTERP->end_code ) { DO_OP }

jsr's and ret's might modify start/end extents, and so on. (Again, *code is redundant.) A "premature" optimization could be to have nested do-loops:

    func runops( code, interp ):
        while( !interp->f_done ):
            _innerrunops( interp->start_code, interp,
                          interp->start_code,
                          interp->start_code + interp->code_size
                          /* other run-time constants */ );
            CHECK_SYSTEM_STATE  // potentially avoids littering CHECK_EVENT op-codes

    func _innerrunops( code, interp, start, end /* others */ ):
        while( code > start && code < end ) { DO_OP }

This would assume that far-jmp / far-jsr (those based on dynamically linked code) set interp->code and return zero.
I'm not crazy about it; if nothing else, it requires an extra c-function call, but it might be necessary when we get to dynamic linking anyway, to additionally configure the interpreter state (switching current packages, etc). I've only recently gotten into this mailing list, so I've missed any previous justification for the existing use of while(*code), or any direction involving dynamic linking.

-Michael
RE: question about branching/returning
On Thu, 20 Sep 2001, Brent Dax wrote: > Damien Neil: > # "RETURN(0);" (written exactly like that, no variation permitted) > # is a special case, and terminates the runops loop. The only op > # which uses this is "end", and it doesn't actually ever execute. > # Personally, I feel that this special case should be removed. > > We should probably just check if the thing returned 0 and exit runops if > that's the case. This will also protect us against utter stupidity > like: > > FOO: jump FOO > > which wouldn't be a bad thing to protect against.

To my understanding, jump FOO would physically map to the c-code "return code + code[1]", which means the number zero is not returned, even though the offset of FOO is zero bytes. (I'm away from the source code so I can't verify this, but it's what I remember.)

-Michael
Re: RFC 310 (v1) Ordered bytecode
> Ordered bytecode
>
> Bytecode should be structured in such a way that reading and executing
> it can be parallelised.

Are you suggesting a threaded VM? I know that the core is being rewritten, so it's a possibility. If this is the case, then you'll want to reference some of the other RFCs relating to threading. It's an interesting idea, and it reminds me a lot of IA-64's chained lists of non-dependent instructions.

The only way I can imagine this being useful in a non-optimized fashion is if raw async I/O is performed. Namely: read a minimal chunk, figure out where the next chunk is (if it's not already cached), initiate an async read on the newly determined chunk from disk, then begin execution / initialization of the current chunk until its end is reached. Inbound async I/O (through interrupts) appends itself to the input queue. If that model could work (unfortunately, it doesn't work on all OSs), then it would make maximal use of CPU time (especially on single CPUs), without dealing with the race conditions and other such evils inherent in MT programming.

In general, however, I don't see bytecode reading as being the real bottleneck. Take the following virtual chart (compile time / byte-code read time / core execution time):

    1 / 2 / x    # Faster to compile outright.  Byte-code needs to be
                 # reimplemented so it does less; parallel byte-code
                 # doesn't help any.
    4 / 2 / .5   # We actually get a performance boost from the
                 # byte-code here, so it's not as critical that we
                 # shave off .4 total time (assuming that's even
                 # possible).
    x / 2 / 20   # Compile / loading time is irrelevant compared to
                 # the whole execution time.
    4 / 2 / 2    # Here is the main candidate for your parallel
                 # loader.  We could possibly interleave execution
                 # with byte-loading, shaving total time from 4s to
                 # possibly 2.1 or 3.

One variable is whether the extreme loading time is due to excessive disk size, or to excessive computation.
If it's due to disk size, then we have more problems to deal with, and a complete redo of byte-code should be done. If it's due to computation (memory allocation, etc.), then single-CPU implementations aren't going to gain anything. In fact, the additional context switching involved in MT on most systems is going to cost more overhead than you are likely to save in the operation. The only real winners are multi-CPU systems that are mostly idle or otherwise dedicated to many start-and-stop operations (say, a parallel make that heavily utilizes perl code).

My suggestion (which should probably become its own RFC) is that we not store raw byte-code to a ".p[lm]c" file, or what have you, but instead store a non-ambiguous, token-based version of the original English code. Essentially, there'd be two front ends to perl: the English-eval and the token-eval. Everything else is the same. The reason for this is that the source code seems to be an order of magnitude smaller than the compiled code, and there seems to be a great deal of compilation going on in order to map that file to memory. With the tokenized approach, the .pmc file would be significantly smaller than the original .pm file, since there would be no white-space, and all operators / local function calls would be reduced to single-byte operations. Analysis results and optimizations could be added to the token file as an expanded list of attributes that modify functions / variables (such as whether something is an int). The goal would be to remove the code analysis and optimization stages from subsequent compilations.

I don't know what pascal or python use for object code, though I'm sure it's effectively what perl's byte-code turns out to be (albeit significantly smaller). Java takes the approach of a true virtual machine, and uses an assembly-type language accordingly. Perl doesn't really seem able to take that route, but we'll see. It's an interesting topic, but we're not done with it yet.
-Michael
Re: RFC 301 (v1) Cache byte-compiled programs and modules
> So, we check for the existence of a C<.plc> file before running a
> program; if the C<.plc> file is newer than the program, we use that
> instead. If there isn't a C<.plc> file or it's older than the program,
> recompile and dump the bytecode to a C<.plc> file. Naturally, this gives
> us the best speedup for modules which change very, very infrequently,
> rather than programs which can change a lot during development. Maybe
> you only want to do this for modules, then.

I suggested this a while ago, and the response was that automatically writing files is a security risk. You should extend your RFC to describe a caching directory or configuration. Python optionally performed the disk write -- i.e. if the write failed, it went on about its business. The closest I could imagine to secure would be compiling to .pmc files at perl / module install time (by a privileged user), and disallowing write access to site_perl for anyone else. The creation of a .plc is somewhat different (since there's no installation process). As a security risk, tainted code could write out a binary file to replace the .plc file, and thus introduce a hidden virus that would be almost undetectable. Since the .plc file would be newer, it would never get recompiled (until a code change). Code analysis wouldn't reveal the existence of the virus either; the closest you could come would be a .plc checker, which regenerates the .plc and compares it against the file on disk.

Other issues are simply disk-space considerations. .plc files might be large, and dozens or hundreds of scripts would consume a great amount of disk space. Then you have to worry about compatibility: copying .pl* files from computer to computer _might_ cause problems if not implemented correctly (namely, a perl and module version should be associated with each .p[lm]c file). In general, I'm for the idea, but I figured I'd pass on the obstacles I encountered.
To further your RFC, I'd suggest working out a configuration for a .*c cache directory, which could put upper limits on disk-space usage for compiled files. Again, this is more of a consideration for .pl files, which could live just about anywhere on the system. I'd also suggest that cached-module candidates require a non-zero value of VERSION. That's all I can think of at the moment. -Michael
Re: RFC 172 (v1) Precompiled Perl scripts.
----- Original Message -----
From: "Perl6 RFC Librarian" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tuesday, August 29, 2000 10:20 PM
Subject: RFC 172 (v1) Precompiled Perl scripts.

> This and other RFCs are available on the web at
> http://dev.perl.org/rfc/
>
> =head1 TITLE
>
> Precompiled Perl scripts.
>
> =head1 ABSTRACT
>
> The functionality to store/load precompiled perl scripts may be built into
> the core.

You're on the right track. May I make a suggestion? OK, I'm obviously biased towards Python, but Larry _did_ say that we were allowed to steal back from other languages. The way python works is that in the library tree it looks for any of the following: .py, .pyc, and .so. You can probably guess which is which. If only the .py is found, python immediately writes out the .pyc when it finishes compiling. This happens every time (unless it can't write to the directory). In perl, we already have .pm and .so. They're in different directory trees, but that's mainly because of architectural differences; no reason a .pmc couldn't be written to the site_perl directory on the fly. The standard lib should probably generate the .pmc's at install time. I would write a separate RFC, but your title is too similar, so I'd rather coerce you into this line of thinking.

> =head1 DESCRIPTION
>
> There exists a possibility to do the thing. One can store compiled perl
> code in file and load it again via B::Bytecode. But the produced file
> is large. The loading time is comparable to compiling the code anew.
> Several simple experiments show that disk i/o times does not affect too
> much - almost all data are in cache. The B extension mechanisms and
> algorithms are too slow. Therefore it is unusable.

Obviously, perl still needs some work on the op-tree storage / retrieval mechanism before this becomes practical.

> Why bother? Consider the script that uses xml/xsl. It takes a lot of time
> to get such a script running.
> And the debugging process becomes harder
> with annoying initialization (every single .pm must be read by OS from
> numeric files and compiled from the very beginning). Another example is
> web server. Why lose CPU time to compile the same scripts?
>
> =head1 IMPLEMENTATION
>
> Perl may have an option to vary the target precompiled format. The format
> affects the generated precompiled file size. User may choose appropriate
> effective format to minimize loading time. It clearly depends on the
> system which runs the script.

-Michael
Re: RFC 146 (v1) Remove socket functions from core
> >I don't understand this desire to not want anything to change.
>
> You misread. I sympathise.

There are definite goals and focuses that each language is built around. Change these too much and you have a different language, while at the same time alienating the people who chose that language because of those met goals. Perl is fun, functional, and efficient. I can artistically sculpt a program in one line at the command prompt and do things that wield incredible power, and feel an incredible sense of happiness. At the same time, I can write a massive program and not have to worry about many of the mundane details that we learn in algorithms / data structures.

> >This is an
> >opportunity to clean up the language, make it more useable, and more fun.
> >I would have a lot more fun if perl were a better performer and if it was
> >easy for me to expand it, contract it, reshape it, improve it, etc.
>
> You will *not* improve the performance of the inner interpreter
> loop by having fewer opcodes to switch on. Whether the number is
> 20 or 200, it's the same thing--*think* about it.

Well, not strictly true. Though the raw code will not switch any faster, the bloat of the core will hurt you with non-locality, through cache misses or, worse, memory paging. A highly tuned 50K core interpreter will do much better than a 500K list of regularly used libraries. Essentially, it should be possible to perform profiling on suites of programs to find out what is really most needed, and only allow that in the core. The rest could be delegated to slower dynamic linking or outer-edge statically linked code. It's the general idea of RISC, of course. I have no real backing to say that this would actually improve performance, nor that it would even be possible to streamline the core of perl any further. Hell, C++ is only going to make the situation worse (performance lost to fun of development, which is acceptable).
Typically, when the core does nothing more than handle basic flow and memory management, you can have a rock-solid kernel. If you can build out upon this (all op codes seeming identical, be they built-in, statically compiled extensions, or linked in), then you will have a much more solid development base. This is the advantage of C, python, java, etc.: their extensions feel just like built-ins. Granted, perl has come a long way with giving keyword power to user-defines.

> Furthermore, It's
> been demonstrated that this supposed "gain", including in size, is
> negligible. I have yet to see one iota of justification for denuding
> perl of its math functions.

I would agree that math functions should not be compromised. If dynamically linking them seriously decreases performance, then we're going to see a massive slow-down in operation, since some crazed programmers actually like performing fast Fourier transforms. A 20-30% slow-down is not going to be well accepted (I'm guessing at the overhead of dynamic linking; I just hear how bad it is). As above, however, I don't see a problem with keeping them statically linked, so long as they're not part of the core.

> If math functions should be removed,
> then so too string functions. And that binmode silliness is in
> line before any of them, since it's absolutely useless because I
> never use it.

As in all things in life: moderation. Profiling should decide what compromises cleanness of code for the benefit of optimization. Many string operations are close to memory management -- array modification, string concatenation, simple assignment, references -- so I would advocate that they be part of the memory-management system, and thus part of the core.

> Perl is *not* fun when it supplies nothing by default, the way C does(n't).

I hear ya. Dynamic memory management (garbage collection, dynamic arrays, etc.), powerful string operations, powerful datatypes: these are what attract me to perl the most.
Not so much readily available socket functions. Sure, having java.util.* built into java would make it easier to use out of the box, but then there's more to learn about the internals, especially since they may change over time. I can more easily walk the lib tree and read pod files about each library to learn what I can do.

> If you want a language that can do nothing by itself, fine, but don't
> call it Perl. Given these:
>
> * Scalar manipulation
> * Regular expressions and pattern matching

At the very least, built-in reg-exes allow the compiler to make incredible optimizations which it simply couldn't make if they were function calls:

    $str =~ s/${var} xxx \s xxx/ abs( $1 ) /xegso;

would have to be:

    $reg_ex = gen_regex_replace( expr      => "${var} xxx \s xxx",
                                 subs      => "abs( $1 )",
                                 modifiers => "xegso" );
    apply_reg_ex( regex => $reg_ex, str => $str );

It would be non-trivial to get a compiler to handle this efficiently. Thus, I don't support any removal of expressions. I definitely don't support maki
Re: RFC 146 (v1) Remove socket functions from core
I would actually further this sort of activity. I admire micro-kernel-type systems. C and Java give you no functions out of the box; keywords are just that, keywords. I believe python is like this as well. The idea is that everything has to come from a module. This limits how much a new developer has to learn, and it doesn't pollute the global namespace pool. Though I would advocate OO interfaces to everything, I'm sure that would defeat a lot of perl's flexibility, and so I'd support the simple Exporter method.

From this, sockets and virtually all IPC methods should go by the wayside. This includes pipes, shell environment ops (the get and set functions), and even file I/O (open, close, and possibly even printf, sprintf). At this point it gets a little hairy, since arguably everybody wants print and <>; I would suppose that most want open as well. My personal taste is to use IO::File, but that's just me. Since these are just hooks into standard C libraries, I don't think they really take up any room. But it is hard to justify keeping some of these less-used features while taking out others (like FORMAT). -Michael