Todd and I read through your treap implementation today.  Very nice, indeed.  
We did notice that there's a minor issue on line 356 (and similarly 357):

arg1Type ::= ordMetaType's parameters'type[1];

Although it might not be apparent from the method name itself, you should 
probably add a space between the apostrophe and "type[1]".  Breaking down the 
name "_'s⁇type", we expect an argument, an apostrophe, an optional "s" token, 
and a "type" token.  Whitespace is currently always allowed (but not required) 
around any operator character.  Each operator character is parsed as a separate 
one-character token, which vastly simplifies parsing compared to C++ or 
Smalltalk (much standardization effort was spent defining how 5--3 should be 
parsed).  A strange thing you might have noticed is that in Avail, expressions 
like "x:=5", "x := 5" and "x:   =5" are all parsed as an assignment (which will 
soon be just a use of the "_:=_" macro).
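As a toy illustration of that lexing rule (in Python, and only a caricature of the real lexer -- the `tokenize` helper is mine, not Avail's):

```python
import re

# Toy sketch: alphanumeric runs form multi-character tokens; every other
# non-whitespace character is its own one-character operator token; and
# whitespace around operators is allowed but never required.
TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|\S)")

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:        # only trailing whitespace remains
            break
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

# "x:=5", "x := 5", and "x:   =5" all produce the same token sequence,
# which is why all three parse as the same assignment.
assert tokenize("x:=5") == tokenize("x := 5") == tokenize("x:   =5") \
    == ["x", ":", "=", "5"]
```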

Eventually we may rework whitespace rules to support forbidden, mandatory, 
optional, and significant whitespace between tokens.  The last one is to 
support whitespace-sensitive syntax like Python.  That functionality would show 
up as changes to the interpretation of method names (see MessageSplitter.java), 
as well as analysis of the whitespace that we started capturing inside the 
adjacent tokens several weeks ago.  We were thinking of a space indicating 
required whitespace, a space followed by the double question mark character as 
being optional whitespace (except between alphanumeric tokens), and no space 
indicating whitespace is disallowed.  Alternatively we may allow specific 
whitespace characters like tab (\t) to be back-ticked (`) to indicate precisely 
which whitespace character is required or optional (with double question mark). 
 Maybe a back-ticked new line (`\n) would indicate a forced line break plus 
maintaining the current indent level.  And a back-ticked new line followed by 
one or more back-ticked tabs would indicate a Python-like indentation increase 
by the specified tab count.
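Modeled as a lookup from the separator text between two method-name tokens to a rule, the proposal might look like this (purely speculative -- none of it is implemented, and the helper name is mine; "⁇" is the double question mark character):

```python
def rule_between_tokens(separator):
    """Map a proposed method-name separator to a whitespace rule."""
    if separator == " ⁇":
        return "optional"       # space + double question mark: allowed, not required
    if separator == " ":
        return "required"       # bare space: whitespace mandatory
    if separator == "":
        return "forbidden"      # no space: whitespace disallowed
    if separator.startswith("`") and len(separator) == 2:
        # A single back-ticked character, e.g. "`\t": exactly that
        # whitespace character is required.
        return ("exactly", separator[1:])
    raise ValueError("unrecognized separator: %r" % separator)

assert rule_between_tokens(" ") == "required"
assert rule_between_tokens(" ⁇") == "optional"
assert rule_between_tokens("") == "forbidden"
assert rule_between_tokens("`\t") == ("exactly", "\t")
```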

I'm not sure what your issue #3 is about exactly, but I suspect it has to do 
with functions being composed of raw functions (CompiledCodeDescriptor, also 
known as function implementations) and outer variables.  The raw functions have 
identity because they have to track (in a secret atom stored in 
CompiledCodeDescriptor.ObjectSlots#PROPERTY_ATOM) some properties like the 
defining module, starting line number, and a suitable name for a function based 
on this raw function.  And we'll add more properties like annotated parse trees 
and hyperlinks as we extend our tool support.  The name property gets set when 
a function is used as a method body, semantic restriction body, macro body, or 
anything similar.  When the raw function is given its name (e.g., see 
P_253_SimpleMethodDeclaration, line #101), any raw functions lexically nested 
in it are given suitable names derived from the parent function's name ([#1] of 
[#3] of "_+_").  Those names are what make Avail stack traces useful, so 
equating and coalescing functionally equivalent raw functions would cause a big 
usability problem.
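The derived-name scheme can be sketched like so (illustrative only; the `nested_name` helper is hypothetical, not the VM's actual code):

```python
# Nested raw functions get names derived from the parent function's name,
# in the style of the [#1] of [#3] of "_+_" example above.
def nested_name(parent_name, index):
    """Name the index-th raw function lexically nested in the parent."""
    return '[#%d] of %s' % (index, parent_name)

top = '"_+_"'                      # a method body's raw function
inner = nested_name(top, 3)        # its third nested raw function
innermost = nested_name(inner, 1)  # the first one nested inside that
assert innermost == '[#1] of [#3] of "_+_"'
```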

But getting back to identity... a raw function is defined with fiat identity.  
Its o_Equals() and o_Hash() operations cause two raw functions to be considered 
equal only if they are represented by the same (Java ==, after following 
indirections) AvailObject.  Generally speaking, a raw function is lexical:  it 
corresponds to a particular sequence of characters at a particular position 
within a particular source file.  So two raw functions that were created from 
different source files or different regions of the same source file would be 
treated as distinct.
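A toy model of that fiat identity (in Python, not the actual Java VM code): Python's default equality and hashing are identity-based, which mirrors what o_Equals() and o_Hash() do here.

```python
# Two raw functions created from different places are never equal, even if
# their properties and instructions happen to be identical.
class RawFunction:
    def __init__(self, module, line, nybblecodes):
        self.module = module          # defining module (a property)
        self.line = line              # starting line number (a property)
        self.nybblecodes = nybblecodes
    # No __eq__/__hash__ overrides: the defaults compare by identity,
    # mirroring o_Equals()/o_Hash() on raw functions.

a = RawFunction("Treap", 356, b"\x01\x02")
b = RawFunction("Treap", 356, b"\x01\x02")  # same text, distinct identity
assert a == a
assert a != b          # structurally identical, still considered distinct
```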

A function is just a combination of a raw function and its captured outer 
variables.  Its equality test compares these corresponding parts, and its hash 
value is derived from these parts.  But since the raw function has identity, 
the function sort of does, too.  Two functions created from the same piece of 
code, like the "[1+1]" in the expression "map 1 to 10 through [x : integer | 
[1+1]]" are equal if their captured variables (outers) are equal.  In this case 
there are no outer variables, so the ten functions are equal.  But in "map 1 to 
10 through [x : integer | [x]]", the functions in the resulting tuple are expected 
to produce different values when evaluated (e.g., the 3rd one is functionally 
equivalent to the function [3]).  They each capture the x argument from the 
enclosing function, so this time the tuple contains ten distinct functions.  
This is probably about as useful a mechanism as we can provide without 
sacrificing a lot of optimization opportunities.
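Here's the same idea as a toy model (Python, not the VM's representation): a function pairs a raw function, compared by identity, with a tuple of outers, compared by value.

```python
class Function:
    """Toy function: a raw function plus its captured outer values."""
    def __init__(self, raw_function, outers):
        self.raw = raw_function            # compared by identity
        self.outers = tuple(outers)        # compared structurally
    def __eq__(self, other):
        return (isinstance(other, Function)
                and self.raw is other.raw
                and self.outers == other.outers)
    def __hash__(self):
        return hash((id(self.raw), self.outers))

raw_const = object()    # stands for the raw function behind "[1+1]"
raw_capture = object()  # stands for the raw function behind "[x]"

# "[1+1]" captures nothing, so all ten functions are equal.
no_capture = [Function(raw_const, []) for _ in range(10)]
assert len(set(no_capture)) == 1

# "[x]" captures x, so the tuple holds ten distinct functions.
captures_x = [Function(raw_capture, [x]) for x in range(1, 11)]
assert len(set(captures_x)) == 10
```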

A further detail makes this even more complicated.  The level one nybblecode 
instruction L1_doPushLastOuter pushes the specified outer variable of the 
current continuation's function.  If, however, that function is still mutable, 
it is free to clobber that outer variable slot (with the Avail integer 0), 
rather than force the variable (or the actual value if it's a function argument 
like x, above) to become immutable (due to both the newly pushed stack slot and 
the function holding references to the same value or variable).  But, assuming 
the function actually was mutable, after we've zeroed that outer variable slot 
of the function, its hash calculation will be different from what it would have 
been before the L1_doPushLastOuter instruction ran.  And it will no longer be 
considered equal to a function for which the raw function and all the rest of 
the outer variable slots are intact, but the outer variable slot corresponding 
with the clobbered slot still holds its previous value or variable.
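A toy model of that wrinkle (again Python, nothing like the real nybblecode machinery): pushing the last use of an outer from a still-mutable function clobbers the slot instead of making the value immutable, so the function's hash changes.

```python
class MutableFunction:
    """Toy mutable function whose hash depends on its outer slots."""
    def __init__(self, raw_id, outers):
        self.raw_id = raw_id
        self.outers = list(outers)
    def current_hash(self):
        return hash((self.raw_id, tuple(self.outers)))
    def push_last_outer(self, index):
        # Mimics L1_doPushLastOuter on a mutable function: hand out the
        # value and zero the slot rather than make the value immutable.
        value = self.outers[index]
        self.outers[index] = 0
        return value

f = MutableFunction(raw_id=1, outers=[3])
before = f.current_hash()
assert f.push_last_outer(0) == 3   # the value is pushed...
assert f.outers == [0]             # ...and the slot is clobbered
assert f.current_hash() != before  # so the hash is no longer stable
```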

This is a tricky situation for the VM.  On the one hand, we want functions to 
behave like completely immutable objects, relying on identity as little as 
possible, to keep the execution model clean.  On the other hand, we want to 
keep values, including functions, mutable as long as we can.  I think the 
current mechanism is a pretty good compromise, but in theory you might run into 
bizarre problems whereby you could compute the hash value of a (mutable) 
function at different times and get different answers.  Technically, this isn't 
currently a possibility, since whenever you hold one of these mutable functions 
in a variable in order to extract its hash or compare it against another 
function, it will already have been made immutable.  The one other concern I 
have is that the list of outer variables that are still "live" (not zeroed) 
depends on the time at which you make the function immutable.  I believe it's 
currently impossible to observe distinctions between functions that are based 
on the same raw function but with some slots zeroed in one function but not 
another.  That's because there doesn't seem to be a mechanism to access the 
current continuation's function without also making it immutable at the same 
time, and there can't be two code positions that capture the function at 
different points in its execution.  Either the paths capture the function at 
some particular instruction or they don't.  If a particular instruction 
captures it, the function is made immutable, suppressing any other outer 
variable zeroing for that function.  If the instruction doesn't capture it, the 
outer variable slot of the function won't be affected – but we unconditionally 
won't be capturing the function there, so we can't compare the functions 
created by the two situations: they're statically mutually exclusive by 
construction.

So long as we're mindful of L1_doPushLastOuter's issues when building tools 
(and we have to be aware of other mutability-preserving L1 instructions 
anyhow), we shouldn't mind too much that some outer variable of the function 
may have been clobbered with a 0 by the time we actually look at it with a 
debugger or whatever.  If we eliminate the mutability-preserving L1 
instructions we'll end up with an extremely clean model... but the L2 optimizer 
will have to satisfy that clean model in its off-ramps by making a lot more 
things immutable.  And for the L2 off-ramps (places corresponding to 
inter-L1-nybblecode gaps) to be able to accurately generate (on demand) the 
values that will appear in reified L1 continuation slots, L2 will have to 
preserve more information and recycle fewer mutable objects.  So a perfectly 
clean L1 prevents a lot of (proposed) optimizations in L2 which would simply 
wipe values that the existing L1 semantics say can be safely discarded by the 
execution machinery.  I'm not perfectly happy with this model, but it's the 
best compromise that we've attempted.

And finally... I published a fix for #2 -- better integer hash distribution.  
Now you should be able to safely remove your "_'s⁇rehash" method.
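For anyone curious why distribution matters, here's an illustration (not Avail's actual fix): a raw identity hash maps strided integer keys to the same bucket, while a multiplicative/xor-shift finalizer (MurmurHash3's fmix32, shown here) spreads them out.

```python
def fmix32(h):
    """MurmurHash3's 32-bit finalizer: a bijective bit mixer."""
    h &= 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

buckets = 16
keys = range(0, 160, 16)                       # strided keys: 0, 16, 32, ...
naive = {k % buckets for k in keys}            # identity hash, then modulo
mixed = {fmix32(k) % buckets for k in keys}    # mix first, then modulo
assert naive == {0}      # every key collides in bucket 0
assert len(mixed) > 1    # mixing spreads the keys across buckets
```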

On May 28, 2014, at 2:11 PM, Robbert van Dalen wrote:

> hi,
> 
> i’ve done a treap implementation (v0.1) in avail.
> you can find the stuff here:
> 
> https://github.com/rapido/spread/blob/master/avail/Treap.avail
> 
> three things i’ve noticed:
> 
> 1) the performance is not so good
> 2) integer hashes are not very well distributed
> 3) the hash of an immutable function is not deterministic
> 4) lurking avail’s (public) mailing list is not possible anymore
> 
> 1) will be solved within the couple of years - but i do want to hear from you 
> what are currently the performance killers.
> 2) is easy to fix
> 3) is more fundamental with no clear answers
> 4) ?
> 
> other than that, i’m greatly surprised by avail’s powerful type system - in a 
> very very positive way!
> also, the compiler’s error messages were very helpful in steering me in the 
> right direction.
> 
> man, avail is completely changing my thoughts about how computer languages 
> should work.
> avail has been the subject of my day dreams for the past 4 weeks.
> 
> further steps: add a blog entry!
> 
> thanks and cheers,
> Robbert.
