Subramanya Sastry wrote:
This note interweaves code optimization with discussion about IR because
the IR choice will be influenced by what you want to do with it. I am
going to use examples below to guide this discussion. A lot of it is
somewhat 'stream-of-thought' writing -- this is not a tight spec.
Thanks for jumping in, Subbu :)
Here 'call' could be a regular C call for all you care. So, what would
happen in the compiler is that you would first transform the code into
a higher-level IR (with virtual calls as above), and at some point in
the optimization, you would lower the IR one level, which might expose
other optimizations such as this one. If you took care to maintain SSA
form during transformations, optimizations such as eliminating
duplicate method lookups can be done by the SCCP (sparse conditional
constant propagation) algorithm, which is linear and pretty fast for
all practical purposes.
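To make "eliminating duplicate method lookups" concrete at the Ruby
source level, here is a toy sketch of the same idea done by hand
(Point and dist2 are made-up names; this is only an analogy for what
SCCP would do on the lowered IR, where the lookup is an explicit
instruction):

```ruby
# Hoisting a repeated method lookup by hand. A compiler working on a
# lowered IR would do this automatically once the lookup is visible as
# its own instruction.
class Point
  attr_reader :x, :y
  def initialize(x, y)
    @x, @y = x, y
  end
  def dist2
    x * x + y * y
  end
end

pt = Point.new(3, 4)

# Naive: each call performs a fresh method lookup on pt's class.
a = pt.dist2 + pt.dist2

# Hoisted: look the method up once, then reuse the bound Method object.
m = pt.method(:dist2)   # one lookup
b = m.call + m.call     # two invocations, no further lookups

puts a == b   # => true (both are 50)
```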
Yes, this fits what I was thinking as well. I believe part of our
problem in optimizing Ruby is that we only ever represent things at the
highest levels, ignoring the fact that there are very important
low-level operations (like function retrieval) that should enter into
analysis. So in essence, we've been going straight from our high-level,
coarse-grained IR (the AST) to the target language (JVM bytecode) with
only primitive analysis in between. What we want is to decompose the
high-level IR (perhaps after first making it a bit nicer than an AST)
into the *actual* component operations, and perform multiple levels
of analysis before producing the eventual result. Yes, I see it.
This also enters into what we talked about at the coffee shop, where
because MRI is green threaded we have some implicit boundaries at which
we must propagate cross-thread changes to global state (like method
tables). And that if we represent these operations at an IR level, we
may be able to coarsen things like method override checks until the next
implicit thread-context boundary. Very nice.
Open classes & JIT => All optimizations have to be guarded
----------------------------------------------------------
...
But, if you are going to JIT Ruby code and enter JVM/HotSpot land, you have
to build in optimization guards throughout the code. This means the IR has
to incorporate guard instructions. For example, here are a few types of
guards:
Ahh...yes, this is exactly what I had in mind!
type_guard(o, Fixnum, L)       <-- if type of o is not Fixnum, jump to L
class_guard(Fixnum, L)         <-- if Fixnum has been monkey-patched,
                                   jump to L
method_guard(Fixnum, '+', L)   <-- if Fixnum.+ has been monkey-patched,
                                   jump to L
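To make these guards concrete, here is a runnable sketch of the
open-classes problem they protect against. MyNum is a made-up
stand-in for Fixnum (monkey-patching a core class behaves the same
way); the point is that any inlined '+' becomes wrong the moment
someone reopens the class:

```ruby
# A stand-in numeric class; imagine the compiler has inlined its '+'.
class MyNum
  attr_reader :v
  def initialize(v)
    @v = v
  end
  def +(other)
    MyNum.new(v + other.v)
  end
end

sum = (MyNum.new(5) + MyNum.new(3)).v   # => 8, the fast-path assumption

# Open classes: anyone can redefine '+' at runtime...
class MyNum
  def +(other)
    MyNum.new(v * other.v)   # surprise! '+' now multiplies
  end
end

# ...so compiled code that inlined the old '+' must notice the change
# (method_guard) and branch to a slow-path dynamic call instead.
sum2 = (MyNum.new(5) + MyNum.new(3)).v  # => 15

puts [sum, sum2].inspect   # => [8, 15]
```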
What we have been doing currently, with our very limited optimizations,
is just "crossing our fingers" and hoping things like Fixnum#+ don't get
overridden. But I think if we're able to make these sorts of guards
just another part of the IR, we'll suddenly have far better
opportunities to do the right thing every time, without having to make
assumptions. The guards will perform their checks, the results will
propagate through the code, and when we hand off to HotSpot most of
them will get folded away.
I'm also proceeding with this discussion from a perspective that ignores
the JVM's limitations on bytecode size. Adding more logic to a method,
such as fallback branches for when guards fail, will often bite us
because of HotSpot's code-size and graph-complexity limits. But I think
the opportunity here is to produce a flexible enough compiler chain that
we can move that optimization bar up and down easily, taking advantage
of newer JVMs as they come out or allowing more optimizations if
additional Hotspot flags are provided.
The decomposition you're describing will definitely give us more
flexibility.
Clearly, there is a lot of overhead unboxing the objects -- but that
is because the objects are being passed in as arguments. But, you can
see that if this expression had one or both literals, the overhead
drops. If both were literals, the code reduces to boxing the result
into a Fixnum. But, if not, HotSpot might be able to inline the
'value' calls at this site and get rid of the unboxing overhead.
Yes, especially in Java 7 where the available escape analysis is much
more robust. If we are forced to allocate boxed values, we need to at
least make *damn* sure we can inline the code that produces or consumes
those boxed values, so we have a chance for them to compile and optimize
as part of the same compilation unit.
The experimental "dyncall inlining" flag in 1.3 may already be enabling
this, since I can see multiple levels of "fib" inline together into a
single unit of asm. If escape analysis is working like we expect, we
should see the boxing cost disappear for several levels of calls, only
returning when we have to actually CALL into a new body or RET back to a
previous one.
JRuby Inlining
--------------
Since HotSpot won't really be able to optimize the unboxing and add
type-based guards as above, at some point you might have to embrace
method inlining. For inlining, you will use the high-level IR.
...
Clearly, this is still simplistic because you might be boxing the 8
into a Fixnum prematurely, and it might never be needed. So, in order
to be able to optimize boxing and unboxing, it might be useful to
introduce IR instructions that encapsulate boxing and unboxing of
primitive types. Since the target is the JVM, you will introduce
boxing/unboxing IR instructions for JVM primitives. Fixnum and Array
are obvious candidates.
So, box and unbox IR instructions could be:
n = box_primitive(Fixnum, v)
v = unbox_primitive(Fixnum, n)
If I remember right, I've seen similar operations in MacRuby's calls to
LLVM, and I presume they get the same optimization from it that we want.
With this, the inlined and optimized code above would become:
...
method_guard(Fixnum, '+', L1)
n = box_primitive(Fixnum, 8)
jump L2
L1:
args = [5]
n = ocall 3, '+', args
L2:
...
Again, we have two perspectives:
If we assume no limitations on the code size and complexity we can
generate (or if we assume we will always be able to work around those
limitations) then emitting inlined logic with appropriate guards will be
fairly easy. As you show here, we'll perform our method and type guards,
with the fast path immediately following successful tests and a branch
to a slow path following failed tests.
If we are pessimistic and assume that code size limitations are going to
bite us, then we need to be fairly selective about what we inline. Math
operations are obvious, as are other core class operations and
frequently-used closure-receiving methods like Array#each. Again, having
a flexible compiler chain will allow us to slide this bar up and down.
Reality will surely lie somewhere in between, but I think we should keep
both extremes in mind while realizing there's more "future" than there
is "present" waiting for the fruits of our labors :)
The next problem is to propagate type information throughout a method
to allow optimizations to be as global as possible. In the
optimization analyses, you have the option of carrying around the
state implicitly (carried along in hashtables while running analyses),
or explicitly (by introducing attribute instructions in the IR). The
implicit approach is easy and preferable for cases where information
about a variable is not use-site specific (constant propagation, for
example). But for cases where this is not true, the explicit approach
is preferable. For example, the type of an object along the two paths
of a branch would be different: a variable 'v' (in SSA form) might
have type 'Fixnum' along one path and 'String' along another.
I prefer the explicit approach:
- It lends itself to easier debugging, because when you dump the IR,
  you can see what information is being generated.
- You convert use-site-specific information into attributes that are
  no longer use-site specific (because every branched path gets its
  own SSA attribute).
- By converting use-site-specific information to constants, you can
  incorporate transformations as part of regular constant propagation
  via SCCP.
I used this approach to good benefit in the JVM compiler I wrote for
my dissertation, where it eliminated null checks and array bounds
checks.
Yes, it seems to me the explicit approach would provide a much wider set
of opportunities for us. A key example is turning this:
5.times {
# stuff
}
... or this ...
a = 0
while a < 5
# stuff
a += 1
end
...into a primitive loop. That's only possible if we're able to tell
that at each of these sites 'a' is always provably a Fixnum (assuming
Fixnum operations have not been violated) and then produce a
call-free, boxing-free loop. This is where we get into the real magic
that a good IR and tool chain can give us; with solid type
propagation, these kinds of transformations just fall out.
So, with this in mind, you will now augment all IR instructions to
include an array of attributes. The generic IR form will now be:
v = OP(operands, attributes)
When you run SCCP on this, you also run SCCP on the attribute values.
This lets you use attributes as additional meta-information to
optimize that instruction.
For example, consider this sequence of IR instructions:
v = 5
a = box_primitive(Fixnum, v, [])
a_attr = [BOXED_FIXNUM(v)]
...
... a is not used anywhere here ...
...
b = a
b_attr = a_attr
x = unbox_primitive(Fixnum, b, [b_attr])
y = x
When you run SCCP on this, here is what happens to the unbox_primitive
instruction:
x = unbox_primitive(Fixnum, b, [[BOXED_FIXNUM(v)]])
Thus, you know that you are unboxing a boxed value 'v'. So, you can
simplify this instruction to:
x = v
So, when you run SCCP, the above sequence reduces to (I am excluding
some useless copies here):
v = 5
a = box_primitive(Fixnum, v, [])
...
y = 5
After you run dead code elimination, assuming a is not used anywhere,
the code simplifies to
v = 5
y = 5
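The folding step above can be mimicked as a toy Ruby pass over a
list-of-tuples IR. This is only a sketch -- the instruction encoding
and the fold_box_unbox name are mine, and a real SCCP pass would work
over SSA with a worklist rather than a single forward sweep:

```ruby
# Instructions are [result, op, *operands]. A forward pass records which
# variables hold box_primitive results, then rewrites any unbox of such
# a variable into a plain copy, following copy chains through.
def fold_box_unbox(instrs)
  boxed  = {}   # var => the primitive variable it boxes
  copies = {}   # var => the var it copies
  instrs.map do |(res, op, *args)|
    case op
    when :box_primitive
      boxed[res] = args[1]              # args = [klass, value_var]
      [res, op, *args]
    when :copy
      src = args[0]
      copies[res] = src
      [res, op, *args]
    when :unbox_primitive
      src = args[1]                     # args = [klass, boxed_var]
      src = copies[src] while copies[src]
      if boxed.key?(src)
        [res, :copy, boxed[src]]        # unbox(box(v)) ==> v
      else
        [res, op, *args]
      end
    else
      [res, op, *args]
    end
  end
end

ir = [
  [:v, :const, 5],
  [:a, :box_primitive, :Fixnum, :v],
  [:b, :copy, :a],
  [:x, :unbox_primitive, :Fixnum, :b],
  [:y, :copy, :x],
]
folded = fold_box_unbox(ir)
p folded[3]   # => [:x, :copy, :v]
```

Dead code elimination would then sweep away the now-unused box, just
as in the reduction above.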
Here's an attempt to show this using my loop example from above:
Basic pseudo-IR
v = 0
a = box_primitive(Fixnum, v, [])
b = ocall a, '<', [box_primitive(Fixnum, 5, [])]
if_false [b, :end_label]
top_label: # stuff
a' = ocall a, '+', [box_primitive(Fixnum, 1, [])]
b' = ocall a', '<', [box_primitive(Fixnum, 5, [])]
if_true [b', :top_label]
end_label: .....
After propagating types
v = 0
a(Fixnum) = box_primitive(Fixnum, v, [])
b(Boolean) = ocall_typed Fixnum, a, '<', [box_primitive(Fixnum, 5, [])]
if_false [b, :end_label]
top_label: # stuff
a'(Fixnum) = ocall_typed Fixnum, a, '+', [box_primitive(Fixnum, 1, [])]
b'(Boolean) = ocall_typed Fixnum, a', '<', [box_primitive(Fixnum, 5, [])]
if_true [b', :top_label]
end_label: .....
And reducing/inlining based on those propagated types:
a = 0
if_ge a, 5, :end_label
top_label: #stuff
a' = add a, 1
if_lt a', 5, :top_label
end_label:
As long as Fixnum's *operations* remain as expected, this is 100% safe
and legal.
Summary:
--------
This is a tremendous start! Thanks so much for jumping in with this. I
think we're on the right track.
And this list is a perfect place for us to start prototyping a few
simple localized examples...
1. Every IR instruction is basically of the form
v = OP(operand_array, attribute_array)
Since this is a high-level IR, you are not bound to a 3-operand format.
You can have an array of operands and an array of attributes.
But, you can also have IR instructions of the form:
v = OP(op1, attribute_array)
v = OP(op1, op2, attribute_array)
2. You have regular ALU-style IR instructions (add, sub, cmp, lshift,
etc.)
3. You have various flavors of call instructions:
v = call(method, args)
v = ocall(receiver, method, args)
... and anything else you can think of ...
4. You have a method-address instruction:
m = meth_addr(receiver, method)
5. Instructions to extract function/method arguments and return function
values:
x = meth_arg(arg_index)
return y
6. You have various kinds of guard instructions:
type_guard(obj, class, L)        <-- if type of obj is not class,
                                     jump to L
class_guard(class, L)            <-- if class has been monkey-patched,
                                     jump to L
method_guard(class, method, L)   <-- if class.method has been
                                     monkey-patched, jump to L
7. You have box/unbox instructions for primitives:
n = box_primitive(class, v)
v = unbox_primitive(class, n)
8. You have attribute instructions for passing around meta-information
(types, etc.)
init_attr(a_attr, [])
add_attr(a_attr, X)
Note:
- a_attr is just a naming convention. As far as representation is
  concerned, it is just another SSA IR variable.
- You need to identify the different types of meta-information you
  might pass around. The X above is a placeholder for that.
  Some kinds of X are:
  - BOXED_FIXNUM(n)
  - BOXED_ARRAY(a)
  - HAS_TYPE(class)
  ...
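As a tiny starting point for the prototyping mentioned above, the
generic instruction shape from (1) and the attribute instructions from
(8) can be sketched in a few lines of Ruby. The Instr name and the
printed format are mine, not anything from JRuby:

```ruby
# v = OP(operand_array, attribute_array), as one Ruby value per
# instruction. Attributes are just more data riding along with the op.
Instr = Struct.new(:result, :op, :operands, :attributes) do
  def to_s
    parts = operands.map(&:to_s)
    parts << attributes.inspect unless attributes.empty?
    "#{result} = #{op}(#{parts.join(', ')})"
  end
end

# v = 5; a = box_primitive(Fixnum, v); a_attr carries BOXED_FIXNUM(v)
prog = [
  Instr.new(:v, :const, [5], []),
  Instr.new(:a, :box_primitive, [:Fixnum, :v], []),
  Instr.new(:a_attr, :init_attr, [], [[:BOXED_FIXNUM, :v]]),
]
prog.each { |i| puts i }
# => v = const(5)
#    a = box_primitive(Fixnum, v)
#    a_attr = init_attr([[:BOXED_FIXNUM, :v]])
```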
This is it for the 1st pass. More later with an example of closures.