On Sat, Sep 11, 2010 at 11:47 AM, Rémi Forax <[email protected]> wrote:
>> Now that's an interesting thought. Ever since the JVM Language Summit and
>> Cliff Click's analysis of some Jython code, it's been quite obvious that
>> storing every Python local name into the heap-allocated PyFrame associated
>> with that function call is just a performance drain. Among other things,
>> it defeats escape analysis. So it's good to think about this again.
>>
>> Maybe if we did some sort of barrier where we only did the stores into
>> the PyFrame, either before a downcall or before coming out of an
>> exception, we could preserve Python frame semantics without always paying
>> their cost. This would be especially true if we guarded downcalls such
>> that those not calling back into Python don't actually invoke the
>> barrier. So in particular this could mean something with tight loops,
>> often similar to this contrived code
>>
>>     for i in xrange(1000):
>>         for j in xrange(1000):
>>             a = i * j
>>             # ...
>>
>> could get a lot cheaper, and ideally the JVM would elide away the
>> allocations of PyInteger (wrapping int) - xrange certainly has no need to
>> access the frame.
>
> Here, you should do a typechecking analysis to prove that i and j can
> be encoded as native Java ints, and generate two for loops.
>
> See page 8 of my presentation at the last JVM Summit
> (http://wiki.jvmlangsummit.com/images/2/2a/Jvmsummit-phpreboot.pdf).

Certainly some nice performance numbers in PHP.reboot! Without
invokedynamic we have to rely on escape analysis and on users doing the
standard optimization of localizing global names. (The global namespace can
potentially be altered at any time, by any thread having access to the same
PySystemState, although, as we all know, it really isn't.) But see below
too.

>> We do something conceptually similar in our support of ContextGuard to
>> minimize callpath costs in the with-statement. In particular, this
>> allowed us to significantly reduce locking overhead; see Tobias
>> Ivarsson's blog post on this:
>> http://journal.thobe.org/2009/07/improving-performance-in-jython.html
>>
>> Also, we actually do barriers in our support for the Python yield
>> statement in generators/coroutines, where we store the Java local
>> variables (not visible to Python itself, these are the unnamed variables
>> such as the iterator returned from the above xrange call) in the PyFrame
>> so they can be reconstituted upon the next iteration of the generator.
>>
>> So this approach is something we should definitely look at - it would
>> require minimal hacking on the current compiler, plus the guard support.
>
> I'm not a big fan of guards without invokedynamic.
> You usually end up generating the fast path *and* the slow path
> and reach the maximum code size.

The guard I'm thinking of here would just require a typecheck on the
PyObject through which the downcall is being dispatched; only if that
object does not implement a marker interface (say NoFrame) will any
outstanding Java locals be written to the PyFrame. So it will not inflate
any code. We also don't need to read the PyFrame upon return from the call
(that's fine with Python semantics). What's nice is that we could
definitely implement this in our current release cycle (Jython 2.5.2); it
just needs a very small amount of tracking of local assignments to be
added.
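To sketch what I mean (illustrative only, not actual Jython source - the
NoFrame marker, the helper method, and the slot numbers are invented for
the example; PyObject/PyFrame are the existing runtime classes):

    import org.python.core.PyFrame;
    import org.python.core.PyObject;

    // marker for callables known never to observe their caller's frame
    interface NoFrame {}

    class FrameBarrier {
        static PyObject downcall(PyObject callee, PyFrame frame,
                                 PyObject i, PyObject j, PyObject arg) {
            if (!(callee instanceof NoFrame)) {
                // barrier: flush the outstanding Java locals to the frame
                frame.setlocal(0, i);
                frame.setlocal(1, j);
            }
            // a NoFrame callee (xrange, say) skips the stores entirely;
            // no read-back from the frame is needed on return - Python
            // semantics don't require it
            return callee.__call__(arg);
        }
    }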
> Of course it's OK if you are able to bail out into an interpreter for
> slow/uncommon cases.

Which is definitely doable in Jython 2.6. The Python bytecode VM already
supports jumps to an arbitrary instruction, again for generator support.
And it uses the same PyFrame structure. So we just need to throw an
exception upon a violation of any typing constraints, save the locals in
the frame on the way out, and then resume within the interpreted version.
It should be just that straightforward. So it sounds like our PBC VM might
end up being the ordinary way we execute code in Jython. That's a rather
ironic twist!

>>> As you already know, I don't think that an IR-based compiler
>>> is a good idea. javac is not an IR-based compiler and
>>> Java is fast.
>>
>> I think we have been convincing ourselves that we need an IR-based
>> compiler. Tobias worked on one using SSA, but it proved to be a fair
>> amount of work, and it has not been completed. Jython's current compiler
>> is very simple/naive - AST-directed, with the modest amount of scope
>> analysis required for Python's lexical scoping (approx 360 lines).
>
> And what about adding a typechecking analysis?
> After all, the bytecode is statically typed.
>
>> It has been pretty easy to hack on this compiler, whether for a new
>> language feature or the occasional bug fix. This is because of the
>> strong degree of locality and the easy correspondence from a broken unit
>> test to the resulting code path.
>>
>> Rémi, you may be right - maybe we just need to continue to generate
>> naive bytecode and let the JVM sort it out. Just somewhat better naive
>> bytecode, that's all.
>
> It depends what 'naive' means. Java is fast because the VM optimizes
> common bytecode patterns.
> So if you generate the same code as javac, I mean literally,
> then you will have performance :)

And that's why the type profiling matters so much here. Then we can convert
an idiomatic xrange iteration into a for loop. (Or, for that matter,
range.) In general, I would think a lot of idiomatic Python code can be
readily rewritten to be Java idiomatic.
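For instance, the contrived nested loop above should ultimately compile
down to the moral equivalent of what javac would emit - a sketch, assuming
the profile shows i and j stay in int range and nothing observes the frame:

    // hypothetical specialized form of the nested xrange loop
    for (int i = 0; i < 1000; i++) {
        for (int j = 0; j < 1000; j++) {
            int a = i * j;  // no PyInteger allocation, no PyFrame stores
            // ...
        }
    }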
> That's why I think that creating a compiler that uses an SSA form is a
> bad idea. SSA form is used in the backend of a traditional compiler
> (I mean since 2000) to be able to generate efficient assembly.
> If you are on top of a JVM, the JIT already does SSA optimizations.

Makes total sense to me.

> Your runtime environment should have:
> - a type profiler that records the types of variables
>   (just separating objects from primitives can be sufficient);
> - a taken/untaken branch recorder,
>   to avoid type pollution from untaken branches;
> - a coarse-grained live variable analysis,
>   to get use-define info for any variable;
> - a data-flow analyzer that propagates exceptional events
>   (a function doing reflection on stack frames, etc.);
> - a fast-forward typechecking analysis,
>   so you can prove the type assumptions derived from the type profiler
>   (this avoids generating guards where they are not necessary).
>
> Your generator should:
> - use the previously calculated types to generate specialized code for a
>   function;
> - generate code-preamble and post-code instructions
>   to go back and forth between the interpreter and the generated code;
> - generate inline caches (or better, share the inline caches with the
>   interpreter :);
> - generate code to go back into the interpreter if an unusual thing
>   happens (an untaken branch is taken, etc.);
> - intrinsify some common constructs
>   (a call to math.sin() should be a direct call, etc.);
> - inline blocks of code like closures, remove yield, etc.
>
> I'm surely forgetting something.
>
> Also some items can be hard without invokedynamic/method handles
> or the corresponding abstractions
> (mutable polymorphic callsites and function pointers).

Jython 2.6 will be using invokedynamic, either with Java 7 or with your
backport work for Java 6. I like all your suggestions; they seem generally
applicable to Jython. The key insight is that interpreters are extremely
useful for dynamic languages on the JVM!

- Jim
