As I have said (many times) before, method handles/indy should be built on
a minimal, interpreter-supported calling convention:

invoke (that can box/unbox at runtime)

All method handles should then be written as static bytecode working with
objects (no dynamic generation of specialized bytecode); see the reference
that Marcus already gave:
https://blogs.oracle.com/ohrstrom/entry/pulling_a_machine_code_rabbit
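
To make the idea concrete, here is a minimal sketch of what I mean (my
illustration, not JRockit's actual code; the names GenericHandle and
invokeGeneric are made up): every combinator is ordinary static bytecode
over boxed objects, written once.

    // A hypothetical generic calling convention: every handle is plain,
    // statically compiled code taking and returning boxed Objects.
    abstract class GenericHandle {
        abstract Object invokeGeneric(Object... args) throws Throwable;
    }

    // An "insert argument" combinator is then just another static class;
    // nothing is generated at runtime.
    final class InsertArgument extends GenericHandle {
        private final GenericHandle target; // final: a constant to the JIT
        private final int pos;
        private final Object value;

        InsertArgument(GenericHandle target, int pos, Object value) {
            this.target = target;
            this.pos = pos;
            this.value = value;
        }

        @Override
        Object invokeGeneric(Object... args) throws Throwable {
            Object[] expanded = new Object[args.length + 1];
            System.arraycopy(args, 0, expanded, 0, pos);
            expanded[pos] = value;
            System.arraycopy(args, pos, expanded, pos + 1,
                    args.length - pos);
            return target.invokeGeneric(expanded);
        }
    }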

Then specialization happens at JIT time. The JIT is the most efficient
place for specialization; in fact, specialization is the raison d'être of
the JIT.

The JIT must support boxing removal and constant propagation, and it can be
tuned to deal with method handle chains; these are in fact simpler than the
standard generic inlining situation.

The JIT then has a golden opportunity to inline the method handle chain,
and perhaps even the target, into the indy callsite. Tons of information
will be final and usable for specialization.
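
Sketched on the classes above, a usage example showing the generic path
and the specialized form the JIT should be able to reduce it to (again,
just my illustration of the principle):

    final class Demo {
        static Object concat(Object a, Object b) {
            return String.valueOf(a) + b;
        }

        public static void main(String[] args) throws Throwable {
            // Leaf handle: a direct call wrapped in the generic convention.
            GenericHandle target = new GenericHandle() {
                @Override
                Object invokeGeneric(Object... a) {
                    return concat(a[0], a[1]);
                }
            };
            // The chain an indy callsite would hold; all fields are final.
            GenericHandle chain = new InsertArgument(target, 1, "!");

            // Generic path: Object[] allocation and boxing at every link...
            Object generic = chain.invokeGeneric("hot");

            // ...which, once the final chain is inlined into the hot
            // callsite, can collapse to the equivalent direct call:
            Object specialized = concat("hot", "!");

            System.out.println(generic.equals(specialized)); // true
        }
    }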

Calle Wilund and I implemented such an indy/method handle solution for
JRockit, so I know it works. You can see a demonstration here:
http://medianetwork.oracle.com/video/player/589206011001 That
implementation jumped to C code that performed the invoke call, with no
fancy optimizations. The interpreter implementation of invoke can be
optimized as well; that is what the first half of the talk is about. But
it's really not that important for speed, because the speed comes from
inlining the invoke call chain as early as possible after detecting that an
indy is hot.

//Fredrik

2014-08-25 11:32 GMT+02:00 Marcus Lagergren <marcus.lagerg...@oracle.com>:

> Regarding indy dense code:
>
> It is certainly a problem, both for JRuby with indy and for Nashorn with
> indy, that indy scalability is so bad in the 9 builds with the current
> JITs. I suspect that as Java 8 grows as a code base and as a language, it
> will turn into a problem for Java 8 lambdas too. Nashorn generates a lot
> more code to pick the correct types and generate faster code, and this
> means a lot more indys. This means a lot more lambda forms. This means a
> lot more metaspace. This means a lot longer warmup. And lambda form code
> never really has a chance to be properly optimized - sometimes simply
> because the JIT stops inlining, and sometimes because java.lang.invoke is
> full of boxing and arraycopies that simply don't go away.
>
> As Charlie pointed out, an invokedynamic callsite is generated as a
> separate method in a separate (albeit anonymous) class, which eventually
> loads up the metaspace with tremendous amounts of stuff. Sergey Kuksenko
> gave a very interesting performance analysis presentation at JVMLS this
> year, where ~41% of his runtime for Nashorn on octane.box2d was uninlined
> lambda forms. And this is basically just the mechanics of pushing
> parameters and applying filters around the callsite. It seems like lambda
> forms (or rather indy callsites) have to be treated specially by the JIT.
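>
> As a concrete way to see this (my example; the system property is an
> internal, unsupported OpenJDK 8 knob, so treat the exact name as an
> assumption), even a tiny combinator chain pulls in generated LambdaForm
> classes that can be dumped and counted:
>
>     import java.lang.invoke.MethodHandle;
>     import java.lang.invoke.MethodHandles;
>     import java.lang.invoke.MethodType;
>
>     // Run with:
>     //   java -Djava.lang.invoke.MethodHandle.DUMP_CLASS_FILES=true \
>     //        ObserveLambdaForms
>     // then count the spun classes in the DUMP_CLASS_FILES/ directory.
>     public class ObserveLambdaForms {
>         static String greet(String who) { return "hello " + who; }
>
>         public static void main(String[] args) throws Throwable {
>             MethodHandles.Lookup l = MethodHandles.lookup();
>             MethodHandle mh = l.findStatic(ObserveLambdaForms.class,
>                     "greet",
>                     MethodType.methodType(String.class, String.class));
>             // Each combinator below may spin further LambdaForm classes.
>             mh = MethodHandles.insertArguments(mh, 0, "world");
>             mh = MethodHandles.filterReturnValue(mh, l.findVirtual(
>                     String.class, "toUpperCase",
>                     MethodType.methodType(String.class)));
>             System.out.println((String) mh.invokeExact());
>         }
>     }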
>
> One solution that was proposed for 8u40 was JEP 210 (lambda form
> caching), which does indeed keep footprint down, but performance suffers
> mightily, since the same lambda form snippet can now be used at two
> completely different call sites, which brings us cache pollution.
> Vladimir is on vacation, but even though this brings metaspace down, I
> don't think performance is back yet.
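>
> To illustrate the cache-pollution worry with a stand-in (plain Java of my
> own, not actual lambda form code): one snippet shared by two unrelated
> call sites means the type profile in the shared body sees both types,
> whereas two private copies would each have stayed monomorphic:
>
>     interface Shape { double area(); }
>
>     final class Circle implements Shape {
>         final double r;
>         Circle(double r) { this.r = r; }
>         public double area() { return Math.PI * r * r; }
>     }
>
>     final class Square implements Shape {
>         final double s;
>         Square(double s) { this.s = s; }
>         public double area() { return s * s; }
>     }
>
>     final class SharedSnippet {
>         // Stand-in for a cached lambda form body reused by two call sites.
>         static double apply(Shape sh) {
>             return sh.area(); // the receiver profile here goes bimorphic
>         }
>         // Call site A only ever sees Circle, call site B only Square;
>         // with one shared copy of apply(), neither stays monomorphic.
>         static double siteA(Circle c) { return apply(c); }
>         static double siteB(Square q) { return apply(q); }
>     }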
>
> Even if everything inlines correctly and deeply (which doesn't happen all
> the time in C2 for long chains), we still have the problem of holding on
> to the synthetic bytecode/metaspace constructs for the lambda forms. We
> really don't want all this bookkeeping around something that can be as
> simple as permuting a couple of parameters (yes, it can be more complex,
> but the same argument applies).
>
> Lambda forms were most likely introduced as a platform-independent way of
> implementing method handle combinators in 8, because the 7 native
> implementation was not very stable, but it was probably a mistake to add
> them as “real” classes instead of code snippets that can just be spliced
> in around the callsite. (I completely lack the history here, so flame me
> if I am wrong.)
>
> For both JRuby and Nashorn in the indy world, starting up a process
> generates bytecode where, say, every 5th to 10th instruction is an indy.
> Lambda code is not that bad, but it can also look pretty hairy. Now, if
> runtime linkage for each of these callsites requires metaspace, hidden
> bytecode generation, anonymous internal classes, and the rest of the
> combinatorial explosion Charlie describes, we are setting ourselves up
> for really bad scalability in such an arena. And metaspace, of course,
> goes through the roof. Custom runtime linkage is still slow, but at least
> it only happens once. We don't want to keep adding even more overhead to
> that.
>
> For 9, it seems that we need a way to implement an indy that doesn't
> result in class generation and installation of anonymous runtime classes.
> Note that _class installation_ as such is also a large overhead in the
> JVM - even more so when we regenerate more code to get more optimal
> types. I think we need to move from separate classes to inlined code, or
> something that requires minimum bookkeeping. I think this may be subject
> to profile pollution as well, but I haven't been able to get my head
> around the ramifications yet.
>
> There are various problems here as well (for example, several of the
> java.lang.invoke combinators create boxing and arrays and do
> arraycopies), stuff that would need to be optimized away, or it will
> punish any indy call. In such an environment we can cheat with
> annotations like @ExplodeThisArrayToLocals or @NoSafePoint or similar
> magic annotations, because after all we own the code we splice in.
> (Solving local escape analysis, which is really the problem in the
> generic form of callsite IR, has so far not been very successful in C2.)
> But even if C2 is a little bit legacy, making things like this hard, we
> might still be able to cheat for the limited world/range that is an indy
> callsite and teach C2 some magic. Lots of the early performance problems
> Attila and I had in Nashorn came from e.g. MethodHandles.catchException,
> which had to be rewritten to avoid boxing, but I'm talking about a more
> generic mechanism than this.
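>
> For reference, a sketch of typical usage of the combinator in question
> (my example code; as noted above, this guard caused us boxing problems
> on 8 until our use of it was rewritten):
>
>     import java.lang.invoke.MethodHandle;
>     import java.lang.invoke.MethodHandles;
>     import java.lang.invoke.MethodType;
>
>     public class CatchExceptionDemo {
>         static int parse(String s) { return Integer.parseInt(s); }
>
>         // The handler receives the exception followed by the target's
>         // arguments.
>         static int fallback(NumberFormatException e, String s) {
>             return -1;
>         }
>
>         public static void main(String[] args) throws Throwable {
>             MethodHandles.Lookup l = MethodHandles.lookup();
>             MethodHandle target = l.findStatic(CatchExceptionDemo.class,
>                     "parse",
>                     MethodType.methodType(int.class, String.class));
>             MethodHandle handler = l.findStatic(CatchExceptionDemo.class,
>                     "fallback", MethodType.methodType(int.class,
>                             NumberFormatException.class, String.class));
>             MethodHandle guarded = MethodHandles.catchException(
>                     target, NumberFormatException.class, handler);
>             System.out.println((int) guarded.invokeExact("42"));   // 42
>             System.out.println((int) guarded.invokeExact("oops")); // -1
>         }
>     }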
>
> Having said this, I don't think we can solve the indy scalability
> problems in the current JITs without getting away from the class
> generation/bytecode spewing that results from an indy callsite being
> compiled. Caching lambda forms brings the memory footprint down, but I am
> already quite worried that it will get nowhere near the performance that
> is needed, due to profile pollution.
>
> If 9 is to be a platform that supports indy and runs C1 and C2,
> invokedynamic callsites in the JVM, at least in C2, will need some
> serious love - perhaps as described above.
>
> Paul, Vladimir, Rickard - do you have any comments? We had a good
> discussion a couple of weeks ago about profiling callsites in SCA and
> what to do with such callsites. I'd prefer it if one of you guys wrote
> down a bit of our thoughts from that session, as I am again afraid of
> making a damn fool of myself among the genius engineers on this list.
> Also cc:ing Fredrik.
>
> /M
>
> On 25 Aug 2014, at 10:07, Jochen Theodorou <blackd...@gmx.org> wrote:
>
> > On 24.08.2014 20:33, Charles Oliver Nutter wrote:
> >> On Sun, Aug 24, 2014 at 12:55 PM, Jochen Theodorou <blackd...@gmx.org>
> >> wrote:
> >>> afaik you can set how many times a lambda form has to be executed
> >>> before it is compiled... what happens if you set that very low...
> >>> like 1 and disable tiered compilation?
> >>
> >> Forcing all handles to compile early has the same negative
> >> effect... most are only called once, and the overhead of reifying
> >> them outweighs the cost of interpreting them.
> >>
> >> I need to play with it more, though. The property I think you're
> >> referring to did not appear to help us much.
> >
> > I see it as a tradeoff. Yes, one-time-visited callsites may run even
> > slower with this, but I think that needs to be measured first. And
> > secondly, you will be up to speed much faster than before, which can
> > maybe outweigh the initial cost. I am not saying 1 is an ideal value,
> > but it should be played with.
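> >
> > (For reference: the knob being discussed is, I believe, the internal
> > and unsupported OpenJDK 8 property
> > java.lang.invoke.MethodHandle.COMPILE_THRESHOLD, e.g.
> >
> >     java -Djava.lang.invoke.MethodHandle.COMPILE_THRESHOLD=1 \
> >          -XX:-TieredCompilation MyApp
> >
> > where MyApp is a placeholder; the exact property name is from memory.)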
> >
> >>>> We obviously still love working with OpenJDK, and it remains the best
> >>>> platform for building JRuby (and other languages). However, our
> >>>> failure as a community to address these startup/warmup issues is
> >>>> eventually going to kill us. Startup time remains the #1 complaint
> >>>> about JRuby, and warmup time may be a close second.
> >>>
> >>> how do normal Ruby startup times compare to JRuby for a Rails app?
> >>
> >> Perhaps 10x faster startup across the board in C Ruby. With tier 1 we
> >> can get it down to 5x or so. It's incredibly frustrating for our
> >> users.
> >
> > I guess for a Rails app that is indeed pretty bad.
> >
> >>> All in all, the situation for the Groovy world is quite different, I
> >>> would say.
> >>
> >> I'd guess that developers in the Groovy world typically do all their
> >> development in an IDE, which can keep a runtime environment available
> >> all the time. Contrast this to pretty much everyone not from a Java or
> >> C# background, where their IDE is a text editor and a command line.
> >
> > Now I feel almost insulted ;) I get scolded so often for treating my
> > IDE only as a better text editor... I agree in general though.
> > I think this is not so much a Groovy thing as a Java thing, though. If
> > you do Grails, you run Spring+Apache most of the time. So you don't
> > start a new server, you deploy to it. And even that may (in development
> > mode) work by just keeping the class files in a certain directory. Unit
> > testing is maybe different. But even there, you don't start a new JVM
> > for each test. Maybe not even for each test suite. Groovy generally
> > goes with the JVM instance here. Actually, it is not even easily
> > possible to spawn separate Groovy environments in the same JVM. In
> > Grails a new environment might be spawned on a per-suite basis.
> >
> > So yes, there are instances kept around, but imho this is already how
> > it is done in the Java world. We do nothing special here most of the
> > time. But of course this is related to the slow startup speed of the
> > JVM. groovy-core has around 7k tests; if we had to create a new JVM for
> > each of them, it would easily take over an hour to execute. With Groovy
> > startup included, probably more than 6 hours.
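> >
> > (Rough arithmetic behind those numbers, with per-start costs that are
> > my own guesses: 7000 starts at ~0.5 s of bare JVM startup each is
> > ~3500 s, about an hour; at ~3 s per start once the Groovy runtime
> > bootstrap is included, it is ~21000 s, close to 6 hours.)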
> >
> > Yes, this is a result of the great startup problem. But the Java
> > community finds ways around it. The problem is that in JRuby you have
> > to try to force a Ruby mechanism onto the JVM, and this works properly
> > only if the JVM can behave as much like Ruby as needed. And with regard
> > to startup times, it surely does not.
> >
> > bye Jochen
> >
> > --
> > Jochen "blackdrag" Theodorou - Groovy Project Tech Lead
> > blog: http://blackdragsview.blogspot.com/
> > german groovy discussion newsgroup: de.comp.lang.misc
> > For Groovy programming sources visit http://groovy-lang.org
> >