John,

This is useful, thanks. More questions will probably follow after I do some more homework.
Mandy

> On Aug 24, 2016, at 10:07 AM, John Rose <john.r.r...@oracle.com> wrote:
>
> On Aug 22, 2016, at 9:30 PM, Mandy Chung <mandy.ch...@oracle.com> wrote:
>>
>> We need to follow up on this issue to understand what the interpreter and
>> compiler do for this unused slot and whether it is always zeroed out.
>
> These slot pairs are a curse, in the same league as endian-ness.
>
> Suppose a 64-bit long x lives in L[0] and L[1]. Now suppose
> that the interpreter (as well it might) has adjacent 32-bit words
> for those locals. There are four reasonable conventions for
> apportioning the bits of x into L[0:1]. Call HI(x) the arithmetically
> high part of x, and LO(x) the other part. Also, call FST(x) the
> lower-addressed 32-bit component of x, when stored in memory,
> and SND(x) the other part. Depending on your machine's
> endian-ness, HI=FST (big-endian, e.g. SPARC) or HI=SND
> (little-endian, e.g. x86). For portable code there are obviously
> four ways to pack L[0:1]. I've personally seen them all,
> sometimes as hard-to-find VM bugs.
>
> We're just getting started, though. Now let the interpreter generously
> allocate 64 bits to each local. The above four cases are still possible,
> but now we have four 32-bit storage units to play with. That makes
> (if you do the math) 4x3=12 more theoretically possible ways to
> store the bits of x into the 128 bits of L[0:1]. I've not seen all 12,
> but there are several variations that HotSpot has used over time.
>
> Confused yet? There's more: all current HotSpot implementations
> grow the stack downward, which means that the address of L[0]
> is *higher* than that of L[1]. This means that the pair of storage units
> for L[0:1] can be viewed as a memory buffer, but the bits of L[1]
> come at a lower address. (Once we had a tagged-stack interpreter
> in which there were extra tag words between the words of L[0]
> and L[1], for extra fun. We got tired of that.)
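The HI/LO versus FST/SND distinction above can be sketched in Java. This is an illustrative model only, not HotSpot code; all names in it are invented for the example. It shows the arithmetic split of a long into 32-bit halves, two of the four packing conventions for a slot pair, and why mixing conventions produces the classic swapped-halves bug:

```java
// Illustrative sketch only -- not HotSpot code. Models a pair of 32-bit
// local slots L[0:1] as an int[2] and shows two packing conventions.
public class SlotPacking {
    // HI(x): the arithmetically high 32 bits; LO(x): the low 32 bits.
    static int hi(long x) { return (int) (x >>> 32); }
    static int lo(long x) { return (int) x; }

    // Convention A: HI(x) in L[0], LO(x) in L[1].
    static int[] packHiFirst(long x) { return new int[] { hi(x), lo(x) }; }

    // Convention B: LO(x) in L[0], HI(x) in L[1].
    static int[] packLoFirst(long x) { return new int[] { lo(x), hi(x) }; }

    // Reassembly must agree with the packing convention, or the halves
    // come back swapped -- the hard-to-find VM bug mentioned above.
    static long unpackHiFirst(int[] slots) {
        return ((long) slots[0] << 32) | (slots[1] & 0xFFFF_FFFFL);
    }

    public static void main(String[] args) {
        long x = 0x11223344_55667788L;
        // Round trip under one consistent convention works.
        System.out.println(unpackHiFirst(packHiFirst(x)) == x);  // true
        // Mixing conventions silently swaps the halves.
        System.out.println(unpackHiFirst(packLoFirst(x)) != x);  // true
    }
}
```

The same round-trip discipline applies to all four (or twelve) conventions: whatever packs the slots must be the exact inverse of whatever unpacks them.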
> There's one more annoyance: the memory block located at L[0:1]
> must be at least 64 bits wide, but it need not be 64-bit aligned,
> if the size of a local slot is 32 bits. So on machines that cannot
> perform unaligned 64-bit accesses, the interpreter needs to load
> and store 64-bit values as 32-bit halves. But we can put that
> aside for now; that's a separable cost borne by 32-bit RISCs.
>
> How do we simplify this? For one thing, view all references
> to HI and LO with extreme suspicion. That goes for misleadingly
> simple terms like "the low half of x". On Intel everybody
> knows that's also FST (the first memory word of x), and
> nods in agreement, and then when you port to SPARC
> (that was my job) the nods turn into glassy-eyed stares.
>
> Next, don't trust L[0] and L[1] to work like array elements.
> Although the bytecode interpreter refers directly to L[0]
> and indirectly to L[1] when storing x, realize that you
> don't know exactly how those guys are laid out in memory.
> The interpreter will make some local decision to avoid
> the obvious-in-retrospect bug of storing 64 bits to L[0]
> on a 32-bit machine. The decision might be to form the
> address of L[1] and treat *that* as the base address of
> a memory block. The more subtle and principled thing
> to do would be to form the address of the *end* of L[0]
> and treat that as the *end* address of a memory block.
> The two approaches are equivalent on a 32-bit machine,
> but on a 64-bit machine one puts the payload only
> in L[1] and the other only in L[0].
>
> Meanwhile, the JIT, with its free-wheeling approach
> to storage allocation, will probably try its best to ignore
> and forget stupid L[1], allocating a right-sized register
> or stack slot for L[0].
>
> Thus the interpreter and JIT can have independent internal
> conventions for how they assign storage units to L[0:1] and
> how they use those units to store a 64-bit value.
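The base-of-L[1] versus end-of-L[0] subtlety can be made concrete with a little address arithmetic. The sketch below is a hypothetical model (the layout formula and names are invented for the example, not taken from HotSpot): it places local L[i] at a lower address for larger i, as in a downward-growing frame, and compares the two addressing strategies for a 64-bit payload under 32-bit and 64-bit slot sizes:

```java
// Illustrative sketch only. Hypothetical model of downward-growing locals:
// L[i] occupies the address range [top - (i+1)*slotSize, top - i*slotSize).
public class SlotAddressing {
    static long addressOf(long top, int i, int slotSize) {
        return top - (long) (i + 1) * slotSize;
    }

    // Strategy 1: treat the address of L[1] as the *base* of the
    // 64-bit memory block.
    static long baseViaL1(long top, int slotSize) {
        return addressOf(top, 1, slotSize);
    }

    // Strategy 2: treat the *end* of L[0] as the *end* of the
    // 64-bit memory block, so the base is 8 bytes before that.
    static long baseViaEndOfL0(long top, int slotSize) {
        return addressOf(top, 0, slotSize) + slotSize - 8;
    }

    public static void main(String[] args) {
        long top = 0x1000;
        // 32-bit slots: both strategies name the same 8-byte block.
        System.out.println(baseViaL1(top, 4) == baseViaEndOfL0(top, 4)); // true
        // 64-bit slots: they diverge -- strategy 1 puts the payload in
        // L[1], strategy 2 puts it in L[0].
        System.out.println(baseViaL1(top, 8) == baseViaEndOfL0(top, 8)); // false
    }
}
```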
> Those independent schemes have to be reconciled along mode-
> change paths: C2I and I2C adapters, deoptimization, and
> on-stack replacement (= reoptimization).
>
> The vframe_hp code does this. A strong global convention
> would be best, such as always using L[0] and always storing
> all of x in L[0] if it fits, else SND(x) in L[0] and FST(x) in L[1].
> I'm not sure (and I doubt) that we are actually that clean.
>
> Any reasonable high-level API for dealing with this stuff
> will do as the JIT does, and pretend that, whatever the
> size of L[0] is physically, it contains the whole value assigned
> to it, without any need to inspect L[1]. That's the best policy
> for virtualizing stack frames, because it aligns with the
> plain meaning of bytecodes like lload_0, which don't mention
> L[1]. The role of L[1] is to provide "slop space" for internal
> storage in a tiny interpreter; it has no external role. The
> convention used in HotSpot and the JVM verifier is to
> assign a special type to L[1], "Top", which means "do not
> look at me; I contain no bits". A virtualized API which
> produces a view on such an L[1] needs to return some
> default value (if pressed), and to indicate that the slot
> has no payload.
>
> HTH
>
> — John
>
> P.S. If all goes well with Valhalla, we will probably get
> rid of slot pairs altogether in a future version of the JVM
> bytecodes. They spoil generics over longs and doubles.
> The 32-bit implementations of JVM interpreters will have
> to do extra work, such as using 64-bit slot sizes for methods
> that work with longs or doubles, but it's worth it.
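The recommended policy for a virtualized frame API can be sketched as follows. This is a hypothetical API invented for illustration, not a real HotSpot or JDK interface: the whole 64-bit value is reported at L[0], while L[1] carries the verifier-style "Top" type and reports no payload:

```java
// Illustrative sketch only -- a hypothetical virtualized-frame view.
// Policy: L[0] holds the whole long; L[1] is typed TOP ("do not look
// at me; I contain no bits") and exposes no payload.
import java.util.OptionalLong;

public class VirtualFrameView {
    enum SlotType { LONG, TOP }

    record Slot(SlotType type, long value) {
        // A TOP slot has no payload; if pressed, callers get "empty".
        OptionalLong payload() {
            return type == SlotType.TOP ? OptionalLong.empty()
                                        : OptionalLong.of(value);
        }
    }

    // Store a long into the virtual L[0:1] pair.
    static Slot[] storeLong(long x) {
        return new Slot[] {
            new Slot(SlotType.LONG, x),  // L[0]: the whole value
            new Slot(SlotType.TOP, 0L)   // L[1]: slop space, no payload
        };
    }

    public static void main(String[] args) {
        Slot[] locals = storeLong(42L);
        System.out.println(locals[0].payload());  // OptionalLong[42]
        System.out.println(locals[1].payload());  // OptionalLong.empty
    }
}
```

Using an empty optional for the Top slot lets callers distinguish "no payload" from a slot that legitimately contains zero, which matches the advice that the view must *indicate* the absence rather than just hand back a default bit pattern.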