Do we plan to make objects 8-byte aligned in Q3? AFAIU this is the only way to avoid the lock prefix and the performance degradation, and it does not require big changes in the GC: every object needs a size that is a multiple of 8, and every memory area allocated by the GC needs to be aligned to 8 bytes. Am I missing something here?
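As a rough illustration of the two GC-side requirements above (sizes rounded to a multiple of 8, areas starting on an 8-byte boundary), here is a minimal sketch in C. The names OBJ_ALIGNMENT, align_up and is_aligned are hypothetical and are not taken from the Harmony GC sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the object alignment proposed in this message. */
#define OBJ_ALIGNMENT ((size_t)8)

/* Round an object size up to the next multiple of OBJ_ALIGNMENT. */
static size_t align_up(size_t size)
{
    /* The mask trick below only works for power-of-two alignments. */
    assert((OBJ_ALIGNMENT & (OBJ_ALIGNMENT - 1)) == 0);
    return (size + (OBJ_ALIGNMENT - 1)) & ~(OBJ_ALIGNMENT - 1);
}

/* Check that an address (for example the start of an allocation area)
 * lies on an OBJ_ALIGNMENT boundary. */
static int is_aligned(const void *p)
{
    return ((uintptr_t)p & (OBJ_ALIGNMENT - 1)) == 0;
}
```

If every area starts aligned and every object size passes through something like align_up, each object start stays on an 8-byte boundary, so a 64-bit field at an 8-byte-aligned offset inside the object is itself 8-byte aligned.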
Aligning objects may be less work than adding temporary workarounds to the JIT in place of the simple XMM moves we already have.

On 6/1/07, Pavel Ozhdikhin <[EMAIL PROTECTED]> wrote:
On 6/1/07, Weldon Washburn <[EMAIL PROTECTED]> wrote:
> On 31 May 2007 00:52:00 +0400, Egor Pasko <[EMAIL PROTECTED]> wrote:
> >
> > On the 0x2E6 day of Apache Harmony Xiao-Feng Li wrote:
> > > On 5/30/07, George Timoshenko <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > I had a question in the JIRA about this issue: why don't we use
> > > > > "lock" prefix for the atomic access?
> > > >
> > > > well...
> > > >
> > > > Originally we split all 64-bit memory access into 2 ones of 32-bit.
> > > > It does not make sense to set #LOCK prefix for them. (there is a gap
> > > > between)
> > > >
> > > > We can only set #LOCK to some instruction that reads/writes whole
> > > > 64 bits.
> > > >
> > > > The bad thing is the only instruction (according to IA32 spec) we can
> > > > set #LOCK to is CMPXCHG8B (MOVQ, MOVSD and any others can not be used
> > > > with #LOCK)
> > > >
> > > > This monster (CMPXCHG8B) requires 4 registers:
> > > >
> > > > EAX
> > > > EBX
> > > > ECX
> > > > EDX
> > > >
> > > > and (FLAGS) also.
> > > >
> > > > I am not sure CMPXCHG8B usage will be faster than making volatile
> > > > fields always synchronized (artificially)
> > >
> > > George, I believe it should be much faster than synchronized block,
> > > since it is non-blocking with contended locks. To use cmpxchg, you
> > > need a loop to check the return result till it succeeds. With
> > > synchronized block, the thread will go to sleep till being woken up by
> > > the releasing thread.
> >
> > hm, if I am not mistaken most of the time that would be a spin lock
> > with the current thread manager. So, I cannot bet which way is
> > faster. Maybe, some expert in TM can tell for sure?
>
> This kind of stuff is always empirical. The task is to build, measure, post
> the results. The wild cards are the workload and the hardware. Different
> combos will lead to different conclusions.
>
> Having said the above, my hunch is to go with CMPXCHG8B for right now. The
> main motivation is that this decouples register assignment from the jvm
> thread subsystem thus makes things easier to debug. This is goodness. Also
> running exhaustive studies of different workloads, different platforms is
> not something of high value for a JVM at such an early stage of
> development. In other words, do this analysis once we get real workloads
> like specjappserver running. As already noted, it should be easy to
> re-implement when the time is right.
>
> Interesting background material --- From Jeremy Manson's "The Java Memory
> Model", POPL 2005, section 2.3 it says, "In order to allow for non-blocking
> techniques that communicate between threads, we also want to allow the use
> of _volatile_ variables to synchronize information between threads. The
> properties of volatile variables arose from the need to provide a way to
> communicate between threads without the overhead of ensuring mutual
> exclusion." While this does not dictate a solution, it sort of suggests
> using opcodes (lockxxx) instead of bytecodes (monenter/exit).

Adding monenter/monexit pair in the place where the author of the code did not intend to put them may lead to deadlock. So, I'm +1 for prototyping with CMPXCHG8B first.

Thanks,
Pavel

> > Anyway, both implementations do not seem to be very hard, we could try
> > both ways...
> >
> > --
> > Egor Pasko
>
> --
> Weldon Washburn
-- Mikhail Fursov
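
For readers following the quoted thread, the "loop to check the return result till it succeeds" can be sketched with C11 atomics as below. This is only an illustration under stated assumptions: the helper names volatile_store64 and volatile_load64 are hypothetical, nothing here is taken from the Harmony JIT, and whether a compiler lowers these calls to LOCK CMPXCHG8B on IA-32 depends on the toolchain and target flags (some 32-bit toolchains may also need libatomic at link time).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: store a 64-bit value with a compare-and-swap
 * retry loop, as described in the quoted discussion. */
static void volatile_store64(_Atomic uint64_t *field, uint64_t new_value)
{
    /* The field is expected to be 8-byte aligned. */
    assert(((uintptr_t)field & 7u) == 0);

    uint64_t expected = 0; /* initial guess at the current contents */
    while (!atomic_compare_exchange_weak(field, &expected, new_value)) {
        /* On failure, expected now holds the current contents; retry. */
    }
}

/* Hypothetical helper: read a 64-bit value using only compare-and-swap.
 * If the field equals 0, the CAS rewrites 0 (no visible change);
 * otherwise it fails and copies the current contents into expected. */
static uint64_t volatile_load64(_Atomic uint64_t *field)
{
    uint64_t expected = 0;
    atomic_compare_exchange_strong(field, &expected, 0);
    return expected;
}
```

The store never blocks, but it may retry under contention, which is why the thread suggests measuring it against the synchronized alternative rather than assuming either one is faster.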
