Hi Qiong,

It's good that we've agreed on the hardware issues; now we can discuss the method itself in detail. I read Chilimbi's paper and think it can really work. But some points in the article are unclear to me, and I want to discuss them here.

1) Chilimbi's algorithm does not use hardware counters at all. You said that you want to "use the hardware performance counter to get the data reference trace". I agree that PMU counters could help to optimize the algorithm (e.g. we could do trace profiling only for hot methods with a high cache-miss rate), but I do not understand how you want to collect a trace with PMU counters. The PMU usage I know of is time-based or event-based sampling. Sampling means that you get only 1 event out of thousands, so how do we get a trace in this case? In any case, the PMU could be useful to avoid profiling methods with a low cache-miss rate.

2) Chilimbi does not investigate how his algorithm behaves when the GC moves objects in memory. But since profiling is active for the whole time of program execution, I think the trace cache will be tuned to the new object locations very quickly.

3) I do not understand how Chilimbi's algorithm deals with memory accesses in loops (an example is iterating over a LinkedList). The SEQUITUR examples work nicely when you have an alphabet with a limited number of letters. When you access 10000 or more memory addresses from inside a loop, what will the trace look like?

-- Mikhail Fursov
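
P.S. To make question 3 concrete, here is a minimal sketch of what I mean. It is my own illustration, not code from Chilimbi's paper and not a SEQUITUR implementation. The class name TraceSketch, the fake node addresses, and the node size of 32 bytes are all assumptions made up for the example. It simulates the reference trace of walking a 10000-node list and counts how many adjacent address pairs (digrams) repeat, since SEQUITUR only creates a grammar rule when a digram occurs more than once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TraceSketch {

    /**
     * Simulates the data reference trace of walking a list of n nodes,
     * repeated `passes` times. Addresses are hypothetical: a real trace
     * would come from the profiler, not from this formula.
     */
    static List<Long> traverse(int n, int passes) {
        List<Long> trace = new ArrayList<>();
        for (int p = 0; p < passes; p++) {
            for (int i = 0; i < n; i++) {
                trace.add(0x1000L + 32L * i); // assumed node address
            }
        }
        return trace;
    }

    /** Size of the "alphabet": number of distinct addresses in the trace. */
    static int distinctSymbols(List<Long> trace) {
        return new HashSet<>(trace).size();
    }

    /** Number of distinct adjacent pairs (digrams) that occur more than once. */
    static int repeatedDigrams(List<Long> trace) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < trace.size(); i++) {
            String digram = trace.get(i) + ">" + trace.get(i + 1);
            counts.merge(digram, 1, Integer::sum);
        }
        int repeated = 0;
        for (int c : counts.values()) {
            if (c > 1) {
                repeated++;
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        List<Long> onePass = traverse(10000, 1);
        List<Long> twoPasses = traverse(10000, 2);
        System.out.println("one pass:   symbols=" + distinctSymbols(onePass)
                + " repeated digrams=" + repeatedDigrams(onePass));
        System.out.println("two passes: symbols=" + distinctSymbols(twoPasses)
                + " repeated digrams=" + repeatedDigrams(twoPasses));
    }
}
```

In this toy model a single pass gives 10000 distinct symbols and no repeated digram, so there is nothing for a digram-based grammar to compress. Repetition appears only when the same list is walked again. So my question is whether the algorithm relies on such repeated traversals, and how large the trace and the grammar become in that case.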