Hi Peter,

> >  1. cpu_unlink_tb (exec.c)
> 
> This function is broken even for pure TCG -- we know it has a race condition.
> As I said on IRC, I think that the right thing to do is to start
> by overhauling the current TCG code so that it is:
>  (a) properly multithreaded (b) race condition free (c) well documented
>  (d) clean code
> Then you have a firm foundation you can use as a basis for the LLVM
> integration (and in the course of doing this overhaul you'll have
> figured out enough of how the current code works to be clear about
> where hooks for invalidating your traces need to go).

  I must say I totally agree with you on overhauling the current TCG code, but
my boss might not have that much patience for it. ;) If there is a plan out
there, I'll be very happy to join in.

  I read the thread about the broken tb_unlink [1], and I'm surprised that
tb_unlink is broken even in single-threaded system mode. In [1] you mentioned
that (b) could be the IO thread, but I think we don't enable the IO thread in
system mode right now. My concern is whether I can spot _all_ the places and
situations where I need to break the link between a block and a trace.
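
  To make that concern concrete, here is a rough, standalone sketch of the kind
of bookkeeping I have in mind. This is plain C, not QEMU code; every name in it
(block_trace_link, record_link, invalidate_traces_for_block) is made up. It
only illustrates the idea of remembering which blocks a trace was built from,
so that unlinking a block can also invalidate the traces built from it.

/* Toy illustration only.  Every time a trace is emitted, remember which
 * translated block(s) it was built from; every place that invalidates or
 * unlinks a block (tb_phys_invalidate, cpu_unlink_tb, ...) would then also
 * need to invalidate the traces built from that block. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LINKS 1024

typedef struct {
    uintptr_t block_pc;   /* guest PC of the translated block */
    unsigned  trace_id;   /* id of the trace built from it    */
    bool      valid;
} block_trace_link;

static block_trace_link links[MAX_LINKS];
static size_t nlinks;

/* Record a block -> trace edge when a trace is emitted. */
static void record_link(uintptr_t block_pc, unsigned trace_id)
{
    if (nlinks < MAX_LINKS) {
        links[nlinks++] = (block_trace_link){ block_pc, trace_id, true };
    }
}

/* Drop every trace that was built from the given block. */
static void invalidate_traces_for_block(uintptr_t block_pc)
{
    for (size_t i = 0; i < nlinks; i++) {
        if (links[i].valid && links[i].block_pc == block_pc) {
            links[i].valid = false;
            printf("invalidate trace %u (built from block 0x%lx)\n",
                   links[i].trace_id, (unsigned long)links[i].block_pc);
        }
    }
}

int main(void)
{
    record_link(0xc01111b8, 1);
    record_link(0xc01111d7, 2);
    invalidate_traces_for_block(0xc01111b8);
    return 0;
}

The hard part, of course, is making sure something like
invalidate_traces_for_block gets called from every such place, which is exactly
what I'm unsure about.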
 
> > The big problem is debugging.
> 
> Yes. In this sort of hotspot based design it's very easy to end up
> with bugs that are intermittent or painful to reproduce and where
> you have very little clue about which version of the code for which
> address ended up misgenerated (since timing issues mean that what
> code is recompiled and when it is inserted will vary from run to
> run). Being able to conveniently get rid of some of this nondeterminism
> is vital for tracking down what actually goes wrong.

  Misgenerated code might not be the issue here, since we have tested our
framework in LLVM-only mode; I think the problem is still the link/unlink
stuff. The first thing I noticed while lowering the threshold is that the
broken build generates a few traces (2, actually) that a working one doesn't.
When booting the Linux image downloaded from the QEMU website, the system
hangs during the boot process (see the attachment if you're interested).
Simply put, the system hangs after printing

  ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1

which turns out to be the function check_timer (arch/i386/kernel/io_apic.c). I
am not a Linux kernel expert and have no idea how to solve this. The culprit
traces begin at 0xc01111b8 and 0xc01111d7. Here is the corresponding guest
binary.

----------------
IN:
0xc01111b8:  add    0xc04fa798,%eax
0xc01111be:  mov    (%eax),%eax
0xc01111c0:  ret

----------------
IN:
0xc01111d7:  mov    $0x108,%eax
0xc01111dc:  call   0xc01111b8

I compiled the Linux kernel with debug info and without function inlining, then
ran objdump on vmlinux to see what the source code might be. I guess that
because linux-0.2.img contains other stuff besides vmlinux (the kernel image),
the addresses above can only be used as an approximation, or may even be
useless. So far I have found only one spot with (I believe) the same code
sequence as 0xc01111b8, but I can't find the other one. See below,

static inline unsigned int readl(const volatile void __iomem *addr)
{
        return *(volatile unsigned int __force *) addr;
c0214a90:       03 05 44 56 4f c0       add    0xc04f5644,%eax
c0214a96:       8b 00                   mov    (%eax),%eax
#define FSEC_TO_USEC (1000000000UL)

int hpet_readl(unsigned long a)
{
        return readl(hpet_virt_address + a);
}
c0214a98:       c3                      ret

  This is the whole story so far. :-) Any comments are welcome!

[1] http://lists.gnu.org/archive/html/qemu-devel/2011-11/msg02447.html

Regards,
chenwj

-- 
Wei-Ren Chen (陳韋任)
Computer Systems Lab, Institute of Information Science,
Academia Sinica, Taiwan (R.O.C.)
Tel:886-2-2788-3799 #1667
Homepage: http://people.cs.nctu.edu.tw/~chenwj
