As far as Queues go, the adding and popping is apparently done with a deque, which is implemented in C. The Queue class pretty much just provides blocking operations and is otherwise a very thin layer around deque. As far as synchronization primitives go, only threading.Lock is written in C; the others are pure Python, so they're not that fast, which might be one reason for Queue's slowness.
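To make the "thin layer around deque" idea concrete, here is a rough sketch of what such a layer could look like. It is not the actual Queue source; the class name ThinQueue is made up for illustration, and it only covers an unbounded put and a blocking get.

```python
import threading
from collections import deque


class ThinQueue:
    """Hypothetical minimal queue: deque for storage, Condition for blocking."""

    def __init__(self):
        self._items = deque()
        # The Condition wraps a plain Lock; wait() and notify() add the blocking.
        self._not_empty = threading.Condition(threading.Lock())

    def put(self, item):
        with self._not_empty:
            self._items.append(item)   # the actual work is done by deque
            self._not_empty.notify()   # wake one waiting consumer, if any

    def get(self):
        with self._not_empty:
            # Loop to guard against spurious or stale wakeups.
            while not self._items:
                self._not_empty.wait()
            return self._items.popleft()
```

Everything besides append and popleft here is synchronization, so that is where any overhead would have to come from.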
As far as writing a custom C module goes, you could probably leave most of the work to deque and just implement the blocking yourself. If you stick to a simple lock primitive, you can keep it portable and use Python's abstracted thread interface, with its amazing choice of two whole functions: acquire and release.
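Before going to C, the locking logic can be tried out in plain Python. The sketch below is only one possible way to get a blocking get out of nothing but acquire and release; the class name LockOnlyQueue and the two-lock scheme are assumptions for illustration, not something taken from the Queue module. It relies on the fact that a threading.Lock may be released by a thread other than the one that acquired it.

```python
import threading
from collections import deque


class LockOnlyQueue:
    """Hypothetical unbounded queue built only on Lock.acquire/release."""

    def __init__(self):
        self._items = deque()
        self._mutex = threading.Lock()     # protects the deque
        self._nonempty = threading.Lock()  # held whenever the queue is empty
        self._nonempty.acquire()           # the queue starts out empty

    def put(self, item):
        self._mutex.acquire()
        try:
            was_empty = not self._items
            self._items.append(item)
            if was_empty:
                # Empty -> non-empty transition: let one consumer through.
                self._nonempty.release()
        finally:
            self._mutex.release()

    def get(self):
        self._nonempty.acquire()           # blocks while the queue is empty
        self._mutex.acquire()
        try:
            item = self._items.popleft()
            if self._items:
                # Still items left: let the next consumer through.
                self._nonempty.release()
            return item
        finally:
            self._mutex.release()
```

A C version would presumably follow the same pattern, with the two locks coming from Python's thread abstraction and the storage left to deque.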