What's wrong is the 1,135,775 calls to "method 'poll' of
'select.epoll' objects".
I was afraid you were going to say that. :-)
With five browsers waiting for messages over 845 seconds, that works
out to each waiting browser inducing 269 epolls per second.
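The per-browser rate can be checked with a few lines of arithmetic. This is only a sanity check of the figures quoted above, assuming Python 3 division; the variable names are made up for illustration.

```python
# Figures taken from the profile discussed above.
total_epoll_calls = 1135775   # calls to select.epoll.poll
elapsed_seconds = 845         # length of the profiled run
waiting_browsers = 5          # clients holding a long poll open

# Calls per second, split evenly across the waiting browsers.
per_browser_per_second = total_epoll_calls / elapsed_seconds / waiting_browsers
print(round(per_browser_per_second))
```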
Almost equally important is what the problem is *not*. The problem is
*not* spending the vast majority of time in epoll; that's *good* news.
The problem is *not* that CPU load goes up linearly as we connect more
clients. This is an efficiency problem, not a scaling problem.
So what's the fix? I'm not a Tornado user; I don't have a patch.
Obviously Laszlo's polling strategy is not performing well, and the
solution is to adopt the event-driven approach that epoll and Tornado
do well.
Actually, I have found a way to overcome this problem, and it seems to
be working. Instead of calling add_timeout from every request, I save
the request objects in a list and run a "message distributor"
service in the background that routes messages to clients and finishes
their long poll requests when needed. The main point is that the
"message distributor" has a single entry point, which is called back at
fixed intervals, so the number of callbacks per second does not increase
with the number of clients. Now the CPU load is about 1% with one
client, and it is the same with 15 clients, while the response time is
unchanged (50-100 msec). It is efficient enough for me.
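Roughly, the idea looks like the sketch below. It is not my actual code, only a minimal illustration without any Tornado dependency; the class name, the method names and the `finish` callback are hypothetical, and the timeout value is an assumption.

```python
import time
from collections import defaultdict, deque


class MessageDistributor:
    """Single-entry-point router for long-poll requests (illustrative sketch)."""

    def __init__(self, poll_timeout=30.0, clock=time.monotonic):
        self.poll_timeout = poll_timeout   # assumed value, tune as needed
        self.clock = clock
        self.waiting = []                  # (client_id, finish, deadline)
        self.pending = defaultdict(deque)  # client_id -> queued messages

    def add_waiter(self, client_id, finish):
        """Park a long-poll request; finish(messages) completes it later."""
        deadline = self.clock() + self.poll_timeout
        self.waiting.append((client_id, finish, deadline))

    def post(self, client_id, message):
        """Queue a message; it is delivered on the next tick."""
        self.pending[client_id].append(message)

    def tick(self):
        """The one periodic callback, however many clients are waiting."""
        now = self.clock()
        # Swap the list out first so finish() may safely call add_waiter().
        waiting, self.waiting = self.waiting, []
        for client_id, finish, deadline in waiting:
            queue = self.pending.get(client_id)
            if queue:
                finish(list(queue))   # deliver everything queued so far
                queue.clear()
            elif now >= deadline:
                finish([])            # timed out: answer with an empty list
            else:
                self.waiting.append((client_id, finish, deadline))
```

With Tornado, tick() could be driven by a single timer, for example tornado.ioloop.PeriodicCallback(distributor.tick, 50).start(), so the event loop gets one callback per interval instead of one per waiting request. The 50 msec interval is also just an assumption here.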
I understand that most people take a different approach: they issue a fast
poll request from the browser every 2 seconds or so. But this is not
good for me, because then it can take 2 seconds to send a message from
one browser to another, which is not acceptable in my case. Implementing
long polls with a threaded server would be trivial, but a threaded
server cannot handle 100+ simultaneous (long-running) requests, because
that would require 100+ threads to be running.
This central "message distributor" concept seems to be working. I pay
about 1-2% CPU overhead for being able to send messages from one
browser to another within 100 msec, which is fine.
I could not have done this without your help.
Thank you!
Laszlo
--
http://mail.python.org/mailman/listinfo/python-list