Because Plucker takes dramatically longer than other scrapers on my system,
I decided to profile it. I'm a bit surprised by the results and would love
some comments. The results are probably very Windows-specific; I'm running
Windows 2000. The machine has an 800MHz processor, sits on a fairly fast
DSL line, and had nothing else running at the time.

The really big surprise to me was that the largest share of the time went
to communications. About 4% was spent in "connect" calls, more than 26% in
socket read calls, and more than 7% in socket readline calls. This is
despite Plucker using very little network bandwidth, and only sporadically.

About 8% went to "message" status updates; clearly these aren't
super-efficient on Windows, though I did have it on Verbose. Most of the
rest was spread out among text analysis tasks.

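
A profile of this kind can be collected with Python's standard cProfile
and pstats modules. The sketch below is only an illustration, not the setup
used for the numbers above: run_job and parse_text are hypothetical
stand-ins for a real Plucker run, and profile_call is a helper name made up
for this example.

```python
import cProfile
import io
import pstats


def parse_text(n):
    # Hypothetical stand-in for the text-analysis work.
    return sum(len(str(i)) for i in range(n))


def run_job():
    # Hypothetical stand-in for a whole scraping run.
    return parse_text(20000)


def profile_call(func):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time and keep the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()


if __name__ == "__main__":
    _, report = profile_call(run_job)
    print(report)
```

Dividing a function's time in the report by the total run time gives
percentages like the ones quoted above.
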
I'm not sure what attention, if any, this merits. Speeding up the socket
calls would certainly change the shape of the profile dramatically, as
would retrieving several pages at once in separate threads. But the socket
functions live in the Python library and below, and on a cursory
inspection I didn't spot threading support in Python (the standard library
does ship a "threading" module, so that may just be my oversight). And
while I could write a faster C library for Python that grabs several pages
at once and deposits them on disk when ready, the effort involved leads to
a more fundamental question:

Is this really a problem?

I looked into it because of a Usenet post, not because it was bugging
me. I have no problem firing off a Pluck and doing something else for five
minutes. David is pretty adamant it's not a problem under Linux, so it
could be Windows-specific. Does it merit any real attention, given that
every solution would intrude on the elegance of the architecture?

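
On the threading point: Python's standard library includes a "threading"
module, so several pages can be retrieved at once without a C extension.
What follows is a rough sketch under stated assumptions, not Plucker's
actual spider code. The names fetch_url, fetch_all, and max_threads are
hypothetical, and the sketch assumes each URL appears only once.

```python
import threading
import urllib.request


def fetch_url(url, timeout=30):
    # Blocking download of one page; the thread mostly waits on the socket.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_url, max_threads=4):
    """Fetch every URL using up to max_threads worker threads.

    Returns a dict mapping each URL to its content, or to the
    exception raised while fetching it.
    """
    results = {}
    lock = threading.Lock()
    pending = list(urls)

    def worker():
        while True:
            # Take the next URL under the lock so no URL is fetched twice.
            with lock:
                if not pending:
                    return
                url = pending.pop()
            try:
                data = fetch(url)
            except Exception as exc:
                data = exc
            with lock:
                results[url] = data

    threads = [threading.Thread(target=worker) for _ in range(max_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because a thread blocked on a socket read does not hold up the others, the
waiting on separate pages overlaps. Whether that is worth the extra
complexity is the same question as above.
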
Tony McNamara
_______________________________________________
plucker-list mailing list
http://lists.rubberchicken.org/mailman/listinfo/plucker-list