Because Plucker takes dramatically longer than other scrapers on my system, 
I decided to profile it.  I'm a bit surprised by the results and would love 
some comments.  The results are probably very Windows-specific; I'm running 
Windows 2000 on an 800MHz processor, over a pretty fast DSL line, with 
nothing else running at the time.

The really big surprise to me was that the largest single share of the time 
is spent in communications.  About 4% was spent in "connect" calls, more 
than 26% in socket read calls, and more than 7% in socket readline calls, 
so at least 37% in total.  This is despite Plucker using very little 
network bandwidth, and that only sporadically.

About 8% was on "message" status updates; clearly this isn't 
super-efficient in Windows, but I did have it on Verbose.  Most of the rest 
was spread out among text analysis tasks.
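
For anyone who wants to reproduce this kind of breakdown, here is a minimal 
sketch using the standard cProfile and pstats modules from current Python 
(not necessarily what was used for the numbers above).  The function 
fetch_pages is a hypothetical stand-in for the real retrieval work, not 
Plucker's actual entry point.

```python
import cProfile
import io
import pstats
import time


def fetch_pages():
    """Hypothetical stand-in for the real retrieval work (not Plucker's API)."""
    time.sleep(0.01)  # simulates time spent blocked on a socket
    return sum(i * i for i in range(1000))  # simulates text analysis


def profile_call(func):
    """Run func under the profiler and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the costliest call chains come first,
    # and limit the report to the top 10 entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


if __name__ == "__main__":
    _, report = profile_call(fetch_pages)
    print(report)
```

Sorting by cumulative time is what makes socket waits stand out, since 
the time blocked in a read is charged to every caller above it.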

I'm not sure what, if any, attention this merits.  Certainly improving the 
socket speed would dramatically change the shape of the profile, as would 
allowing several threads to retrieve pages simultaneously.  But the socket 
functions live at the Python library level and below, and I have only 
looked cursorily at Python's threading support.  And while I could write a 
faster C library for Python that grabs several pages at once, depositing 
them on disk when ready, the effort involved leads to a more fundamental 
question:

Is this really a problem?

I looked into it because of a Usenet post, not because it was bugging 
me.  I have no problem firing off a Pluck and doing something else for five 
minutes.  David is pretty adamant it's not a problem under Linux, so it 
could be Windows-specific.  Does it merit any real attention given that the 
solutions would all intrude on the elegance of the architecture?
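
To make the simultaneous-retrieval idea concrete, here is a rough sketch 
in current Python using the standard concurrent.futures and urllib.request 
modules.  It is an illustration under assumptions, not Plucker code: the 
names fetch_url and fetch_all and the pool size of 4 are invented for the 
example.  Threads suit this case because the time is spent waiting on 
sockets, and CPython releases its interpreter lock during blocking I/O.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_url(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_url, max_workers=4):
    """Fetch every URL on a small pool of threads.

    Returns a dict mapping each URL to its content, or to the exception
    raised while fetching it, so one bad page does not abort the run.
    """
    urls = list(urls)  # allow any iterable; we walk it twice below

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as exc:  # record the failure and keep going
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so they line up with urls.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

The fetch argument is there so the pool logic can be exercised with a 
fake fetcher and no network access.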

        Tony McNamara

_______________________________________________
plucker-list mailing list
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
