On 5/2/2016 2:27 AM, Stephen Hansen wrote:
On Sun, May 1, 2016, at 10:59 PM, DFS wrote:
import time
import urllib2

loops = 10    # 10 loops per run, as described below
# webpage (the URL to fetch) and webfile (the output path) are defined earlier in the script

startTime = time.clock()
for i in range(loops):
    r = urllib2.urlopen(webpage)
    f = open(webfile, "w")
    f.write(r.read())
    f.close()    # note the parentheses; a bare f.close is a no-op
endTime = time.clock()
print "Finished urllib2 in %.2g seconds" % (endTime - startTime)

Yeah, on my system I get 1.8 seconds total out of this, amounting to 0.18s per request.

You get 1.8 seconds total for the 10 loops? That's less than half as fast as my result of 0.88 seconds. Surprising.
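A side note on the timer itself: in Python 2, time.clock() is wall-clock time on Windows but CPU time on Unix, where it would mostly exclude time spent waiting on the network. time.time() is the portable wall-clock choice for timing downloads; a minimal sketch of the same loop using it:

import time
import urllib2

loops = 10
# webpage and webfile as in the snippet above

startTime = time.time()    # wall-clock on every platform
for i in range(loops):
    r = urllib2.urlopen(webpage)
    f = open(webfile, "w")
    f.write(r.read())
    f.close()
endTime = time.time()
print "Finished urllib2 in %.2f seconds" % (endTime - startTime)

This changes little on Windows, but on Unix time.clock() can make network-bound code look misleadingly fast.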


I'm again going back to the point of: it's fast enough. When comparing
two small numbers, "twice as slow" is meaningless.

Speed is always meaningful.

I know Python is relatively slow, but it's a cool, concise, powerful language. I'm extremely impressed by how tight the code can get.


You have an assumption you haven't answered, that downloading a 10 meg
file will be twice as slow as downloading this tiny file. You haven't
proven that at all.

True. And it has been my assumption, though not with a 10MB file.


I suspect you have a constant overhead of X, and in this toy example,
that makes it seem twice as slow. But when downloading a file of real
size, you'll have the same constant factor, at which point the
difference is irrelevant.

Good point.  Test below.
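In symbols, the suspicion is that total time is roughly t(size) = c + size/rate, where c is a fixed per-request overhead (DNS lookup, TCP handshake, and so on) and rate is the transfer throughput: for small files c dominates, for large files size/rate does. A toy illustration of that model (the overhead and rate below are made up, not measured):

# toy model of the constant-overhead argument; 50 ms overhead and
# 1 MB/s throughput are hypothetical numbers for illustration only
def fetch_time(size_bytes, overhead=0.05, rate=1e6):
    return overhead + size_bytes / rate

for size in (3546, 58854, 10 * 1024 * 1024):
    print "%8d bytes -> %6.2f seconds" % (size, fetch_time(size))

With these numbers, doubling the overhead would nearly double the small-file time but barely register at 10 MB.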


If you believe otherwise, demonstrate it.

http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2

It's a 58854-byte file when saved to disk (the smaller file was 3546 bytes), so this one is 16.6x larger. If the time scaled linearly, I would expect Python to run in 16.6 * 0.88 = 14.6 seconds.
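For context, here is a minimal sketch of what a harness like timeGetHTML.py could look like (a hypothetical reconstruction; the actual script wasn't posted), in the same Python 2 idiom as the rest of the thread:

import time
import urllib, urllib2, requests, pycurl
from StringIO import StringIO

url = ("http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga"
       "&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2")
loops = 10

def timed(label, fetch):
    start = time.time()
    for i in range(loops):
        fetch()
    print "Finished %s in %.2f seconds" % (label, time.time() - start)

def fetch_pycurl():
    buf = StringIO()
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()

timed("urllib",   lambda: urllib.urlopen(url).read())
timed("urllib2",  lambda: urllib2.urlopen(url).read())
timed("requests", lambda: requests.get(url).content)
timed("pycurl",   fetch_pycurl)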

10 loops per run

1st run
$ python timeGetHTML.py
Finished urllib in 8.5 seconds
Finished urllib2 in 5.6 seconds
Finished requests in 7.8 seconds
Finished pycurl in 6.5 seconds

wait a couple minutes, then 2nd run
$ python timeGetHTML.py
Finished urllib in 5.6 seconds
Finished urllib2 in 5.7 seconds
Finished requests in 5.2 seconds
Finished pycurl in 6.4 seconds

That's a little more than 1/3 of my linear estimate, so good news.
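Treating the two tests as data points for the constant-overhead model gives a rough back-of-envelope fit (using urllib2's 0.88 s for the 3546-byte file and ~5.6 s here, both totals for 10 loops):

# two measurements: (size in bytes, seconds per 10 loops)
small = (3546, 0.88)
large = (58854, 5.6)

# solve t = c + k * size for the per-byte cost k and fixed overhead c
k = (large[1] - small[1]) / (large[0] - small[0])
c = small[1] - k * small[0]
print "per-byte cost:  %.2e s/byte (~%.0f KB/s)" % (k, 1 / k / 1024)
print "fixed overhead: %.2f s per 10 requests (%.3f s each)" % (c, c / 10)

That works out to roughly 0.58 s of fixed overhead per 10 requests and about 11 KB/s of transfer, i.e. most of the small-file time was overhead, which is exactly the constant factor suspected above.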

(When I was doing these tests, some of the Python results were 0.75 seconds, way too fast. I checked: no data had been written to the file, and I couldn't even open the webpage in a browser. It looks like I had been temporarily blocked from the site; after a couple of minutes I was able to access it again.)

I noticed urllib and pycurl returned the HTML as-is, but urllib2 and requests added enhancements that should make the data easier to parse. Based on speed, functionality, and documentation, I believe I'll be using the requests HTTP library (I will actually be doing a small amount of web scraping).
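For that scraping use, a minimal requests sketch of the same fetch-and-save pattern (the output filename is illustrative):

import requests

url = ("http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga"
       "&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2")
r = requests.get(url)
r.raise_for_status()              # raise an exception on HTTP errors instead of saving junk
f = open("results.html", "wb")    # "wb": write the body bytes exactly as received
f.write(r.content)
f.close()

The raise_for_status() call would also help flag the blocked-site case above, where a too-fast run quietly wrote nothing to disk.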


VBScript
1st run: 7.70 seconds
2nd run: 5.38 seconds
3rd run: 7.71 seconds

So Python matches or beats VBScript on this much larger file.  Kewl.

