Re: Fastest way to retrieve and write html contents to file

2016-05-03 Thread DFS

On 5/3/2016 2:41 PM, Tim Chase wrote:

On 2016-05-03 13:00, DFS wrote:

On 5/3/2016 11:28 AM, Tim Chase wrote:

On 2016-05-03 00:24, DFS wrote:

One small comparison I was able to make was VBA vs python/pyodbc
to summarize an Access database.  Not quite a fair test, but
interesting nonetheless.

Access 2003 file
Access 2003 VBA code
Time: 0.18 seconds

same Access 2003 file
32-bit python 2.7.11 + 32-bit pyodbc 3.0.6
Time: 0.49 seconds


Curious whether you're forcing Access VBA to talk over ODBC or
whether Access is using native access/file-handling (and thus
bypassing the ODBC overhead)?


The latter, which is why I said "not quite a fair test".


Can you try the same tests, getting Access/VBA to use ODBC instead to
see how much overhead ODBC entails?

-tkc



Done.

I dropped a few extraneous tables from the database (was 114 tables):

Access 2003 .mdb file
2,009,164 rows
97 tables  (max row = 600288)
725 columns
  text:  389
  boolean:   4
  numeric:   261
  date-time: 69
  binary:    2
264 indexes (25 foreign keys)*
299,167,744 bytes on disk


1. DAO
   Time: 0.15 seconds

2. ADODB, Access ODBC driver, OpenSchema method**
   Time: 0.26 seconds

3. python, pyodbc, Access ODBC driver
   Time: 0.42 seconds




* despite being written by Microsoft, the Access ODBC driver doesn't
  support the ODBC SQLForeignKeys function, so the python code doesn't
  show a count of foreign keys

** the Access ODBC driver doesn't support the adSchemaIndexes or
   adSchemaForeignKeys query types, so I used DAO code to count
   indexes and foreign keys.
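For reference, here is a minimal pyodbc sketch of this kind of summary
(not the actual script used for the timings above; the driver string,
.mdb path, and per-table loop are assumptions):

import pyodbc

# Hypothetical connection string - point DBQ at your own .mdb file.
CONN = r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=D:\mydata.mdb"

cnxn = pyodbc.connect(CONN)
cur = cnxn.cursor()

tables = [t.table_name for t in cur.tables(tableType='TABLE')]
rows = cols = indexes = 0
for name in tables:
    rows += cur.execute("SELECT COUNT(*) FROM [%s]" % name).fetchone()[0]
    cols += len(list(cur.columns(table=name)))
    # statistics() maps to ODBC SQLStatistics and lists each table's indexes
    indexes += len(set(r.index_name for r in cur.statistics(name) if r.index_name))

print "%d tables, %d rows, %d columns, %d indexes" % (
    len(tables), rows, cols, indexes)

# SQLForeignKeys isn't implemented by the Access ODBC driver (see the
# footnotes above), so this raises pyodbc.Error instead of giving a count.
try:
    fk_count = len(list(cur.foreignKeys(table=tables[0])))
except pyodbc.Error:
    fk_count = None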






--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-03 Thread Tim Chase
On 2016-05-03 13:00, DFS wrote:
> On 5/3/2016 11:28 AM, Tim Chase wrote:
> > On 2016-05-03 00:24, DFS wrote:
> >> One small comparison I was able to make was VBA vs python/pyodbc
> >> to summarize an Access database.  Not quite a fair test, but
> >> interesting nonetheless.
> >>
> >> Access 2003 file
> >> Access 2003 VBA code
> >> Time: 0.18 seconds
> >>
> >> same Access 2003 file
> >> 32-bit python 2.7.11 + 32-bit pyodbc 3.0.6
> >> Time: 0.49 seconds
> >
> > Curious whether you're forcing Access VBA to talk over ODBC or
> > whether Access is using native access/file-handling (and thus
> > bypassing the ODBC overhead)?
> 
> The latter, which is why I said "not quite a fair test".

Can you try the same tests, getting Access/VBA to use ODBC instead to
see how much overhead ODBC entails?

-tkc



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-03 Thread DFS

On 5/3/2016 11:28 AM, Tim Chase wrote:

On 2016-05-03 00:24, DFS wrote:

One small comparison I was able to make was VBA vs python/pyodbc to
summarize an Access database.  Not quite a fair test, but
interesting nonetheless.

Access 2003 file
Access 2003 VBA code
Time: 0.18 seconds

same Access 2003 file
32-bit python 2.7.11 + 32-bit pyodbc 3.0.6
Time: 0.49 seconds


Curious whether you're forcing Access VBA to talk over ODBC or
whether Access is using native access/file-handling (and thus
bypassing the ODBC overhead)?



The latter, which is why I said "not quite a fair test".


--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-03 Thread Tim Chase
On 2016-05-03 00:24, DFS wrote:
> One small comparison I was able to make was VBA vs python/pyodbc to 
> summarize an Access database.  Not quite a fair test, but
> interesting nonetheless.
> 
> Access 2003 file
> Access 2003 VBA code
> Time: 0.18 seconds
>
> same Access 2003 file
> 32-bit python 2.7.11 + 32-bit pyodbc 3.0.6
> Time: 0.49 seconds

Curious whether you're forcing Access VBA to talk over ODBC or
whether Access is using native access/file-handling (and thus
bypassing the ODBC overhead)?

-tkc


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/3/2016 12:06 AM, Michael Torrie wrote:


Now if you want to talk about processing the data once you have it,
there we can talk about speeds and optimization.


Be glad to.  Helps me learn python, so bring whatever challenge you want 
and I'll try to keep up.


One small comparison I was able to make was VBA vs python/pyodbc to 
summarize an Access database.  Not quite a fair test, but interesting 
nonetheless.


---

Access 2003 file
Access 2003 VBA code

2,099,101 rows
114 tables  (max row = 600288)
971 columns
  text:  503
  boolean:   4
  numeric:   351
  date-time: 108
  binary:    5
309 indexes (25 foreign keys)
333,549,568 bytes on disk
Time: 0.18 seconds

---

same Access 2003 file
32-bit python 2.7.11 + 32-bit pyodbc 3.0.6

2,099,101 rows
114 tables (max row = 600288)
971 columns
  text:  503
  numeric:   351
  date-time: 108
  binary:    5
  boolean:   4
309 indexes (foreign keys na via ODBC*)
333,549,568 bytes on disk
Time: 0.49 seconds

* the Access ODBC driver doesn't support
  the SQLForeignKeys function

---

--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Michael Torrie
On 05/02/2016 01:37 AM, DFS wrote:
> So python matches or beats VBScript at this much larger file.  Kewl.

If you download something large enough to be meaningful, you'll find the
runtime speeds should all converge to something showing your internet
connection speed.  Try downloading a 4 GB file, for example.  You're
trying to benchmark an io-bound operation.  After you move past the very
small and meaningless examples that simply benchmark the overhead of the
connection building, you'll find that all languages, even compiled
languages like C, should run at the same speed on average.  Neither VBS
nor Python will be faster than the other.

Now if you want to talk about processing the data once you have it,
there we can talk about speeds and optimization.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 10:00 PM, Chris Angelico wrote:

On Tue, May 3, 2016 at 11:51 AM, DFS  wrote:

On 5/2/2016 3:19 AM, Chris Angelico wrote:


There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).



100 loops
Finished VBScript in 3.953 seconds
Finished VBScript in 3.608 seconds
Finished VBScript in 3.610 seconds

Bit of a per-loop speedup going from 10 to 100.


How many seconds was it for 10 loops?

ChrisA


~0.44


--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Chris Angelico
On Tue, May 3, 2016 at 11:51 AM, DFS  wrote:
> On 5/2/2016 3:19 AM, Chris Angelico wrote:
>
>> There's an easier way to test if there's caching happening. Just crank
>> the iterations up from 10 to 100 and see what happens to the times. If
>> your numbers are perfectly fair, they should be perfectly linear in
>> the iteration count; eg a 1.8 second ten-iteration loop should become
>> an 18 second hundred-iteration loop. Obviously they won't be exactly
>> that, but I would expect them to be reasonably close (eg 17-19
>> seconds, but not 2 seconds).
>
>
> 100 loops
> Finished VBScript in 3.953 seconds
> Finished VBScript in 3.608 seconds
> Finished VBScript in 3.610 seconds
>
> Bit of a per-loop speedup going from 10 to 100.

How many seconds was it for 10 loops?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 3:19 AM, Chris Angelico wrote:


There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).


100 loops
Finished VBScript in 3.953 seconds
Finished VBScript in 3.608 seconds
Finished VBScript in 3.610 seconds

Bit of a per-loop speedup going from 10 to 100.



Then the next thing to test would be to create a deliberately-slow web
server, and connect to that. Put a two-second delay into it, to
simulate a distant or overloaded server, and see if your logs show the
correct result. Something like this:



import time
try:
    import http.server as BaseHTTPServer # Python 3
except ImportError:
    import BaseHTTPServer # Python 2

class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type","text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(2)
        self.wfile.write(b"world!")

server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
server.serve_forever()

---

Test that with a web browser or command-line downloader (go to
http://127.0.0.1:1234/), and make sure that (a) it produces "Hello,
world!", and (b) it takes two seconds. Then set your test scripts to
downloading that URL. (Be sure to set them back to low iteration
counts first!) If the times are true and fair, they should all come
out pretty much the same - ten iterations, twenty seconds. And since
all that's changed is the server, this will be an accurate
demonstration of what happens in the real world: network requests
aren't always fast. Incidentally, you can also watch the server's log
to see if it's getting the appropriate number of requests.

It may turn out that changing the web server actually materially
changes your numbers. Comment out the sleep call and try it again -
you might find that your numbers come closer together, because this
naive server doesn't send back 304 NOT MODIFIED responses or anything.
Again, though, this would prove that you're not actually measuring
language performance, because the tests are more dependent on the
server than the client.

Even if the files themselves aren't being cached, you might find that
DNS is. So if you truly want to eliminate variables, replace the name
in your URL with an IP address. It's another thing that might mess
with your timings, without actually being a language feature.

Networking has about four billion variables in it. You're messing with
one of the least significant: the programming language :)

ChrisA



Thanks for the good feedback.


--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 4:42 AM, Peter Otten wrote:

DFS wrote:


Is VB using a local web cache, and Python not?


I'm not specifying a local web cache with either (wouldn't know how or
where to look).  If you have Windows, you can try it.


I don't have Windows, but if I'm to believe

http://stackoverflow.com/questions/5235464/how-to-make-microsoft-xmlhttprequest-honor-cache-control-directive

the page is indeed cached and you can disable caching with


Option Explicit
Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
webpage = "http://econpy.pythonanywhere.com/ex/001.html";
webfile  = "D:\econpy001.html"
startTime = Timer
For i = 1 to 10
Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
xmlHTTP.Open "GET", webpage


  xmlHTTP.setRequestHeader "Cache-Control", "max-age=0"



Tried that, and from later on that stackoverflow page:

xmlHTTP.setRequestHeader "Cache-Control", "private"

Neither made a difference.  In fact, I saw faster times than ever - as 
low as 0.41 for 10 loops.

--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Tim Chase
On 2016-05-02 00:06, DFS wrote:
> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for
> 10 iterations, vs 0.88 for python.

In addition to the other debugging recommendations in sibling
threads, a couple other things to try:

1) use a local debugging proxy so that you can compare the headers to
see if anything stands out

2) in light of #1, can you confirm/deny whether one is using gzip
compression and the other isn't?

-tkc




-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Peter Otten
DFS wrote:

>> Is VB using a local web cache, and Python not?
> 
> I'm not specifying a local web cache with either (wouldn't know how or
> where to look).  If you have Windows, you can try it.

I don't have Windows, but if I'm to believe

http://stackoverflow.com/questions/5235464/how-to-make-microsoft-xmlhttprequest-honor-cache-control-directive

the page is indeed cached and you can disable caching with

> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html"
> webfile  = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
> Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
> xmlHTTP.Open "GET", webpage
  
  xmlHTTP.setRequestHeader "Cache-Control", "max-age=0"

> xmlHTTP.Send
> Set fso = CreateObject("Scripting.FileSystemObject")
> Set fOut = fso.CreateTextFile(webfile, True)
> fOut.WriteLine xmlHTTP.ResponseText
> fOut.Close
> Set fOut= Nothing
> Set fso = Nothing
> Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime -
> startTime,3) & " seconds"
> ---
> save it to a .vbs file and run it like this:
> $cscript /nologo filename.vbs
> 
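For the Python side of the same comparison, a similarly cache-averse
request can be made by sending the header yourself; a minimal sketch with
the requests library (the header value mirrors the max-age=0 line
inserted above, otherwise this is just an untested illustration):

import requests

webpage = "http://econpy.pythonanywhere.com/ex/001.html"
# ask any cache along the way not to serve a stored copy
r = requests.get(webpage, headers={"Cache-Control": "max-age=0"})
open("econpy001.html", "w").write(r.content)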


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Stephen Hansen
On Mon, May 2, 2016, at 12:37 AM, DFS wrote:
> On 5/2/2016 2:27 AM, Stephen Hansen wrote:
> > I'm again going back to the point of: it's fast enough. When comparing
> > two small numbers, "twice as slow" is meaningless.
> 
> Speed is always meaningful.
> 
> I know python is relatively slow, but it's a cool, concise, powerful 
> language.  I'm extremely impressed by how tight the code can get.

I'm sorry, but no. Speed is not always meaningful. 

It's not even usually meaningful, because you can't quantify what "speed"
is. In context, you're claiming this is twice as slow (even though my
tests show dramatically better performance), but what details are
different?

You're ignoring the fact that Python might have a constant overhead --
meaning, for a 1k download, it might have X speed cost. For a 1meg
download, it might still have the exact same X cost.

Looking narrowly, that overhead looks like "twice as slow", but that's
not meaningful at all. Looking larger, that overhead is a pittance.

You aren't measuring that.
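To put toy numbers on that (all figures below are invented purely to
illustrate the point, they are not measurements):

# hypothetical fixed cost per request, plus hypothetical pure transfer times
overhead = 0.04
transfer = {"3.5 KB page": 0.004, "58 KB page": 0.520}

for name, t in sorted(transfer.items()):
    print "%-12s %.3fs + %.3fs overhead = %.3fs (%.0f%% slower than bare transfer)" % (
        name, t, overhead, t + overhead, 100.0 * overhead / t)

On made-up numbers like these the tiny page looks many times slower while
the bigger one is barely affected - which is the overhead argument in a
nutshell.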

> > You have an assumption you haven't answered, that downloading a 10 meg
> > file will be twice as slow as downloading this tiny file. You haven't
> > proven that at all.
> 
> True.  And it has been my assumption - tho not with 10MB file.

And that assumption is completely invalid.

> I noticed urllib and curl returned the html as is, but urllib2 and 
> requests added enhancements that should make the data easier to parse. 
> Based on speed and functionality and documentation, I believe I'll be 
> using the requests HTTP library (I will actually be doing a small amount 
> of web scraping).

The requests library's added-value is ease-of-use, and its overhead is
likely tiny: so using it means you spend less effort making a thing
happen. I recommend you embrace this. 

> VBScript
> 1st run: 7.70 seconds
> 2nd run: 5.38
> 3rd run: 7.71
> 
> So python matches or beats VBScript at this much larger file.  Kewl.

This is what I'm talking about: Python might have a constant overhead,
but looking at larger operations, it's probably comparable. Not fast,
mind you. Python isn't the fastest language out there. But in real-world
work, it's usually fast enough.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 2:27 AM, Stephen Hansen wrote:

On Sun, May 1, 2016, at 10:59 PM, DFS wrote:

startTime = time.clock()
for i in range(loops):
    r = urllib2.urlopen(webpage)
    f = open(webfile,"w")
    f.write(r.read())
    f.close()
endTime = time.clock()
print "Finished urllib2 in %.2g seconds" %(endTime-startTime)


Yeah on my system I get 1.8 out of this, amounting to 0.18s.


You get 1.8 seconds total for the 10 loops?  That's less than half as 
fast as my results.  Surprising.




I'm again going back to the point of: it's fast enough. When comparing
two small numbers, "twice as slow" is meaningless.


Speed is always meaningful.

I know python is relatively slow, but it's a cool, concise, powerful 
language.  I'm extremely impressed by how tight the code can get.




You have an assumption you haven't answered, that downloading a 10 meg
file will be twice as slow as downloading this tiny file. You haven't
proven that at all.


True.  And it has been my assumption - tho not with 10MB file.



I suspect you have a constant overhead of X, and in this toy example,
that makes it seem twice as slow. But when downloading a file of size,
you'll have the same constant factor, at which point the difference is
irrelevant.


Good point.  Test below.



If you believe otherwise, demonstrate it.


http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2

It's a 58854 byte file when saved to disk (smaller file was 3546 bytes), 
so this is 16.6x larger.  So I would expect python to linearly run in 
16.6 * 0.88 = 14.6 seconds.


10 loops per run

1st run
$ python timeGetHTML.py
Finished urllib in 8.5 seconds
Finished urllib2 in 5.6 seconds
Finished requests in 7.8 seconds
Finished pycurl in 6.5 seconds

wait a couple minutes, then 2nd run
$ python timeGetHTML.py
Finished urllib in 5.6 seconds
Finished urllib2 in 5.7 seconds
Finished requests in 5.2 seconds
Finished pycurl in 6.4 seconds

It's a little more than 1/3 of my estimate - so good news.

(when I was doing these tests, some of the python results were 0.75 
seconds - way too fast, so I checked and no data was written to file, 
and I couldn't even open the webpage with a browser.  Looks like I had 
been temporarily blocked from the site.  After a couple minutes, I was 
able to access it again).


I noticed urllib and curl returned the html as is, but urllib2 and 
requests added enhancements that should make the data easier to parse. 
Based on speed and functionality and documentation, I believe I'll be 
using the requests HTTP library (I will actually be doing a small amount 
of web scraping).
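Since the plan is only a small amount of scraping, a minimal sketch of
what that might look like with requests plus the standard-library parser
(the link-collecting parser is purely illustrative, not the actual task):

import requests
from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

html = requests.get("http://econpy.pythonanywhere.com/ex/001.html").text
p = LinkCollector()
p.feed(html)
print "%d links found" % len(p.links)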



VBScript
1st run: 7.70 seconds
2nd run: 5.38
3rd run: 7.71

So python matches or beats VBScript at this much larger file.  Kewl.


--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Chris Angelico
On Mon, May 2, 2016 at 4:47 PM, DFS  wrote:
> I'm not specifying a local web cache with either (wouldn't know how or where
> to look).  If you have Windows, you can try it.
> ---
> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html"
> webfile  = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
>  Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
>  xmlHTTP.Open "GET", webpage
>  xmlHTTP.Send
>  Set fso = CreateObject("Scripting.FileSystemObject")
>  Set fOut = fso.CreateTextFile(webfile, True)
>   fOut.WriteLine xmlHTTP.ResponseText
>  fOut.Close
>  Set fOut= Nothing
>  Set fso = Nothing
>  Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime - startTime,3) &
> " seconds"
> ---

There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).

Then the next thing to test would be to create a deliberately-slow web
server, and connect to that. Put a two-second delay into it, to
simulate a distant or overloaded server, and see if your logs show the
correct result. Something like this:



import time
try:
    import http.server as BaseHTTPServer # Python 3
except ImportError:
    import BaseHTTPServer # Python 2

class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type","text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(2)
        self.wfile.write(b"world!")

server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
server.serve_forever()

---

Test that with a web browser or command-line downloader (go to
http://127.0.0.1:1234/), and make sure that (a) it produces "Hello,
world!", and (b) it takes two seconds. Then set your test scripts to
downloading that URL. (Be sure to set them back to low iteration
counts first!) If the times are true and fair, they should all come
out pretty much the same - ten iterations, twenty seconds. And since
all that's changed is the server, this will be an accurate
demonstration of what happens in the real world: network requests
aren't always fast. Incidentally, you can also watch the server's log
to see if it's getting the appropriate number of requests.
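A minimal client-side timing loop for that check might look like this
(just a sketch, Python 2 to match the scripts above; wall-clock time is
used because the delay is a server-side sleep):

import time
import urllib2

start = time.time()
for i in range(10):
    urllib2.urlopen("http://127.0.0.1:1234/").read()
print "10 fetches from the slow server took %.1f seconds" % (time.time() - start)
# with the two-second sleep in place this should report roughly 20 seconds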

It may turn out that changing the web server actually materially
changes your numbers. Comment out the sleep call and try it again -
you might find that your numbers come closer together, because this
naive server doesn't send back 304 NOT MODIFIED responses or anything.
Again, though, this would prove that you're not actually measuring
language performance, because the tests are more dependent on the
server than the client.

Even if the files themselves aren't being cached, you might find that
DNS is. So if you truly want to eliminate variables, replace the name
in your URL with an IP address. It's another thing that might mess
with your timings, without actually being a language feature.
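One way to take DNS out of the per-iteration timing is to resolve the
name once up front; a sketch (note that for name-based virtual hosting
you would still need to send the original Host header, so this is only
illustrative):

import socket

host = "econpy.pythonanywhere.com"
ip = socket.gethostbyname(host)   # one DNS lookup, done before any timing loop
print host, "->", ip
# a timing loop could then request "http://%s/ex/001.html" % ip while
# sending "Host: econpy.pythonanywhere.com" so the right site still answers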

Networking has about four billion variables in it. You're messing with
one of the least significant: the programming language :)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread DFS

On 5/2/2016 2:05 AM, Steven D'Aprano wrote:

On Monday 02 May 2016 15:00, DFS wrote:


I tried the 10-loop test several times with all versions.

The results were 100% consistent: VBSCript xmlHTTP was always 2x faster
than any python method.



Are you absolutely sure you're comparing the same job in two languages?


As near as I can tell.  In VBScript I'm actually dereferencing various 
objects (that adds to the time), but I don't do that in python.  I don't 
know enough to even know if it's necessary, or good practice, or what.






Is VB using a local web cache, and Python not?


I'm not specifying a local web cache with either (wouldn't know how or 
where to look).  If you have Windows, you can try it.

---
Option Explicit
Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
webpage = "http://econpy.pythonanywhere.com/ex/001.html";
webfile  = "D:\econpy001.html"
startTime = Timer
For i = 1 to 10
 Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
 xmlHTTP.Open "GET", webpage
 xmlHTTP.Send
 Set fso = CreateObject("Scripting.FileSystemObject")
 Set fOut = fso.CreateTextFile(webfile, True)
  fOut.WriteLine xmlHTTP.ResponseText
 fOut.Close
 Set fOut= Nothing
 Set fso = Nothing
 Set xmlHTTP = Nothing
Next
endTime = Timer
wscript.echo "Finished VBScript in " & FormatNumber(endTime - 
startTime,3) & " seconds"

---
save it to a .vbs file and run it like this:
$cscript /nologo filename.vbs



Are you saving files with both
tests? To the same local drive? (To ensure you aren't measuring the
difference between "write this file to a slow IDE hard disk, write that file
to a fast SSD".)


Identical functionality (retrieve webpage, write html to file).  Same 
webpage, written to the same folder on the same hard drive (not SSD).


The 10 file writes (open/write/close) don't make a meaningful difference 
at all:

VBScript 0.0156 seconds
urllib2  0.0034 seconds

This file is 3.55K.



Once you are sure that you are comparing the same task in two languages,
then make sure the measurement is meaningful. If you change from a (let's
say) 1 KB file to a 100 KB file, do you see the same 2 x difference? What if
you increase it to a 1 MB file?


Do you know a webpage I can hit 10x repeatedly to download a good size 
file?  I'm always paranoid they'll block me thinking I'm a 
"professional" web scraper or something.


Thanks


--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Stephen Hansen
On Sun, May 1, 2016, at 10:59 PM, DFS wrote:
> startTime = time.clock()
> for i in range(loops):
>   r = urllib2.urlopen(webpage)
>   f = open(webfile,"w")
>   f.write(r.read())
>   f.close()
> endTime = time.clock()  
> print "Finished urllib2 in %.2g seconds" %(endTime-startTime)

Yeah on my system I get 1.8 out of this, amounting to 0.18s. 

I'm again going back to the point of: it's fast enough. When comparing
two small numbers, "twice as slow" is meaningless.

You have an assumption you haven't answered, that downloading a 10 meg
file will be twice as slow as downloading this tiny file. You haven't
proven that at all. 

I suspect you have a constant overhead of X, and in this toy example,
that makes it seem twice as slow. But when downloading a file of size,
you'll have the same constant factor, at which point the difference is
irrelevant. 

If you believe otherwise, demonstrate it.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Steven D'Aprano
On Monday 02 May 2016 15:00, DFS wrote:

> I tried the 10-loop test several times with all versions.
> 
> The results were 100% consistent: VBSCript xmlHTTP was always 2x faster
> than any python method.


Are you absolutely sure you're comparing the same job in two languages? Is 
VB using a local web cache, and Python not? Are you saving files with both 
tests? To the same local drive? (To ensure you aren't measuring the 
difference between "write this file to a slow IDE hard disk, write that file 
to a fast SSD".)

Once you are sure that you are comparing the same task in two languages, 
then make sure the measurement is meaningful. If you change from a (let's 
say) 1 KB file to a 100 KB file, do you see the same 2 x difference? What if 
you increase it to a 1 MB file?


-- 
Steve

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Steven D'Aprano
On Monday 02 May 2016 15:04, DFS wrote:

> 0.2 is half as fast as 0.1, here.
> 
> And two small numbers turn into bigger numbers when the webpage is big,
> and soon the download time differences are measured in minutes, not half
> a second.

It takes twice as long to screw a screw into timber as to hammer a nail 
into the same timber.

Therefore if builders change from nails to screws, they can finish building 
the house in half the time.



-- 
Steve

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread DFS

On 5/2/2016 1:15 AM, Stephen Hansen wrote:

On Sun, May 1, 2016, at 10:00 PM, DFS wrote:

I tried the 10-loop test several times with all versions.


Also how, _exactly_, are you testing this?

C:\Python27>python -m timeit "filename='C:\\test.txt';
webpage='http://econpy.pythonanywhere.com/ex/001.html'; import urllib2;
r = urllib2.urlopen(webpage); f = open(filename, 'w');
f.write(r.read()); f.close();"
10 loops, best of 3: 175 msec per loop

That's a whole lot less than 0.88 secs.


Indeed.


-
import requests, urllib, urllib2, pycurl
import time

webpage = "http://econpy.pythonanywhere.com/ex/001.html"
webfile = "D:\\econpy001.html"
loops   = 10

startTime = time.clock()
for i in range(loops):
    urllib.urlretrieve(webpage,webfile)
endTime = time.clock()
print "Finished urllib in %.2g seconds" %(endTime-startTime)

startTime = time.clock()
for i in range(loops):
    r = urllib2.urlopen(webpage)
    f = open(webfile,"w")
    f.write(r.read())
    f.close()
endTime = time.clock()
print "Finished urllib2 in %.2g seconds" %(endTime-startTime)

startTime = time.clock()
for i in range(loops):
    r = requests.get(webpage)
    f = open(webfile,"w")
    f.write(r.text)
    f.close()
endTime = time.clock()
print "Finished requests in %.2g seconds" %(endTime-startTime)

startTime = time.clock()
for i in range(loops):
    with open(webfile + str(i) + ".txt", 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, webpage)
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()
endTime = time.clock()
print "Finished pycurl in %.2g seconds" %(endTime-startTime)
-

$ python getHTML.py
Finished urllib in 0.88 seconds
Finished urllib2 in 0.83 seconds
Finished requests in 0.89 seconds
Finished pycurl in 1.1 seconds

Those results are consistent.  They go up or down a little, but never 
below 0.82 seconds (for urllib2), or above 1.2 seconds (for pycurl)


VBScript is consistently 0.44 to 0.48

--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Stephen Hansen
On Sun, May 1, 2016, at 10:04 PM, DFS wrote:
> And two small numbers turn into bigger numbers when the webpage is big, 
> and soon the download time differences are measured in minutes, not half 
> a second.

Are you sure of that? Have you determined that the time is not a
constant overhead verses that the time is directly relational to the
size of the page? If so, how have you determined that?

You aren't showing how you're testing. A 0.4s difference is meaningless to
me if it's a constant overhead. If it's twice as slow for a 1 meg file,
then you might have an issue. Maybe. You haven't shown that.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Stephen Hansen
On Sun, May 1, 2016, at 10:00 PM, DFS wrote:
> I tried the 10-loop test several times with all versions.

Also how, _exactly_, are you testing this?

C:\Python27>python -m timeit "filename='C:\\test.txt';
webpage='http://econpy.pythonanywhere.com/ex/001.html'; import urllib2;
r = urllib2.urlopen(webpage); f = open(filename, 'w');
f.write(r.read()); f.close();"
10 loops, best of 3: 175 msec per loop

That's a whole lot less than 0.88 secs.
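(For reference, roughly the same measurement can be scripted with the
timeit module instead of the command line; this is just a sketch:)

import timeit

stmt = """
import urllib2
r = urllib2.urlopen('http://econpy.pythonanywhere.com/ex/001.html')
f = open('C:/test.txt', 'w')   # forward slash works fine on Windows
f.write(r.read())
f.close()
"""
# number=10 runs the whole statement ten times, like the 10-loop tests above
print "%.3f seconds for 10 iterations" % timeit.timeit(stmt, number=10)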

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Chris Angelico
On Mon, May 2, 2016 at 3:04 PM, DFS  wrote:
> And two small numbers turn into bigger numbers when the webpage is big, and
> soon the download time differences are measured in minutes, not half a
> second.
>
> So, any ideas?

So, measure with bigger web pages, and find out whether it's really a
2:1 ratio or a half-second difference. When download times are
measured in minutes, a half second difference is insignificant.

Extrapolating is dangerous.
https://xkcd.com/605/

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread DFS

On 5/2/2016 1:00 AM, Stephen Hansen wrote:

On Sun, May 1, 2016, at 09:50 PM, DFS wrote:

On 5/2/2016 12:40 AM, Chris Angelico wrote:

On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen  wrote:

On Sun, May 1, 2016, at 09:06 PM, DFS wrote:

Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
iterations, vs 0.88 for python.

...

I know it's asking a lot, but is there a really fast AND really short
python solution for this simple thing?


0.88 is not fast enough for you? That's less than a second.


Also, this is timings of network and disk operations. Unless something
pathological is happening, the language used won't make any
difference.

ChrisA



Unfortunately, the VBScript is twice as fast as any python method.


And 0.2 is twice as fast as 0.1. When you have two small numbers, 'twice
as fast' isn't particularly meaningful as a metric.


0.2 is half as fast as 0.1, here.

And two small numbers turn into bigger numbers when the webpage is big, 
and soon the download time differences are measured in minutes, not half 
a second.


So, any ideas?
--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread DFS

On 5/2/2016 12:49 AM, Ben Finney wrote:

DFS  writes:


Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
iterations, vs 0.88 for python.

[…]

urllib2 and requests were about the same speed as urllib.urlretrieve,
while pycurl was significantly slower (1.2 seconds).


Network access is notoriously erratic in its timing. The program, and
the machine on which it runs, is subject to a great many external
effects once the request is sent — effects which will significantly
alter the delay before a response is completed.

How have you controlled for the wide variability in the duration, for
even a given request by the *same code on the same machine*, at
different points in time?

One simple way to do that: Run the exact same test many times (say,
10 000 or so) on the same machine, and then compute the average of all
the durations.

Do the same for each different program, and then you may have more
meaningfully comparable measurements.



I tried the 10-loop test several times with all versions.

The results were 100% consistent: VBSCript xmlHTTP was always 2x faster 
than any python method.




--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Stephen Hansen
On Sun, May 1, 2016, at 09:50 PM, DFS wrote:
> On 5/2/2016 12:40 AM, Chris Angelico wrote:
> > On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen  wrote:
> >> On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
> >>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
> >>> iterations, vs 0.88 for python.
> >> ...
> >>> I know it's asking a lot, but is there a really fast AND really short
> >>> python solution for this simple thing?
> >>
> >> 0.88 is not fast enough for you? That's less than a second.
> >
> > Also, this is timings of network and disk operations. Unless something
> > pathological is happening, the language used won't make any
> > difference.
> >
> > ChrisA
> 
> 
> Unfortunately, the VBScript is twice as fast as any python method.

And 0.2 is twice as fast as 0.1. When you have two small numbers, 'twice
as fast' isn't particularly meaningful as a metric. 

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread DFS

On 5/2/2016 12:40 AM, Chris Angelico wrote:

On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen  wrote:

On Sun, May 1, 2016, at 09:06 PM, DFS wrote:

Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
iterations, vs 0.88 for python.

...

I know it's asking a lot, but is there a really fast AND really short
python solution for this simple thing?


0.88 is not fast enough for you? That's less than a second.


Also, this is timings of network and disk operations. Unless something
pathological is happening, the language used won't make any
difference.

ChrisA



Unfortunately, the VBScript is twice as fast as any python method.




--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Chris Angelico
On Mon, May 2, 2016 at 2:49 PM, Ben Finney  wrote:
> One simple way to do that: Run the exact same test many times (say,
> 10 000 or so) on the same machine, and then compute the average of all
> the durations.
>
> Do the same for each different program, and then you may have more
> meaningfully comparable measurements.

And also find the minimum and maximum durations, too. Averages don't
always tell the whole story.
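A small sketch of that kind of repeated measurement (min / average / max;
the URL is the test page from earlier in the thread and the run count is
kept deliberately small here):

import time
import urllib2

URL = "http://econpy.pythonanywhere.com/ex/001.html"
RUNS = 100   # Ben suggested ~10,000; a smaller number keeps the sketch quick

durations = []
for _ in range(RUNS):
    start = time.time()
    urllib2.urlopen(URL).read()
    durations.append(time.time() - start)

print "min %.3fs  avg %.3fs  max %.3fs" % (
    min(durations), sum(durations) / len(durations), max(durations))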

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Ben Finney
DFS  writes:

> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
> iterations, vs 0.88 for python.
>
> […]
>
> urllib2 and requests were about the same speed as urllib.urlretrieve,
> while pycurl was significantly slower (1.2 seconds).

Network access is notoriously erratic in its timing. The program, and
the machine on which it runs, is subject to a great many external
effects once the request is sent — effects which will significantly
alter the delay before a response is completed.

How have you controlled for the wide variability in the duration, for
even a given request by the *same code on the same machine*, at
different points in time?

One simple way to do that: Run the exact same test many times (say,
10 000 or so) on the same machine, and then compute the average of all
the durations.

Do the same for each different program, and then you may have more
meaningfully comparable measurements.

-- 
 \ “We are no more free to believe whatever we want about God than |
  `\ we are free to adopt unjustified beliefs about science or |
_o__)  history […].” —Sam Harris, _The End of Faith_, 2004 |
Ben Finney

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Chris Angelico
On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen  wrote:
> On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
>> iterations, vs 0.88 for python.
> ...
>> I know it's asking a lot, but is there a really fast AND really short
>> python solution for this simple thing?
>
> 0.88 is not fast enough for you? That's less than a second.

Also, this is timings of network and disk operations. Unless something
pathological is happening, the language used won't make any
difference.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Stephen Hansen
On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10 
> iterations, vs 0.88 for python.
...
> I know it's asking a lot, but is there a really fast AND really short 
> python solution for this simple thing?

0.88 is not fast enough for you? That's less than a second.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list