Re: Fastest way to retrieve and write html contents to file

2016-05-03 Thread DFS

On 5/3/2016 2:41 PM, Tim Chase wrote:

On 2016-05-03 13:00, DFS wrote:

On 5/3/2016 11:28 AM, Tim Chase wrote:

On 2016-05-03 00:24, DFS wrote:

One small comparison I was able to make was VBA vs python/pyodbc
to summarize an Access database.  Not quite a fair test, but
interesting nonetheless.

Access 2003 file
Access 2003 VBA code
Time: 0.18 seconds

same Access 2003 file
32-bit python 2.7.11 + 32-bit pyodbc 3.0.6
Time: 0.49 seconds


Curious whether you're forcing Access VBA to talk over ODBC or
whether Access is using native access/file-handling (and thus
bypassing the ODBC overhead)?


The latter, which is why I said "not quite a fair test".


Can you try the same tests, getting Access/VBA to use ODBC instead to
see how much overhead ODBC entails?

-tkc



Done.

I dropped a few extraneous tables from the database (was 114 tables):

Access 2003 .mdb file
2,009,164 rows
97 tables  (max row = 600288)
725 columns
  text:  389
  boolean:   4
  numeric:   261
  date-time: 69
  binary:    2
264 indexes (25 foreign keys)*
299,167,744 bytes on disk


1. DAO
   Time: 0.15 seconds

2. ADODB, Access ODBC driver, OpenSchema method**
   Time: 0.26 seconds

3. python, pyodbc, Access ODBC driver
   Time: 0.42 seconds




* despite being written by Microsoft, the Access ODBC driver doesn't
  support the ODBC SQLForeignKeys function, so the python code doesn't
  show a count of foreign keys

** the Access ODBC driver doesn't support the adSchemaIndexes or
   adSchemaForeignKeys query types, so I used DAO code to count
   indexes and foreign keys.
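For reference, here is a minimal pyodbc sketch of this kind of summary
(not the actual script used for the timings above; the driver string,
.mdb path, and per-table loop are assumptions):

import pyodbc

# Hypothetical connection string - point DBQ at your own .mdb file.
CONN = r"DRIVER={Microsoft Access Driver (*.mdb)};DBQ=D:\mydata.mdb"

cnxn = pyodbc.connect(CONN)
cur = cnxn.cursor()

tables = [t.table_name for t in cur.tables(tableType='TABLE')]
rows = cols = indexes = 0
for name in tables:
    rows += cur.execute("SELECT COUNT(*) FROM [%s]" % name).fetchone()[0]
    cols += len(list(cur.columns(table=name)))
    # statistics() maps to ODBC SQLStatistics and lists each table's indexes
    indexes += len(set(r.index_name for r in cur.statistics(name) if r.index_name))

print "%d tables, %d rows, %d columns, %d indexes" % (
    len(tables), rows, cols, indexes)

# SQLForeignKeys isn't implemented by the Access ODBC driver (see the
# footnotes above), so this raises pyodbc.Error instead of giving a count.
try:
    fk_count = len(list(cur.foreignKeys(table=tables[0])))
except pyodbc.Error:
    fk_count = None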






--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-03 Thread Tim Chase
On 2016-05-03 13:00, DFS wrote:
> On 5/3/2016 11:28 AM, Tim Chase wrote:
> > On 2016-05-03 00:24, DFS wrote:
> >> One small comparison I was able to make was VBA vs python/pyodbc
> >> to summarize an Access database.  Not quite a fair test, but
> >> interesting nonetheless.
> >>
> >> Access 2003 file
> >> Access 2003 VBA code
> >> Time: 0.18 seconds
> >>
> >> same Access 2003 file
> >> 32-bit python 2.7.11 + 32-bit pyodbc 3.0.6
> >> Time: 0.49 seconds
> >
> > Curious whether you're forcing Access VBA to talk over ODBC or
> > whether Access is using native access/file-handling (and thus
> > bypassing the ODBC overhead)?
> 
> The latter, which is why I said "not quite a fair test".

Can you try the same tests, getting Access/VBA to use ODBC instead to
see how much overhead ODBC entails?

-tkc



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-03 Thread DFS

On 5/3/2016 11:28 AM, Tim Chase wrote:

On 2016-05-03 00:24, DFS wrote:

One small comparison I was able to make was VBA vs python/pyodbc to
summarize an Access database.  Not quite a fair test, but
interesting nonetheless.

Access 2003 file
Access 2003 VBA code
Time: 0.18 seconds

same Access 2003 file
32-bit python 2.7.11 + 32-bit pyodbc 3.0.6
Time: 0.49 seconds


Curious whether you're forcing Access VBA to talk over ODBC or
whether Access is using native access/file-handling (and thus
bypassing the ODBC overhead)?



The latter, which is why I said "not quite a fair test".


--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-03 Thread Tim Chase
On 2016-05-03 00:24, DFS wrote:
> One small comparison I was able to make was VBA vs python/pyodbc to 
> summarize an Access database.  Not quite a fair test, but
> interesting nonetheless.
> 
> Access 2003 file
> Access 2003 VBA code
> Time: 0.18 seconds
>
> same Access 2003 file
> 32-bit python 2.7.11 + 32-bit pyodbc 3.0.6
> Time: 0.49 seconds

Curious whether you're forcing Access VBA to talk over ODBC or
whether Access is using native access/file-handling (and thus
bypassing the ODBC overhead)?

-tkc


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/3/2016 12:06 AM, Michael Torrie wrote:


Now if you want to talk about processing the data once you have it,
there we can talk about speeds and optimization.


Be glad to.  Helps me learn python, so bring whatever challenge you want 
and I'll try to keep up.


One small comparison I was able to make was VBA vs python/pyodbc to 
summarize an Access database.  Not quite a fair test, but interesting 
nonetheless.


---

Access 2003 file
Access 2003 VBA code

2,099,101 rows
114 tables  (max row = 600288)
971 columns
  text:  503
  boolean:   4
  numeric:   351
  date-time: 108
  binary:    5
309 indexes (25 foreign keys)
333,549,568 bytes on disk
Time: 0.18 seconds

---

same Access 2003 file
32-bit python 2.7.11 + 32-bit pyodbc 3.0.6

2,099,101 rows
114 tables (max row = 600288)
971 columns
  text:  503
  numeric:   351
  date-time: 108
  binary:    5
  boolean:   4
309 indexes (foreign keys na via ODBC*)
333,549,568 bytes on disk
Time: 0.49 seconds

* the Access ODBC driver doesn't support
  the SQLForeignKeys function

---

--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Michael Torrie
On 05/02/2016 01:37 AM, DFS wrote:
> So python matches or beats VBScript at this much larger file.  Kewl.

If you download something large enough to be meaningful, you'll find the
runtime speeds should all converge to something showing your internet
connection speed.  Try downloading a 4 GB file, for example.  You're
trying to benchmark an io-bound operation.  After you move past the very
small and meaningless examples that simply benchmark the overhead of the
connection building, you'll find that all languages, even compiled
languages like C, should run at the same speed on average.  Neither VBS
nor Python will be faster than the other.

Now if you want to talk about processing the data once you have it,
there we can talk about speeds and optimization.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 10:00 PM, Chris Angelico wrote:

On Tue, May 3, 2016 at 11:51 AM, DFS  wrote:

On 5/2/2016 3:19 AM, Chris Angelico wrote:


There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).



100 loops
Finished VBScript in 3.953 seconds
Finished VBScript in 3.608 seconds
Finished VBScript in 3.610 seconds

Bit of a per-loop speedup going from 10 to 100.


How many seconds was it for 10 loops?

ChrisA


~0.44


--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Chris Angelico
On Tue, May 3, 2016 at 11:51 AM, DFS  wrote:
> On 5/2/2016 3:19 AM, Chris Angelico wrote:
>
>> There's an easier way to test if there's caching happening. Just crank
>> the iterations up from 10 to 100 and see what happens to the times. If
>> your numbers are perfectly fair, they should be perfectly linear in
>> the iteration count; eg a 1.8 second ten-iteration loop should become
>> an 18 second hundred-iteration loop. Obviously they won't be exactly
>> that, but I would expect them to be reasonably close (eg 17-19
>> seconds, but not 2 seconds).
>
>
> 100 loops
> Finished VBScript in 3.953 seconds
> Finished VBScript in 3.608 seconds
> Finished VBScript in 3.610 seconds
>
> Bit of a per-loop speedup going from 10 to 100.

How many seconds was it for 10 loops?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 3:19 AM, Chris Angelico wrote:


There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).


100 loops
Finished VBScript in 3.953 seconds
Finished VBScript in 3.608 seconds
Finished VBScript in 3.610 seconds

Bit of a per-loop speedup going from 10 to 100.



Then the next thing to test would be to create a deliberately-slow web
server, and connect to that. Put a two-second delay into it, to
simulate a distant or overloaded server, and see if your logs show the
correct result. Something like this:



import time
try:
    import http.server as BaseHTTPServer # Python 3
except ImportError:
    import BaseHTTPServer # Python 2

class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type","text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(2)
        self.wfile.write(b"world!")

server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
server.serve_forever()

---

Test that with a web browser or command-line downloader (go to
http://127.0.0.1:1234/), and make sure that (a) it produces "Hello,
world!", and (b) it takes two seconds. Then set your test scripts to
downloading that URL. (Be sure to set them back to low iteration
counts first!) If the times are true and fair, they should all come
out pretty much the same - ten iterations, twenty seconds. And since
all that's changed is the server, this will be an accurate
demonstration of what happens in the real world: network requests
aren't always fast. Incidentally, you can also watch the server's log
to see if it's getting the appropriate number of requests.

It may turn out that changing the web server actually materially
changes your numbers. Comment out the sleep call and try it again -
you might find that your numbers come closer together, because this
naive server doesn't send back 304 NOT MODIFIED responses or anything.
Again, though, this would prove that you're not actually measuring
language performance, because the tests are more dependent on the
server than the client.

Even if the files themselves aren't being cached, you might find that
DNS is. So if you truly want to eliminate variables, replace the name
in your URL with an IP address. It's another thing that might mess
with your timings, without actually being a language feature.

Networking has about four billion variables in it. You're messing with
one of the least significant: the programming language :)

ChrisA



Thanks for the good feedback.


--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 4:42 AM, Peter Otten wrote:

DFS wrote:


Is VB using a local web cache, and Python not?


I'm not specifying a local web cache with either (wouldn't know how or
where to look).  If you have Windows, you can try it.


I don't have Windows, but if I'm to believe

http://stackoverflow.com/questions/5235464/how-to-make-microsoft-xmlhttprequest-honor-cache-control-directive

the page is indeed cached and you can disable caching with


Option Explicit
Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
webpage = "http://econpy.pythonanywhere.com/ex/001.html";
webfile  = "D:\econpy001.html"
startTime = Timer
For i = 1 to 10
Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
xmlHTTP.Open "GET", webpage


  xmlHTTP.setRequestHeader "Cache-Control", "max-age=0"



Tried that, and from later on that stackoverflow page:

xmlHTTP.setRequestHeader "Cache-Control", "private"

Neither made a difference.  In fact, I saw faster times than ever - as 
low as 0.41 for 10 loops.

--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Tim Chase
On 2016-05-02 00:06, DFS wrote:
> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for
> 10 iterations, vs 0.88 for python.

In addition to the other debugging recommendations in sibling
threads, a couple other things to try:

1) use a local debugging proxy so that you can compare the headers to
see if anything stands out

2) in light of #1, can you confirm/deny whether one is using gzip
compression and the other isn't?

-tkc




-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Peter Otten
DFS wrote:

>> Is VB using a local web cache, and Python not?
> 
> I'm not specifying a local web cache with either (wouldn't know how or
> where to look).  If you have Windows, you can try it.

I don't have Windows, but if I'm to believe

http://stackoverflow.com/questions/5235464/how-to-make-microsoft-xmlhttprequest-honor-cache-control-directive

the page is indeed cached and you can disable caching with

> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html"
> webfile  = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
> Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
> xmlHTTP.Open "GET", webpage
  
  xmlHTTP.setRequestHeader "Cache-Control", "max-age=0"

> xmlHTTP.Send
> Set fso = CreateObject("Scripting.FileSystemObject")
> Set fOut = fso.CreateTextFile(webfile, True)
> fOut.WriteLine xmlHTTP.ResponseText
> fOut.Close
> Set fOut= Nothing
> Set fso = Nothing
> Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime -
> startTime,3) & " seconds"
> ---
> save it to a .vbs file and run it like this:
> $cscript /nologo filename.vbs
> 
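For the Python side of the same comparison, a similarly cache-averse
request can be made by sending the header yourself; a minimal sketch with
the requests library (the header value mirrors the max-age=0 line
inserted above, otherwise this is just an untested illustration):

import requests

webpage = "http://econpy.pythonanywhere.com/ex/001.html"
# ask any cache along the way not to serve a stored copy
r = requests.get(webpage, headers={"Cache-Control": "max-age=0"})
open("econpy001.html", "w").write(r.content)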


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Stephen Hansen
On Mon, May 2, 2016, at 12:37 AM, DFS wrote:
> On 5/2/2016 2:27 AM, Stephen Hansen wrote:
> > I'm again going back to the point of: it's fast enough. When comparing
> > two small numbers, "twice as slow" is meaningless.
> 
> Speed is always meaningful.
> 
> I know python is relatively slow, but it's a cool, concise, powerful 
> language.  I'm extremely impressed by how tight the code can get.

I'm sorry, but no. Speed is not always meaningful. 

It's not even usually meaningful, because you can't quantify what "speed"
is. In context, you're claiming this is twice as slow (even though my
tests show dramatically better performance), but what details are
different?

You're ignoring the fact that Python might have a constant overhead --
meaning, for a 1k download, it might have X speed cost. For a 1meg
download, it might still have the exact same X cost.

Looking narrowly, that overhead looks like "twice as slow", but that's
not meaningful at all. Looking larger, that overhead is a pittance.

You aren't measuring that.
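To put toy numbers on that (all figures below are invented purely to
illustrate the point, they are not measurements):

# hypothetical fixed cost per request, plus hypothetical pure transfer times
overhead = 0.04
transfer = {"3.5 KB page": 0.004, "58 KB page": 0.520}

for name, t in sorted(transfer.items()):
    print "%-12s %.3fs + %.3fs overhead = %.3fs (%.0f%% slower than bare transfer)" % (
        name, t, overhead, t + overhead, 100.0 * overhead / t)

On made-up numbers like these the tiny page looks many times slower while
the bigger one is barely affected - which is the overhead argument in a
nutshell.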

> > You have an assumption you haven't answered, that downloading a 10 meg
> > file will be twice as slow as downloading this tiny file. You haven't
> > proven that at all.
> 
> True.  And it has been my assumption - tho not with 10MB file.

And that assumption is completely invalid.

> I noticed urllib and curl returned the html as is, but urllib2 and 
> requests added enhancements that should make the data easier to parse. 
> Based on speed and functionality and documentation, I believe I'll be 
> using the requests HTTP library (I will actually be doing a small amount 
> of web scraping).

The requests library's added-value is ease-of-use, and its overhead is
likely tiny: so using it means you spend less effort making a thing
happen. I recommend you embrace this. 

> VBScript
> 1st run: 7.70 seconds
> 2nd run: 5.38
> 3rd run: 7.71
> 
> So python matches or beats VBScript at this much larger file.  Kewl.

This is what I'm talking about: Python might have a constant overhead,
but looking at larger operations, it's probably comparable. Not fast,
mind you. Python isn't the fastest language out there. But in real-world
work, it's usually fast enough.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread DFS

On 5/2/2016 2:27 AM, Stephen Hansen wrote:

On Sun, May 1, 2016, at 10:59 PM, DFS wrote:

startTime = time.clock()
for i in range(loops):
    r = urllib2.urlopen(webpage)
    f = open(webfile,"w")
    f.write(r.read())
    f.close()
endTime = time.clock()
print "Finished urllib2 in %.2g seconds" %(endTime-startTime)


Yeah on my system I get 1.8 out of this, amounting to 0.18s.


You get 1.8 seconds total for the 10 loops?  That's less than half as 
fast as my results.  Surprising.




I'm again going back to the point of: it's fast enough. When comparing
two small numbers, "twice as slow" is meaningless.


Speed is always meaningful.

I know python is relatively slow, but it's a cool, concise, powerful 
language.  I'm extremely impressed by how tight the code can get.




You have an assumption you haven't answered, that downloading a 10 meg
file will be twice as slow as downloading this tiny file. You haven't
proven that at all.


True.  And it has been my assumption - tho not with 10MB file.



I suspect you have a constant overhead of X, and in this toy example,
that makes it seem twice as slow. But when downloading a file of size,
you'll have the same constant factor, at which point the difference is
irrelevant.


Good point.  Test below.



If you believe otherwise, demonstrate it.


http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2

It's a 58854 byte file when saved to disk (smaller file was 3546 bytes), 
so this is 16.6x larger.  So I would expect python to linearly run in 
16.6 * 0.88 = 14.6 seconds.


10 loops per run

1st run
$ python timeGetHTML.py
Finished urllib in 8.5 seconds
Finished urllib2 in 5.6 seconds
Finished requests in 7.8 seconds
Finished pycurl in 6.5 seconds

wait a couple minutes, then 2nd run
$ python timeGetHTML.py
Finished urllib in 5.6 seconds
Finished urllib2 in 5.7 seconds
Finished requests in 5.2 seconds
Finished pycurl in 6.4 seconds

It's a little more than 1/3 of my estimate - so good news.

(when I was doing these tests, some of the python results were 0.75 
seconds - way too fast, so I checked and no data was written to file, 
and I couldn't even open the webpage with a browser.  Looks like I had 
been temporarily blocked from the site.  After a couple minutes, I was 
able to access it again).


I noticed urllib and curl returned the html as is, but urllib2 and 
requests added enhancements that should make the data easier to parse. 
Based on speed and functionality and documentation, I believe I'll be 
using the requests HTTP library (I will actually be doing a small amount 
of web scraping).
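Since the plan is only a small amount of scraping, a minimal sketch of
what that might look like with requests plus the standard-library parser
(the link-collecting parser is purely illustrative, not the actual task):

import requests
from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

html = requests.get("http://econpy.pythonanywhere.com/ex/001.html").text
p = LinkCollector()
p.feed(html)
print "%d links found" % len(p.links)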



VBScript
1st run: 7.70 seconds
2nd run: 5.38
3rd run: 7.71

So python matches or beats VBScript at this much larger file.  Kewl.


--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-02 Thread Chris Angelico
On Mon, May 2, 2016 at 4:47 PM, DFS  wrote:
> I'm not specifying a local web cache with either (wouldn't know how or where
> to look).  If you have Windows, you can try it.
> ---
> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html"
> webfile  = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
>  Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
>  xmlHTTP.Open "GET", webpage
>  xmlHTTP.Send
>  Set fso = CreateObject("Scripting.FileSystemObject")
>  Set fOut = fso.CreateTextFile(webfile, True)
>   fOut.WriteLine xmlHTTP.ResponseText
>  fOut.Close
>  Set fOut= Nothing
>  Set fso = Nothing
>  Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime - startTime,3) &
> " seconds"
> ---

There's an easier way to test if there's caching happening. Just crank
the iterations up from 10 to 100 and see what happens to the times. If
your numbers are perfectly fair, they should be perfectly linear in
the iteration count; eg a 1.8 second ten-iteration loop should become
an 18 second hundred-iteration loop. Obviously they won't be exactly
that, but I would expect them to be reasonably close (eg 17-19
seconds, but not 2 seconds).

Then the next thing to test would be to create a deliberately-slow web
server, and connect to that. Put a two-second delay into it, to
simulate a distant or overloaded server, and see if your logs show the
correct result. Something like this:



import time
try:
    import http.server as BaseHTTPServer # Python 3
except ImportError:
    import BaseHTTPServer # Python 2

class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type","text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(2)
        self.wfile.write(b"world!")

server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
server.serve_forever()

---

Test that with a web browser or command-line downloader (go to
http://127.0.0.1:1234/), and make sure that (a) it produces "Hello,
world!", and (b) it takes two seconds. Then set your test scripts to
downloading that URL. (Be sure to set them back to low iteration
counts first!) If the times are true and fair, they should all come
out pretty much the same - ten iterations, twenty seconds. And since
all that's changed is the server, this will be an accurate
demonstration of what happens in the real world: network requests
aren't always fast. Incidentally, you can also watch the server's log
to see if it's getting the appropriate number of requests.
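A minimal client-side timing loop for that check might look like this
(just a sketch, Python 2 to match the scripts above; wall-clock time is
used because the delay is a server-side sleep):

import time
import urllib2

start = time.time()
for i in range(10):
    urllib2.urlopen("http://127.0.0.1:1234/").read()
print "10 fetches from the slow server took %.1f seconds" % (time.time() - start)
# with the two-second sleep in place this should report roughly 20 seconds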

It may turn out that changing the web server actually materially
changes your numbers. Comment out the sleep call and try it again -
you might find that your numbers come closer together, because this
naive server doesn't send back 304 NOT MODIFIED responses or anything.
Again, though, this would prove that you're not actually measuring
language performance, because the tests are more dependent on the
server than the client.

Even if the files themselves aren't being cached, you might find that
DNS is. So if you truly want to eliminate variables, replace the name
in your URL with an IP address. It's another thing that might mess
with your timings, without actually being a language feature.
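One way to take DNS out of the per-iteration timing is to resolve the
name once up front; a sketch (note that for name-based virtual hosting
you would still need to send the original Host header, so this is only
illustrative):

import socket

host = "econpy.pythonanywhere.com"
ip = socket.gethostbyname(host)   # one DNS lookup, done before any timing loop
print host, "->", ip
# a timing loop could then request "http://%s/ex/001.html" % ip while
# sending "Host: econpy.pythonanywhere.com" so the right site still answers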

Networking has about four billion variables in it. You're messing with
one of the least significant: the programming language :)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread DFS

On 5/2/2016 2:05 AM, Steven D'Aprano wrote:

On Monday 02 May 2016 15:00, DFS wrote:


I tried the 10-loop test several times with all versions.

The results were 100% consistent: VBSCript xmlHTTP was always 2x faster
than any python method.



Are you absolutely sure you're comparing the same job in two languages?


As near as I can tell.  In VBScript I'm actually dereferencing various 
objects (that adds to the time), but I don't do that in python.  I don't 
know enough to even know if it's necessary, or good practice, or what.






Is VB using a local web cache, and Python not?


I'm not specifying a local web cache with either (wouldn't know how or 
where to look).  If you have Windows, you can try it.

---
Option Explicit
Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
webpage = "http://econpy.pythonanywhere.com/ex/001.html";
webfile  = "D:\econpy001.html"
startTime = Timer
For i = 1 to 10
 Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
 xmlHTTP.Open "GET", webpage
 xmlHTTP.Send
 Set fso = CreateObject("Scripting.FileSystemObject")
 Set fOut = fso.CreateTextFile(webfile, True)
  fOut.WriteLine xmlHTTP.ResponseText
 fOut.Close
 Set fOut= Nothing
 Set fso = Nothing
 Set xmlHTTP = Nothing
Next
endTime = Timer
wscript.echo "Finished VBScript in " & FormatNumber(endTime - 
startTime,3) & " seconds"

---
save it to a .vbs file and run it like this:
$cscript /nologo filename.vbs



Are you saving files with both
tests? To the same local drive? (To ensure you aren't measuring the
difference between "write this file to a slow IDE hard disk, write that file
to a fast SSD".)


Identical functionality (retrieve webpage, write html to file).  Same 
webpage, written to the same folder on the same hard drive (not SSD).


The 10 file writes (open/write/close) don't make a meaningful difference 
at all:

VBScript 0.0156 seconds
urllib2  0.0034 seconds

This file is 3.55K.



Once you are sure that you are comparing the same task in two languages,
then make sure the measurement is meaningful. If you change from a (let's
say) 1 KB file to a 100 KB file, do you see the same 2 x difference? What if
you increase it to a 1 MB file?


Do you know a webpage I can hit 10x repeatedly to download a good size 
file?  I'm always paranoid they'll block me thinking I'm a 
"professional" web scraper or something.


Thanks


--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Stephen Hansen
On Sun, May 1, 2016, at 10:59 PM, DFS wrote:
> startTime = time.clock()
> for i in range(loops):
>   r = urllib2.urlopen(webpage)
>   f = open(webfile,"w")
>   f.write(r.read())
>   f.close()
> endTime = time.clock()  
> print "Finished urllib2 in %.2g seconds" %(endTime-startTime)

Yeah on my system I get 1.8 out of this, amounting to 0.18s. 

I'm again going back to the point of: it's fast enough. When comparing
two small numbers, "twice as slow" is meaningless.

You have an assumption you haven't answered, that downloading a 10 meg
file will be twice as slow as downloading this tiny file. You haven't
proven that at all. 

I suspect you have a constant overhead of X, and in this toy example,
that makes it seem twice as slow. But when downloading a file of size,
you'll have the same constant factor, at which point the difference is
irrelevant. 

If you believe otherwise, demonstrate it.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Steven D'Aprano
On Monday 02 May 2016 15:00, DFS wrote:

> I tried the 10-loop test several times with all versions.
> 
> The results were 100% consistent: VBSCript xmlHTTP was always 2x faster
> than any python method.


Are you absolutely sure you're comparing the same job in two languages? Is 
VB using a local web cache, and Python not? Are you saving files with both 
tests? To the same local drive? (To ensure you aren't measuring the 
difference between "write this file to a slow IDE hard disk, write that file 
to a fast SSD".)

Once you are sure that you are comparing the same task in two languages, 
then make sure the measurement is meaningful. If you change from a (let's 
say) 1 KB file to a 100 KB file, do you see the same 2 x difference? What if 
you increase it to a 1 MB file?


-- 
Steve

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Steven D'Aprano
On Monday 02 May 2016 15:04, DFS wrote:

> 0.2 is half as fast as 0.1, here.
> 
> And two small numbers turn into bigger numbers when the webpage is big,
> and soon the download time differences are measured in minutes, not half
> a second.

It takes twice as long to screw a screw into timber as to hammer a nail 
into the same timber.

Therefore if builders change from nails to screws, they can finish building 
the house in half the time.



-- 
Steve

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread DFS

On 5/2/2016 1:15 AM, Stephen Hansen wrote:

On Sun, May 1, 2016, at 10:00 PM, DFS wrote:

I tried the 10-loop test several times with all versions.


Also how, _exactly_, are you testing this?

C:\Python27>python -m timeit "filename='C:\\test.txt';
webpage='http://econpy.pythonanywhere.com/ex/001.html'; import urllib2;
r = urllib2.urlopen(webpage); f = open(filename, 'w');
f.write(r.read()); f.close();"
10 loops, best of 3: 175 msec per loop

That's a whole lot less than 0.88 secs.


Indeed.


-
import requests, urllib, urllib2, pycurl
import time

webpage = "http://econpy.pythonanywhere.com/ex/001.html"
webfile = "D:\\econpy001.html"
loops   = 10

startTime = time.clock()
for i in range(loops):
    urllib.urlretrieve(webpage,webfile)
endTime = time.clock()
print "Finished urllib in %.2g seconds" %(endTime-startTime)

startTime = time.clock()
for i in range(loops):
    r = urllib2.urlopen(webpage)
    f = open(webfile,"w")
    f.write(r.read())
    f.close()
endTime = time.clock()
print "Finished urllib2 in %.2g seconds" %(endTime-startTime)

startTime = time.clock()
for i in range(loops):
    r = requests.get(webpage)
    f = open(webfile,"w")
    f.write(r.text)
    f.close()
endTime = time.clock()
print "Finished requests in %.2g seconds" %(endTime-startTime)

startTime = time.clock()
for i in range(loops):
    with open(webfile + str(i) + ".txt", 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, webpage)
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()
endTime = time.clock()
print "Finished pycurl in %.2g seconds" %(endTime-startTime)
-

$ python getHTML.py
Finished urllib in 0.88 seconds
Finished urllib2 in 0.83 seconds
Finished requests in 0.89 seconds
Finished pycurl in 1.1 seconds

Those results are consistent.  They go up or down a little, but never 
below 0.82 seconds (for urllib2), or above 1.2 seconds (for pycurl)


VBScript is consistently 0.44 to 0.48

--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Stephen Hansen
On Sun, May 1, 2016, at 10:04 PM, DFS wrote:
> And two small numbers turn into bigger numbers when the webpage is big, 
> and soon the download time differences are measured in minutes, not half 
> a second.

Are you sure of that? Have you determined that the time is not a
constant overhead verses that the time is directly relational to the
size of the page? If so, how have you determined that?

You aren't showing how you're testing. A 0.4s difference is meaningless to
me if it's a constant overhead. If it's twice as slow for a 1 meg file,
then you might have an issue. Maybe. You haven't shown that.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Stephen Hansen
On Sun, May 1, 2016, at 10:00 PM, DFS wrote:
> I tried the 10-loop test several times with all versions.

Also how, _exactly_, are you testing this?

C:\Python27>python -m timeit "filename='C:\\test.txt';
webpage='http://econpy.pythonanywhere.com/ex/001.html'; import urllib2;
r = urllib2.urlopen(webpage); f = open(filename, 'w');
f.write(r.read()); f.close();"
10 loops, best of 3: 175 msec per loop

That's a whole lot less than 0.88 secs.
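(For reference, roughly the same measurement can be scripted with the
timeit module instead of the command line; this is just a sketch:)

import timeit

stmt = """
import urllib2
r = urllib2.urlopen('http://econpy.pythonanywhere.com/ex/001.html')
f = open('C:/test.txt', 'w')   # forward slash works fine on Windows
f.write(r.read())
f.close()
"""
# number=10 runs the whole statement ten times, like the 10-loop tests above
print "%.3f seconds for 10 iterations" % timeit.timeit(stmt, number=10)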

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Chris Angelico
On Mon, May 2, 2016 at 3:04 PM, DFS  wrote:
> And two small numbers turn into bigger numbers when the webpage is big, and
> soon the download time differences are measured in minutes, not half a
> second.
>
> So, any ideas?

So, measure with bigger web pages, and find out whether it's really a
2:1 ratio or a half-second difference. When download times are
measured in minutes, a half second difference is insignificant.

Extrapolating is dangerous.
https://xkcd.com/605/

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread DFS

On 5/2/2016 1:00 AM, Stephen Hansen wrote:

On Sun, May 1, 2016, at 09:50 PM, DFS wrote:

On 5/2/2016 12:40 AM, Chris Angelico wrote:

On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen  wrote:

On Sun, May 1, 2016, at 09:06 PM, DFS wrote:

Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
iterations, vs 0.88 for python.

...

I know it's asking a lot, but is there a really fast AND really short
python solution for this simple thing?


0.88 is not fast enough for you? That's less than a second.


Also, this is timings of network and disk operations. Unless something
pathological is happening, the language used won't make any
difference.

ChrisA



Unfortunately, the VBScript is twice as fast as any python method.


And 0.2 is twice as fast as 0.1. When you have two small numbers, 'twice
as fast' isn't particularly meaningful as a metric.


0.2 is half as fast as 0.1, here.

And two small numbers turn into bigger numbers when the webpage is big, 
and soon the download time differences are measured in minutes, not half 
a second.


So, any ideas?
--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread DFS

On 5/2/2016 12:49 AM, Ben Finney wrote:

DFS  writes:


Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
iterations, vs 0.88 for python.

[…]

urllib2 and requests were about the same speed as urllib.urlretrieve,
while pycurl was significantly slower (1.2 seconds).


Network access is notoriously erratic in its timing. The program, and
the machine on which it runs, is subject to a great many external
effects once the request is sent — effects which will significantly
alter the delay before a response is completed.

How have you controlled for the wide variability in the duration, for
even a given request by the *same code on the same machine*, at
different points in time?

One simple way to do that: Run the exact same test many times (say,
10 000 or so) on the same machine, and then compute the average of all
the durations.

Do the same for each different program, and then you may have more
meaningfully comparable measurements.



I tried the 10-loop test several times with all versions.

The results were 100% consistent: VBSCript xmlHTTP was always 2x faster 
than any python method.




--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Stephen Hansen
On Sun, May 1, 2016, at 09:50 PM, DFS wrote:
> On 5/2/2016 12:40 AM, Chris Angelico wrote:
> > On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen  wrote:
> >> On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
> >>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
> >>> iterations, vs 0.88 for python.
> >> ...
> >>> I know it's asking a lot, but is there a really fast AND really short
> >>> python solution for this simple thing?
> >>
> >> 0.88 is not fast enough for you? That's less than a second.
> >
> > Also, this is timings of network and disk operations. Unless something
> > pathological is happening, the language used won't make any
> > difference.
> >
> > ChrisA
> 
> 
> Unfortunately, the VBScript is twice as fast as any python method.

And 0.2 is twice as fast as 0.1. When you have two small numbers, 'twice
as fast' isn't particularly meaningful as a metric. 

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread DFS

On 5/2/2016 12:40 AM, Chris Angelico wrote:

On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen  wrote:

On Sun, May 1, 2016, at 09:06 PM, DFS wrote:

Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
iterations, vs 0.88 for python.

...

I know it's asking a lot, but is there a really fast AND really short
python solution for this simple thing?


0.88 is not fast enough for you? That's less than a second.


Also, this is timings of network and disk operations. Unless something
pathological is happening, the language used won't make any
difference.

ChrisA



Unfortunately, the VBScript is twice as fast as any python method.




--
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Chris Angelico
On Mon, May 2, 2016 at 2:49 PM, Ben Finney  wrote:
> One simple way to do that: Run the exact same test many times (say,
> 10 000 or so) on the same machine, and then compute the average of all
> the durations.
>
> Do the same for each different program, and then you may have more
> meaningfully comparable measurements.

And also find the minimum and maximum durations, too. Averages don't
always tell the whole story.
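A small sketch of that kind of repeated measurement (min / average / max;
the URL is the test page from earlier in the thread and the run count is
kept deliberately small here):

import time
import urllib2

URL = "http://econpy.pythonanywhere.com/ex/001.html"
RUNS = 100   # Ben suggested ~10,000; a smaller number keeps the sketch quick

durations = []
for _ in range(RUNS):
    start = time.time()
    urllib2.urlopen(URL).read()
    durations.append(time.time() - start)

print "min %.3fs  avg %.3fs  max %.3fs" % (
    min(durations), sum(durations) / len(durations), max(durations))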

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Ben Finney
DFS  writes:

> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
> iterations, vs 0.88 for python.
>
> […]
>
> urllib2 and requests were about the same speed as urllib.urlretrieve,
> while pycurl was significantly slower (1.2 seconds).

Network access is notoriously erratic in its timing. The program, and
the machine on which it runs, is subject to a great many external
effects once the request is sent — effects which will significantly
alter the delay before a response is completed.

How have you controlled for the wide variability in the duration, for
even a given request by the *same code on the same machine*, at
different points in time?

One simple way to do that: Run the exact same test many times (say,
10 000 or so) on the same machine, and then compute the average of all
the durations.

Do the same for each different program, and then you may have more
meaningfully comparable measurements.

-- 
 \ “We are no more free to believe whatever we want about God than |
  `\ we are free to adopt unjustified beliefs about science or |
_o__)  history […].” —Sam Harris, _The End of Faith_, 2004 |
Ben Finney

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Chris Angelico
On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen  wrote:
> On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
>> iterations, vs 0.88 for python.
> ...
>> I know it's asking a lot, but is there a really fast AND really short
>> python solution for this simple thing?
>
> 0.88 is not fast enough for you? That's less than a second.

Also, this is timings of network and disk operations. Unless something
pathological is happening, the language used won't make any
difference.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fastest way to retrieve and write html contents to file

2016-05-01 Thread Stephen Hansen
On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10 
> iterations, vs 0.88 for python.
...
> I know it's asking a lot, but is there a really fast AND really short 
> python solution for this simple thing?

0.88 is not fast enough for you? That's less than a second.

-- 
Stephen Hansen
  m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list