Re: Fastest way to retrieve and write html contents to file
On 5/3/2016 2:41 PM, Tim Chase wrote:
> On 2016-05-03 13:00, DFS wrote:
>> On 5/3/2016 11:28 AM, Tim Chase wrote:
>>> On 2016-05-03 00:24, DFS wrote:
>>>> One small comparison I was able to make was VBA vs python/pyodbc
>>>> to summarize an Access database. Not quite a fair test, but
>>>> interesting nonetheless.
>>>>
>>>> Access 2003 file
>>>> Access 2003 VBA code
>>>> Time: 0.18 seconds
>>>>
>>>> same Access 2003 file
>>>> 32-bit python 2.7.11 + 32-bit pyodbc 3.0.6
>>>> Time: 0.49 seconds
>>>
>>> Curious whether you're forcing Access VBA to talk over ODBC or
>>> whether Access is using native access/file-handling (and thus
>>> bypassing the ODBC overhead)?
>>
>> The latter, which is why I said "not quite a fair test".
>
> Can you try the same tests, getting Access/VBA to use ODBC instead to
> see how much overhead ODBC entails?
>
> -tkc

Done. I dropped a few extraneous tables from the database (was 114 tables):

Access 2003 .mdb file
2,009,164 rows
97 tables (max row = 600288)
725 columns
  text: 389   boolean: 4   numeric: 261   date-time: 69   binary: 2
264 indexes (25 foreign keys)*
299,167,744 bytes on disk

1. DAO
   Time: 0.15 seconds

2. ADODB, Access ODBC driver, OpenSchema method**
   Time: 0.26 seconds

3. python, pyodbc, Access ODBC driver
   Time: 0.42 seconds

* despite being written by Microsoft, the Access ODBC driver doesn't support the ODBC SQLForeignKeys function, so the python code doesn't show a count of foreign keys

** the Access ODBC driver doesn't support the adSchemaIndexes or adSchemaForeignKeys query types, so I used DAO code to count indexes and foreign keys.
-- 
https://mail.python.org/mailman/listinfo/python-list
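[A summary like the one above is mostly catalog lookups (pyodbc's cursor.tables() and cursor.columns()) plus a COUNT(*) per table. Since the Access ODBC driver only exists on Windows, here is a portable sketch of the same counting logic using the stdlib sqlite3 module as a stand-in; the schema and numbers are invented for illustration, not taken from the thread's database.]

```python
import sqlite3

def summarize(conn):
    """Count tables, rows, columns, and indexes in a SQLite database.

    Mirrors the schema-summary idea from the thread: enumerate tables
    from the catalog, then count rows and columns per table.
    """
    cur = conn.cursor()
    tables = [r[0] for r in cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    total_rows = 0
    total_cols = 0
    for t in tables:
        # Identifiers can't be parameterized; fine for a sketch on
        # trusted table names, not for untrusted input.
        total_rows += cur.execute('SELECT COUNT(*) FROM "%s"' % t).fetchone()[0]
        total_cols += len(cur.execute('PRAGMA table_info("%s")' % t).fetchall())
    indexes = cur.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type='index'").fetchone()[0]
    return {"tables": len(tables), "rows": total_rows,
            "columns": total_cols, "indexes": indexes}

if __name__ == "__main__":
    # Tiny invented schema, just to exercise the function.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (a INTEGER, b TEXT)")
    conn.execute("CREATE TABLE t2 (x REAL)")
    conn.execute("CREATE INDEX ix ON t1 (a)")
    conn.execute("INSERT INTO t1 VALUES (1, 'one')")
    print(summarize(conn))
```

[With pyodbc the equivalent enumeration would come from cursor.tables(tableType='TABLE') and cursor.columns(table=name), with the same COUNT(*) loop.]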
Re: Fastest way to retrieve and write html contents to file
On 2016-05-03 13:00, DFS wrote: > On 5/3/2016 11:28 AM, Tim Chase wrote: > > On 2016-05-03 00:24, DFS wrote: > >> One small comparison I was able to make was VBA vs python/pyodbc > >> to summarize an Access database. Not quite a fair test, but > >> interesting nonetheless. > >> > >> Access 2003 file > >> Access 2003 VBA code > >> Time: 0.18 seconds > >> > >> same Access 2003 file > >> 32-bit python 2.7.11 + 32-bit pyodbc 3.0.6 > >> Time: 0.49 seconds > > > > Curious whether you're forcing Access VBA to talk over ODBC or > > whether Access is using native access/file-handling (and thus > > bypassing the ODBC overhead)? > > The latter, which is why I said "not quite a fair test". Can you try the same tests, getting Access/VBA to use ODBC instead to see how much overhead ODBC entails? -tkc -- https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 5/3/2016 11:28 AM, Tim Chase wrote: On 2016-05-03 00:24, DFS wrote: One small comparison I was able to make was VBA vs python/pyodbc to summarize an Access database. Not quite a fair test, but interesting nonetheless. Access 2003 file Access 2003 VBA code Time: 0.18 seconds same Access 2003 file 32-bit python 2.7.11 + 32-bit pyodbc 3.0.6 Time: 0.49 seconds Curious whether you're forcing Access VBA to talk over ODBC or whether Access is using native access/file-handling (and thus bypassing the ODBC overhead)? The latter, which is why I said "not quite a fair test". -- https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 2016-05-03 00:24, DFS wrote: > One small comparison I was able to make was VBA vs python/pyodbc to > summarize an Access database. Not quite a fair test, but > interesting nonetheless. > > Access 2003 file > Access 2003 VBA code > Time: 0.18 seconds > > same Access 2003 file > 32-bit python 2.7.11 + 32-bit pyodbc 3.0.6 > Time: 0.49 seconds Curious whether you're forcing Access VBA to talk over ODBC or whether Access is using native access/file-handling (and thus bypassing the ODBC overhead)? -tkc -- https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 5/3/2016 12:06 AM, Michael Torrie wrote: Now if you want to talk about processing the data once you have it, there we can talk about speeds and optimization. Be glad to. Helps me learn python, so bring whatever challenge you want and I'll try to keep up. One small comparison I was able to make was VBA vs python/pyodbc to summarize an Access database. Not quite a fair test, but interesting nonetheless. --- Access 2003 file Access 2003 VBA code 2,099,101 rows 114 tables (max row = 600288) 971 columns text: 503 boolean: 4 numeric: 351 date-time: 108 binary:5 309 indexes (25 foreign keys) 333,549,568 bytes on disk Time: 0.18 seconds --- same Access 2003 file 32-bit python 2.7.11 + 32-bit pyodbc 3.0.6 2,099,101 rows 114 tables (max row = 600288) 971 columns text: 503 numeric: 351 date-time: 108 binary:5 boolean: 4 309 indexes (foreign keys na via ODBC*) 333,549,568 bytes on disk Time: 0.49 seconds * the Access ODBC driver doesn't support the SQLForeignKeys function --- -- https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 05/02/2016 01:37 AM, DFS wrote: > So python matches or beats VBScript at this much larger file. Kewl. If you download something large enough to be meaningful, you'll find the runtime speeds should all converge to something showing your internet connection speed. Try downloading a 4 GB file, for example. You're trying to benchmark an io-bound operation. After you move past the very small and meaningless examples that simply benchmark the overhead of the connection building, you'll find that all languages, even compiled languages like C, should run at the same speed on average. Neither VBS nor Python will be faster than each other. Now if you want to talk about processing the data once you have it, there we can talk about speeds and optimization. -- https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 10:00 PM, Chris Angelico wrote: On Tue, May 3, 2016 at 11:51 AM, DFS wrote: On 5/2/2016 3:19 AM, Chris Angelico wrote: There's an easier way to test if there's caching happening. Just crank the iterations up from 10 to 100 and see what happens to the times. If your numbers are perfectly fair, they should be perfectly linear in the iteration count; eg a 1.8 second ten-iteration loop should become an 18 second hundred-iteration loop. Obviously they won't be exactly that, but I would expect them to be reasonably close (eg 17-19 seconds, but not 2 seconds). 100 loops Finished VBScript in 3.953 seconds Finished VBScript in 3.608 seconds Finished VBScript in 3.610 seconds Bit of a per-loop speedup going from 10 to 100. How many seconds was it for 10 loops? ChrisA ~0.44 -- https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On Tue, May 3, 2016 at 11:51 AM, DFS wrote: > On 5/2/2016 3:19 AM, Chris Angelico wrote: > >> There's an easier way to test if there's caching happening. Just crank >> the iterations up from 10 to 100 and see what happens to the times. If >> your numbers are perfectly fair, they should be perfectly linear in >> the iteration count; eg a 1.8 second ten-iteration loop should become >> an 18 second hundred-iteration loop. Obviously they won't be exactly >> that, but I would expect them to be reasonably close (eg 17-19 >> seconds, but not 2 seconds). > > > 100 loops > Finished VBScript in 3.953 seconds > Finished VBScript in 3.608 seconds > Finished VBScript in 3.610 seconds > > Bit of a per-loop speedup going from 10 to 100. How many seconds was it for 10 loops? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 3:19 AM, Chris Angelico wrote:
> There's an easier way to test if there's caching happening. Just crank
> the iterations up from 10 to 100 and see what happens to the times. If
> your numbers are perfectly fair, they should be perfectly linear in
> the iteration count; eg a 1.8 second ten-iteration loop should become
> an 18 second hundred-iteration loop. Obviously they won't be exactly
> that, but I would expect them to be reasonably close (eg 17-19
> seconds, but not 2 seconds).

100 loops
Finished VBScript in 3.953 seconds
Finished VBScript in 3.608 seconds
Finished VBScript in 3.610 seconds

Bit of a per-loop speedup going from 10 to 100.

> Then the next thing to test would be to create a deliberately-slow web
> server, and connect to that. Put a two-second delay into it, to
> simulate a distant or overloaded server, and see if your logs show the
> correct result. Something like this:
>
> import time
> try:
>     import http.server as BaseHTTPServer  # Python 3
> except ImportError:
>     import BaseHTTPServer  # Python 2
>
> class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
>     def do_GET(self):
>         self.send_response(200)
>         self.send_header("Content-type", "text/html")
>         self.end_headers()
>         self.wfile.write(b"Hello, ")
>         time.sleep(2)
>         self.wfile.write(b"world!")
>
> server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
> server.serve_forever()
> ---
> Test that with a web browser or command-line downloader (go to
> http://127.0.0.1:1234/), and make sure that (a) it produces "Hello,
> world!", and (b) it takes two seconds. Then set your test scripts to
> downloading that URL. (Be sure to set them back to low iteration
> counts first!) If the times are true and fair, they should all come
> out pretty much the same - ten iterations, twenty seconds. And since
> all that's changed is the server, this will be an accurate
> demonstration of what happens in the real world: network requests
> aren't always fast. Incidentally, you can also watch the server's log
> to see if it's getting the appropriate number of requests.
>
> It may turn out that changing the web server actually materially
> changes your numbers. Comment out the sleep call and try it again -
> you might find that your numbers come closer together, because this
> naive server doesn't send back 304 NOT MODIFIED responses or anything.
> Again, though, this would prove that you're not actually measuring
> language performance, because the tests are more dependent on the
> server than the client.
>
> Even if the files themselves aren't being cached, you might find that
> DNS is. So if you truly want to eliminate variables, replace the name
> in your URL with an IP address. It's another thing that might mess
> with your timings, without actually being a language feature.
>
> Networking has about four billion variables in it. You're messing with
> one of the least significant: the programming language :)
>
> ChrisA

Thanks for the good feedback.
-- 
https://mail.python.org/mailman/listinfo/python-list
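[The slow-server experiment above is easy to script end-to-end. The sketch below runs the same handler in a background thread, with two changes of mine: the 2-second delay is shortened to 0.1 s so the demo finishes quickly, and port 0 lets the OS pick a free port. It then times three downloads against it.]

```python
import threading
import time
try:
    from http.server import BaseHTTPRequestHandler, HTTPServer  # Python 3
    from urllib.request import urlopen
except ImportError:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer  # Python 2
    from urllib2 import urlopen

DELAY = 0.1  # shortened from the 2 s in the post, so this runs fast

class SlowHTTP(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(DELAY)           # simulate a distant/overloaded server
        self.wfile.write(b"world!")

    def log_message(self, *args):   # silence per-request log lines
        pass

server = HTTPServer(("127.0.0.1", 0), SlowHTTP)  # port 0 = any free port
t = threading.Thread(target=server.serve_forever)
t.daemon = True
t.start()

url = "http://127.0.0.1:%d/" % server.server_address[1]
start = time.time()
bodies = [urlopen(url).read() for _ in range(3)]
elapsed = time.time() - start

assert bodies[0] == b"Hello, world!"
print("3 requests took %.2f seconds (server delay %.1fs each)" % (elapsed, DELAY))
```

[With the server delay dominating, every client - urllib, urllib2, requests, pycurl, or xmlHTTP - should report essentially the same total, which is the point of the test.]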
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 4:42 AM, Peter Otten wrote:
> DFS wrote:
>>> Is VB using a local web cache, and Python not?
>>
>> I'm not specifying a local web cache with either (wouldn't know how or
>> where to look). If you have Windows, you can try it.
>
> I don't have Windows, but if I'm to believe
>
> http://stackoverflow.com/questions/5235464/how-to-make-microsoft-xmlhttprequest-honor-cache-control-directive
>
> the page is indeed cached and you can disable caching with
>
>> Option Explicit
>> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
>> webpage = "http://econpy.pythonanywhere.com/ex/001.html"
>> webfile = "D:\econpy001.html"
>> startTime = Timer
>> For i = 1 to 10
>>    Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
>>    xmlHTTP.Open "GET", webpage
>  xmlHTTP.setRequestHeader "Cache-Control", "max-age=0"

Tried that, and from later on that stackoverflow page:

 xmlHTTP.setRequestHeader "Cache-Control", "private"

Neither made a difference. In fact, I saw faster times than ever - as low as 0.41 for 10 loops.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 2016-05-02 00:06, DFS wrote: > Then I tested them in loops - the VBScript is MUCH faster: 0.44 for > 10 iterations, vs 0.88 for python. In addition to the other debugging recommendations in sibling threads, a couple other things to try: 1) use a local debugging proxy so that you can compare the headers to see if anything stands out 2) in light of #1, can you confirm/deny whether one is using gzip compression and the other isn't? -tkc -- https://mail.python.org/mailman/listinfo/python-list
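[Short of a full debugging proxy, a tiny local server that records each client's request headers is enough to answer both points above: point urllib2, requests, and the VBScript at it and diff what they send - in particular Accept-Encoding, which controls gzip. A sketch, with the handler name and port choice invented here:]

```python
import threading
try:
    from http.server import BaseHTTPRequestHandler, HTTPServer  # Python 3
    from urllib.request import urlopen
except ImportError:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer  # Python 2
    from urllib2 import urlopen

seen = []  # request headers captured from each client that connects

class HeaderEcho(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record exactly what this client advertised.
        seen.append(dict((k.lower(), v) for k, v in self.headers.items()))
        self.send_response(200)
        self.send_header("Content-type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass

server = HTTPServer(("127.0.0.1", 0), HeaderEcho)  # port 0 = any free port
t = threading.Thread(target=server.serve_forever)
t.daemon = True
t.start()

url = "http://127.0.0.1:%d/" % server.server_address[1]
urlopen(url).read()

# If a client sends "Accept-Encoding: gzip, deflate" it asked for
# compression; "identity" or nothing means it did not.
print(seen[0].get("accept-encoding", "<none sent>"))
```

[Run each client under test against the same URL and compare the captured dicts; a client that negotiates gzip will transfer far fewer bytes on compressible HTML, which alone can explain a 2x wall-clock gap.]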
Re: Fastest way to retrieve and write html contents to file
DFS wrote:
>> Is VB using a local web cache, and Python not?
>
> I'm not specifying a local web cache with either (wouldn't know how or
> where to look). If you have Windows, you can try it.

I don't have Windows, but if I'm to believe

http://stackoverflow.com/questions/5235464/how-to-make-microsoft-xmlhttprequest-honor-cache-control-directive

the page is indeed cached and you can disable caching with

> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html"
> webfile = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
>    Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
>    xmlHTTP.Open "GET", webpage
  xmlHTTP.setRequestHeader "Cache-Control", "max-age=0"
> xmlHTTP.Send
> Set fso = CreateObject("Scripting.FileSystemObject")
> Set fOut = fso.CreateTextFile(webfile, True)
> fOut.WriteLine xmlHTTP.ResponseText
> fOut.Close
> Set fOut= Nothing
> Set fso = Nothing
> Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime -
> startTime,3) & " seconds"
> ---
> save it to a .vbs file and run it like this:
> $cscript /nologo filename.vbs
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On Mon, May 2, 2016, at 12:37 AM, DFS wrote: > On 5/2/2016 2:27 AM, Stephen Hansen wrote: > > I'm again going back to the point of: its fast enough. When comparing > > two small numbers, "twice as slow" is meaningless. > > Speed is always meaningful. > > I know python is relatively slow, but it's a cool, concise, powerful > language. I'm extremely impressed by how tight the code can get. I'm sorry, but no. Speed is not always meaningful. It's not even usually meaningful, because you can't quantify what "speed is". In context, you're claiming this is twice as slow (even though my tests show dramatically better performance), but what details are different? You're ignoring the fact that Python might have a constant overhead -- meaning, for a 1k download, it might have X speed cost. For a 1meg download, it might still have the exact same X cost. Looking narrowly, that overhead looks like "twice as slow", but that's not meaningful at all. Looking larger, that overhead is a pittance. You aren't measuring that. > > You have an assumption you haven't answered, that downloading a 10 meg > > file will be twice as slow as downloading this tiny file. You haven't > > proven that at all. > > True. And it has been my assumption - tho not with 10MB file. And that assumption is completely invalid. > I noticed urllib and curl returned the html as is, but urllib2 and > requests added enhancements that should make the data easier to parse. > Based on speed and functionality and documentation, I believe I'll be > using the requests HTTP library (I will actually be doing a small amount > of web scraping). The requests library's added-value is ease-of-use, and its overhead is likely tiny: so using it means you spend less effort making a thing happen. I recommend you embrace this. > VBScript > 1st run: 7.70 seconds > 2nd run: 5.38 > 3rd run: 7.71 > > So python matches or beats VBScript at this much larger file. Kewl. 
This is what I'm talking about: Python might have a constant overhead, but looking at larger operations, its probably comparable. Not fast, mind you. Python isn't the fastest language out there. But in real world work, its usually fast enough. -- Stephen Hansen m e @ i x o k a i . i o -- https://mail.python.org/mailman/listinfo/python-list
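[The constant-overhead claim is testable rather than arguable: time downloads of a few different sizes and fit time = overhead + rate * size. The least-squares sketch below uses made-up timings (not measurements from this thread) purely to show the arithmetic that separates the per-request constant from the per-byte cost.]

```python
# Fit time = overhead + rate * size by ordinary least squares.
# These (size, time) pairs are synthetic, chosen to resemble the file
# sizes mentioned in the thread; they are illustrations, not data.
sizes = [3546, 58854, 500000, 4000000]   # bytes per download
times = [0.0854, 0.0909, 0.1350, 0.4850]  # seconds per download

n = len(sizes)
mean_x = sum(sizes) / float(n)
mean_y = sum(times) / float(n)

# slope = covariance / variance; intercept = the fixed per-request cost
rate = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
        / sum((x - mean_x) ** 2 for x in sizes))
overhead = mean_y - rate * mean_x

print("constant overhead: %.4f s/request" % overhead)
print("transfer rate:     %.1f ns/byte" % (rate * 1e9))
```

[On the tiny 3.5 KB file the fixed overhead dwarfs the transfer term, so a client with a slightly larger constant looks "twice as slow"; at 4 MB the same constant is noise. That is the distinction being argued above.]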
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 2:27 AM, Stephen Hansen wrote:
> On Sun, May 1, 2016, at 10:59 PM, DFS wrote:
>> startTime = time.clock()
>> for i in range(loops):
>>     r = urllib2.urlopen(webpage)
>>     f = open(webfile,"w")
>>     f.write(r.read())
>>     f.close
>> endTime = time.clock()
>> print "Finished urllib2 in %.2g seconds" %(endTime-startTime)
>
> Yeah on my system I get 1.8 out of this, amounting to 0.18s.

You get 1.8 seconds total for the 10 loops? That's less than half as fast as my results. Surprising.

> I'm again going back to the point of: its fast enough. When comparing
> two small numbers, "twice as slow" is meaningless.

Speed is always meaningful.

I know python is relatively slow, but it's a cool, concise, powerful language. I'm extremely impressed by how tight the code can get.

> You have an assumption you haven't answered, that downloading a 10 meg
> file will be twice as slow as downloading this tiny file. You haven't
> proven that at all.

True. And it has been my assumption - tho not with 10MB file.

> I suspect you have a constant overhead of X, and in this toy example,
> that makes it seem twice as slow. But when downloading a file of size,
> you'll have the same constant factor, at which point the difference is
> irrelevant.

Good point. Test below.

> If you believe otherwise, demonstrate it.

http://www.usdirectory.com/ypr.aspx?fromform=qsearch&qs=ga&wqhqn=2&qc=Atlanta&rg=30&qhqn=restaurant&sb=zipdisc&ap=2

It's a 58854 byte file when saved to disk (smaller file was 3546 bytes), so this is 16.6x larger. So I would expect python to linearly run in 16.6 * 0.88 = 14.6 seconds.

10 loops per run

1st run
$ python timeGetHTML.py
Finished urllib in 8.5 seconds
Finished urllib2 in 5.6 seconds
Finished requests in 7.8 seconds
Finished pycurl in 6.5 seconds

wait a couple minutes, then 2nd run
$ python timeGetHTML.py
Finished urllib in 5.6 seconds
Finished urllib2 in 5.7 seconds
Finished requests in 5.2 seconds
Finished pycurl in 6.4 seconds

It's a little more than 1/3 of my estimate - so good news.

(when I was doing these tests, some of the python results were 0.75 seconds - way too fast, so I checked and no data was written to file, and I couldn't even open the webpage with a browser. Looks like I had been temporarily blocked from the site. After a couple minutes, I was able to access it again).

I noticed urllib and curl returned the html as is, but urllib2 and requests added enhancements that should make the data easier to parse. Based on speed and functionality and documentation, I believe I'll be using the requests HTTP library (I will actually be doing a small amount of web scraping).

VBScript
1st run: 7.70 seconds
2nd run: 5.38
3rd run: 7.71

So python matches or beats VBScript at this much larger file. Kewl.
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On Mon, May 2, 2016 at 4:47 PM, DFS wrote:
> I'm not specifying a local web cache with either (wouldn't know how or
> where to look). If you have Windows, you can try it.
> ---
> Option Explicit
> Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
> webpage = "http://econpy.pythonanywhere.com/ex/001.html"
> webfile = "D:\econpy001.html"
> startTime = Timer
> For i = 1 to 10
>    Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
>    xmlHTTP.Open "GET", webpage
>    xmlHTTP.Send
>    Set fso = CreateObject("Scripting.FileSystemObject")
>    Set fOut = fso.CreateTextFile(webfile, True)
>    fOut.WriteLine xmlHTTP.ResponseText
>    fOut.Close
>    Set fOut= Nothing
>    Set fso = Nothing
>    Set xmlHTTP = Nothing
> Next
> endTime = Timer
> wscript.echo "Finished VBScript in " & FormatNumber(endTime -
> startTime,3) & " seconds"
> ---

There's an easier way to test if there's caching happening. Just crank the iterations up from 10 to 100 and see what happens to the times. If your numbers are perfectly fair, they should be perfectly linear in the iteration count; eg a 1.8 second ten-iteration loop should become an 18 second hundred-iteration loop. Obviously they won't be exactly that, but I would expect them to be reasonably close (eg 17-19 seconds, but not 2 seconds).

Then the next thing to test would be to create a deliberately-slow web server, and connect to that. Put a two-second delay into it, to simulate a distant or overloaded server, and see if your logs show the correct result.

Something like this:

import time
try:
    import http.server as BaseHTTPServer  # Python 3
except ImportError:
    import BaseHTTPServer  # Python 2

class SlowHTTP(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-type", "text/html")
        self.end_headers()
        self.wfile.write(b"Hello, ")
        time.sleep(2)
        self.wfile.write(b"world!")

server = BaseHTTPServer.HTTPServer(("", 1234), SlowHTTP)
server.serve_forever()
---

Test that with a web browser or command-line downloader (go to http://127.0.0.1:1234/), and make sure that (a) it produces "Hello, world!", and (b) it takes two seconds. Then set your test scripts to downloading that URL. (Be sure to set them back to low iteration counts first!)

If the times are true and fair, they should all come out pretty much the same - ten iterations, twenty seconds. And since all that's changed is the server, this will be an accurate demonstration of what happens in the real world: network requests aren't always fast. Incidentally, you can also watch the server's log to see if it's getting the appropriate number of requests.

It may turn out that changing the web server actually materially changes your numbers. Comment out the sleep call and try it again - you might find that your numbers come closer together, because this naive server doesn't send back 304 NOT MODIFIED responses or anything. Again, though, this would prove that you're not actually measuring language performance, because the tests are more dependent on the server than the client.

Even if the files themselves aren't being cached, you might find that DNS is. So if you truly want to eliminate variables, replace the name in your URL with an IP address. It's another thing that might mess with your timings, without actually being a language feature.

Networking has about four billion variables in it. You're messing with one of the least significant: the programming language :)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 2:05 AM, Steven D'Aprano wrote:
> On Monday 02 May 2016 15:00, DFS wrote:
>> I tried the 10-loop test several times with all versions.
>>
>> The results were 100% consistent: VBSCript xmlHTTP was always 2x faster
>> than any python method.
>
> Are you absolutely sure you're comparing the same job in two languages?

As near as I can tell.

In VBScript I'm actually dereferencing various objects (that adds to the time), but I don't do that in python. I don't know enough to even know if it's necessary, or good practice, or what.

> Is VB using a local web cache, and Python not?

I'm not specifying a local web cache with either (wouldn't know how or where to look). If you have Windows, you can try it.

---
Option Explicit
Dim xmlHTTP, fso, fOut, startTime, endTime, webpage, webfile,i
webpage = "http://econpy.pythonanywhere.com/ex/001.html"
webfile = "D:\econpy001.html"
startTime = Timer
For i = 1 to 10
   Set xmlHTTP = CreateObject("MSXML2.serverXMLHTTP")
   xmlHTTP.Open "GET", webpage
   xmlHTTP.Send
   Set fso = CreateObject("Scripting.FileSystemObject")
   Set fOut = fso.CreateTextFile(webfile, True)
   fOut.WriteLine xmlHTTP.ResponseText
   fOut.Close
   Set fOut= Nothing
   Set fso = Nothing
   Set xmlHTTP = Nothing
Next
endTime = Timer
wscript.echo "Finished VBScript in " & FormatNumber(endTime - startTime,3) & " seconds"
---

save it to a .vbs file and run it like this:
$cscript /nologo filename.vbs

> Are you saving files with both tests? To the same local drive? (To
> ensure you aren't measuring the difference between "write this file to
> a slow IDE hard disk, write that file to a fast SSD".)

Identical functionality (retrieve webpage, write html to file). Same webpage, written to the same folder on the same hard drive (not SSD).

The 10 file writes (open/write/close) don't make a meaningful difference at all:
VBScript  0.0156 seconds
urllib2   0.0034 seconds

This file is 3.55K.

> Once you are sure that you are comparing the same task in two
> languages, then make sure the measurement is meaningful. If you change
> from a (let's say) 1 KB file to a 100 KB file, do you see the same 2 x
> difference? What if you increase it to a 1 MB file?

Do you know a webpage I can hit 10x repeatedly to download a good size file? I'm always paranoid they'll block me thinking I'm a "professional" web scraper or something.

Thanks
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On Sun, May 1, 2016, at 10:59 PM, DFS wrote:
> startTime = time.clock()
> for i in range(loops):
>     r = urllib2.urlopen(webpage)
>     f = open(webfile,"w")
>     f.write(r.read())
>     f.close
> endTime = time.clock()
> print "Finished urllib2 in %.2g seconds" %(endTime-startTime)

Yeah on my system I get 1.8 out of this, amounting to 0.18s.

I'm again going back to the point of: its fast enough. When comparing two small numbers, "twice as slow" is meaningless.

You have an assumption you haven't answered, that downloading a 10 meg file will be twice as slow as downloading this tiny file. You haven't proven that at all.

I suspect you have a constant overhead of X, and in this toy example, that makes it seem twice as slow. But when downloading a file of size, you'll have the same constant factor, at which point the difference is irrelevant.

If you believe otherwise, demonstrate it.

-- 
Stephen Hansen
 m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On Monday 02 May 2016 15:00, DFS wrote:
> I tried the 10-loop test several times with all versions.
>
> The results were 100% consistent: VBSCript xmlHTTP was always 2x faster
> than any python method.

Are you absolutely sure you're comparing the same job in two languages?

Is VB using a local web cache, and Python not?

Are you saving files with both tests? To the same local drive? (To ensure you aren't measuring the difference between "write this file to a slow IDE hard disk, write that file to a fast SSD".)

Once you are sure that you are comparing the same task in two languages, then make sure the measurement is meaningful. If you change from a (let's say) 1 KB file to a 100 KB file, do you see the same 2 x difference? What if you increase it to a 1 MB file?

-- 
Steve
-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On Monday 02 May 2016 15:04, DFS wrote: > 0.2 is half as fast as 0.1, here. > > And two small numbers turn into bigger numbers when the webpage is big, > and soon the download time differences are measured in minutes, not half > a second. It takes twice as long to screw a screw into timber than to hammer a nail into the same timber. Therefore if builders change from nails to screws, they can finish building the house in half the time. -- Steve -- https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 1:15 AM, Stephen Hansen wrote:
> On Sun, May 1, 2016, at 10:00 PM, DFS wrote:
>> I tried the 10-loop test several times with all versions.
>
> Also how, _exactly_, are you testing this?
>
> C:\Python27>python -m timeit "filename='C:\\test.txt';
> webpage='http://econpy.pythonanywhere.com/ex/001.html'; import urllib2;
> r = urllib2.urlopen(webpage); f = open(filename, 'w');
> f.write(r.read()); f.close();"
> 10 loops, best of 3: 175 msec per loop
>
> That's a whole lot less than the 0.88 secs.

Indeed.

-
import requests, urllib, urllib2, pycurl
import time

webpage = "http://econpy.pythonanywhere.com/ex/001.html"
webfile = "D:\\econpy001.html"
loops = 10

startTime = time.clock()
for i in range(loops):
    urllib.urlretrieve(webpage,webfile)
endTime = time.clock()
print "Finished urllib in %.2g seconds" %(endTime-startTime)

startTime = time.clock()
for i in range(loops):
    r = urllib2.urlopen(webpage)
    f = open(webfile,"w")
    f.write(r.read())
    f.close
endTime = time.clock()
print "Finished urllib2 in %.2g seconds" %(endTime-startTime)

startTime = time.clock()
for i in range(loops):
    r = requests.get(webpage)
    f = open(webfile,"w")
    f.write(r.text)
    f.close
endTime = time.clock()
print "Finished requests in %.2g seconds" %(endTime-startTime)

startTime = time.clock()
for i in range(loops):
    with open(webfile + str(i) + ".txt", 'wb') as f:
        c = pycurl.Curl()
        c.setopt(c.URL, webpage)
        c.setopt(c.WRITEDATA, f)
        c.perform()
        c.close()
endTime = time.clock()
print "Finished pycurl in %.2g seconds" %(endTime-startTime)
-

$ python getHTML.py
Finished urllib in 0.88 seconds
Finished urllib2 in 0.83 seconds
Finished requests in 0.89 seconds
Finished pycurl in 1.1 seconds

Those results are consistent. They go up or down a little, but never below 0.82 seconds (for urllib2), or above 1.2 seconds (for pycurl)

VBScript is consistently 0.44 to 0.48
-- 
https://mail.python.org/mailman/listinfo/python-list
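[One variable worth ruling out in the loops above: every iteration opens a fresh TCP connection, while a keep-alive connection pays the setup cost once - exactly the kind of per-request constant discussed in this thread. This stdlib sketch (throwaway local server, invented names) compares the two against the same endpoint; requests can get the same reuse via requests.Session. Whether xmlHTTP reuses connections is not established here.]

```python
import threading
import time
try:
    from http.server import BaseHTTPRequestHandler, HTTPServer  # Python 3
    from http.client import HTTPConnection
except ImportError:
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer  # Python 2
    from httplib import HTTPConnection

class Hello(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep-alive requires HTTP/1.1
    def do_GET(self):
        body = b"Hello, world!"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))  # needed for 1.1
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass

server = HTTPServer(("127.0.0.1", 0), Hello)
t = threading.Thread(target=server.serve_forever)
t.daemon = True
t.start()
port = server.server_address[1]

N = 20

start = time.time()
for _ in range(N):                     # fresh TCP connection per request
    c = HTTPConnection("127.0.0.1", port)
    c.request("GET", "/")
    c.getresponse().read()
    c.close()
fresh = time.time() - start

start = time.time()
c = HTTPConnection("127.0.0.1", port)  # one keep-alive connection, reused
for _ in range(N):
    c.request("GET", "/")
    c.getresponse().read()
c.close()
reused = time.time() - start

print("fresh: %.4fs  reused: %.4fs  for %d requests" % (fresh, reused, N))
```

[On localhost the gap is small; against a real server each fresh connection also pays a network round trip (plus TLS on https), so reuse matters much more there.]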
Re: Fastest way to retrieve and write html contents to file
On Sun, May 1, 2016, at 10:04 PM, DFS wrote: > And two small numbers turn into bigger numbers when the webpage is big, > and soon the download time differences are measured in minutes, not half > a second. Are you sure of that? Have you determined that the time is not a constant overhead verses that the time is directly relational to the size of the page? If so, how have you determined that? You aren't showing how you're testing. 0.4s difference is meaningless to me, if its a constant overhead. If its twice as slow for a 1 meg file, then you might have an issue. Maybe. You haven't shown that. -- Stephen Hansen m e @ i x o k a i . i o -- https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On Sun, May 1, 2016, at 10:00 PM, DFS wrote:
> I tried the 10-loop test several times with all versions.

Also how, _exactly_, are you testing this?

C:\Python27>python -m timeit "filename='C:\\test.txt'; webpage='http://econpy.pythonanywhere.com/ex/001.html'; import urllib2; r = urllib2.urlopen(webpage); f = open(filename, 'w'); f.write(r.read()); f.close();"
10 loops, best of 3: 175 msec per loop

That's a whole lot less than the 0.88 secs.

-- 
Stephen Hansen
 m e @ i x o k a i . i o
-- 
https://mail.python.org/mailman/listinfo/python-list
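[The same measurement can be scripted with the timeit module, which keeps one-time setup out of the timed statement and repeats runs for you. To stay runnable anywhere, this sketch times only the local file-write half of the loop (the scratch filename is invented), not the download:]

```python
import os
import tempfile
import timeit

data = b"x" * 100000  # ~100 KB stand-in for a downloaded page
fname = os.path.join(tempfile.mkdtemp(), "scratch.bin")

def write_file():
    # The per-iteration work being timed: open, write, close.
    with open(fname, "wb") as f:
        f.write(data)

# Three runs of 10 writes each; report the best run, which is the one
# least disturbed by other system activity (same convention as
# "python -m timeit", which prints "best of 3").
best = min(timeit.repeat(write_file, repeat=3, number=10))
print("best of 3: %.4f s for 10 writes" % best)
```

[Passing a callable instead of a statement string avoids quoting the whole loop body on the command line, and min() over repeats is the usual way to filter out scheduling noise.]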
Re: Fastest way to retrieve and write html contents to file
On Mon, May 2, 2016 at 3:04 PM, DFS wrote: > And two small numbers turn into bigger numbers when the webpage is big, and > soon the download time differences are measured in minutes, not half a > second. > > So, any ideas? So, measure with bigger web pages, and find out whether it's really a 2:1 ratio or a half-second difference. When download times are measured in minutes, a half second difference is insignificant. Extrapolating is dangerous. https://xkcd.com/605/ ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 1:00 AM, Stephen Hansen wrote: On Sun, May 1, 2016, at 09:50 PM, DFS wrote: On 5/2/2016 12:40 AM, Chris Angelico wrote: On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen wrote: On Sun, May 1, 2016, at 09:06 PM, DFS wrote: Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10 iterations, vs 0.88 for python. ... I know it's asking a lot, but is there a really fast AND really short python solution for this simple thing? 0.88 is not fast enough for you? That's less then a second. Also, this is timings of network and disk operations. Unless something pathological is happening, the language used won't make any difference. ChrisA Unfortunately, the VBScript is twice as fast as any python method. And 0.2 is twice as fast as 0.1. When you have two small numbers, 'twice as fast' isn't particularly meaningful as a metric. 0.2 is half as fast as 0.1, here. And two small numbers turn into bigger numbers when the webpage is big, and soon the download time differences are measured in minutes, not half a second. So, any ideas? -- https://mail.python.org/mailman/listinfo/python-list
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 12:49 AM, Ben Finney wrote:
> DFS writes:
>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
>> iterations, vs 0.88 for python.
>>
>> […]
>>
>> urllib2 and requests were about the same speed as urllib.urlretrieve,
>> while pycurl was significantly slower (1.2 seconds).
>
> Network access is notoriously erratic in its timing. The program, and
> the machine on which it runs, is subject to a great many external
> effects once the request is sent — effects which will significantly
> alter the delay before a response is completed.
>
> How have you controlled for the wide variability in the duration, for
> even a given request by the *same code on the same machine*, at
> different points in time?
>
> One simple way to do that: Run the exact same test many times (say,
> 10 000 or so) on the same machine, and then compute the average of all
> the durations.
>
> Do the same for each different program, and then you may have more
> meaningfully comparable measurements.

I tried the 10-loop test several times with all versions. The results
were 100% consistent: VBScript xmlHTTP was always 2x faster than any
python method.
Re: Fastest way to retrieve and write html contents to file
On Sun, May 1, 2016, at 09:50 PM, DFS wrote:
> On 5/2/2016 12:40 AM, Chris Angelico wrote:
>> On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen wrote:
>>> On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
>>>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for
>>>> 10 iterations, vs 0.88 for python.
>>> ...
>>>> I know it's asking a lot, but is there a really fast AND really
>>>> short python solution for this simple thing?
>>>
>>> 0.88 is not fast enough for you? That's less than a second.
>>
>> Also, these are timings of network and disk operations. Unless
>> something pathological is happening, the language used won't make any
>> difference.
>>
>> ChrisA
>
> Unfortunately, the VBScript is twice as fast as any python method.

And 0.2 is twice as fast as 0.1. When you have two small numbers,
'twice as fast' isn't particularly meaningful as a metric.

--
Stephen Hansen
m e @ i x o k a i . i o
Re: Fastest way to retrieve and write html contents to file
On 5/2/2016 12:40 AM, Chris Angelico wrote:
> On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen wrote:
>> On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
>>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for
>>> 10 iterations, vs 0.88 for python.
>> ...
>>> I know it's asking a lot, but is there a really fast AND really short
>>> python solution for this simple thing?
>>
>> 0.88 is not fast enough for you? That's less than a second.
>
> Also, these are timings of network and disk operations. Unless
> something pathological is happening, the language used won't make any
> difference.
>
> ChrisA

Unfortunately, the VBScript is twice as fast as any python method.
Re: Fastest way to retrieve and write html contents to file
On Mon, May 2, 2016 at 2:49 PM, Ben Finney wrote:
> One simple way to do that: Run the exact same test many times (say,
> 10 000 or so) on the same machine, and then compute the average of all
> the durations.
>
> Do the same for each different program, and then you may have more
> meaningfully comparable measurements.

And also find the minimum and maximum durations, too. Averages don't
always tell the whole story.

ChrisA
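The min/mean/max idea can be sketched with `timeit.repeat`, which already returns one total per repeat. A local workload stands in for the page download here (no network is assumed), so the example only illustrates the reporting, not the fetch:

```python
import statistics
import timeit

# Time the same statement in 5 repeats of 100 runs each, then report
# min, mean, and max rather than a lone average, since one slow outlier
# can skew the mean badly.
samples = timeit.repeat("sorted(range(1000))", number=100, repeat=5)
print('min  %.5f sec' % min(samples))
print('mean %.5f sec' % statistics.mean(samples))
print('max  %.5f sec' % max(samples))
```

For network timings specifically, the minimum is usually the most stable figure, since everything else is the baseline cost plus noise.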
Re: Fastest way to retrieve and write html contents to file
DFS writes:
> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
> iterations, vs 0.88 for python.
>
> […]
>
> urllib2 and requests were about the same speed as urllib.urlretrieve,
> while pycurl was significantly slower (1.2 seconds).

Network access is notoriously erratic in its timing. The program, and
the machine on which it runs, is subject to a great many external
effects once the request is sent — effects which will significantly
alter the delay before a response is completed.

How have you controlled for the wide variability in the duration, for
even a given request by the *same code on the same machine*, at
different points in time?

One simple way to do that: Run the exact same test many times (say,
10 000 or so) on the same machine, and then compute the average of all
the durations.

Do the same for each different program, and then you may have more
meaningfully comparable measurements.

--
 \   “We are no more free to believe whatever we want about God than |
  `\   we are free to adopt unjustified beliefs about science or |
_o__)  history […].” —Sam Harris, _The End of Faith_, 2004 |
Ben Finney
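The "run it many times and average" suggestion can be sketched directly with `time.perf_counter`. The workload below is a hypothetical stand-in for one fetch-and-write cycle (no real URL or network call is assumed):

```python
import time

def time_many(fn, runs=1000):
    """Call fn `runs` times and return the average duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)

# Hypothetical workload standing in for one download-and-save cycle.
avg = time_many(lambda: b'x' * 10000, runs=1000)
print('average over 1000 runs: %.7f sec' % avg)
```

`time.perf_counter` is a monotonic, high-resolution clock, so it is better suited to short intervals than `time.time`.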
Re: Fastest way to retrieve and write html contents to file
On Mon, May 2, 2016 at 2:34 PM, Stephen Hansen wrote:
> On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
>> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
>> iterations, vs 0.88 for python.
> ...
>> I know it's asking a lot, but is there a really fast AND really short
>> python solution for this simple thing?
>
> 0.88 is not fast enough for you? That's less than a second.

Also, these are timings of network and disk operations. Unless
something pathological is happening, the language used won't make any
difference.

ChrisA
Re: Fastest way to retrieve and write html contents to file
On Sun, May 1, 2016, at 09:06 PM, DFS wrote:
> Then I tested them in loops - the VBScript is MUCH faster: 0.44 for 10
> iterations, vs 0.88 for python.
...
> I know it's asking a lot, but is there a really fast AND really short
> python solution for this simple thing?

0.88 is not fast enough for you? That's less than a second.

--
Stephen Hansen
m e @ i x o k a i . i o