On Sat, Oct 2, 2010 at 4:58 AM, jimgardener <jimgarde...@gmail.com> wrote:
> hi
> while trying out urllib.urlopen, I wrote this code to read a url and
> return the data length
>
> import datetime,time,urllib
>
> def get_page_size(pageurlstr):
>     h=urllib.urlopen(pageurlstr)
>     data=h.read()
>     return len(data)
>
> while True:
>     print 'reading url www.google.com at',datetime.datetime.now().isoformat(' ')
>     print 'size=%d'%get_page_size('http://www.google.com')
>     time.sleep(5)
>
>
> I got this output
>
> reading url www.google.com at 2010-10-02 17:22:24.691654
> size=9512
> reading url www.google.com at 2010-10-02 17:22:30.681236
> size=9530
> reading url www.google.com at 2010-10-02 17:22:36.886369
> size=9530
> reading url www.google.com at 2010-10-02 17:22:42.315392
> size=9512
> reading url www.google.com at 2010-10-02 17:22:48.763693
> size=9512
> reading url www.google.com at 2010-10-02 17:22:54.711666
> size=9548
> reading url www.google.com at 2010-10-02 17:23:00.151843
> size=9530
> reading url www.google.com at 2010-10-02 17:23:05.844620
> size=9548
>
>
> Why is it that the sizes are different?

Because Google does not always send back the *exact* same HTML every time you request their homepage (note how small the variance is). You can easily verify this using the "Save Page" function of your browser and diff-ing the HTML for 2 different loads. What is varying is possibly some sort of tracking ID.

> what must I do to ensure that the whole page is read ?

Nothing. Using .read() already ensures it.

Cheers,
Chris
--
http://blog.rebertia.com
--
http://mail.python.org/mailman/listinfo/python-list
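
If you would rather do the comparison from Python than from the browser, something along these lines might work. It is only a rough sketch: the two page strings and the "sid" field are made up for illustration (this is not what Google actually sends), and changed_lines is a hypothetical helper name. In practice the two strings would be the results of two separate .read() calls (decoded to text first if you are on Python 3).

```python
import difflib

def changed_lines(first_html, second_html):
    """Return only the removed ('-') and added ('+') lines between two loads."""
    diff = list(difflib.unified_diff(
        first_html.splitlines(),
        second_html.splitlines(),
        fromfile="load1",
        tofile="load2",
        lineterm="",
        n=0,  # no context lines, only the lines that actually changed
    ))
    # The first two entries are the '---'/'+++' file headers (absent when
    # the inputs are identical); '@@' entries are hunk position markers.
    return [line for line in diff[2:] if not line.startswith("@@")]

# Made-up stand-ins for two loads of the same page; the "sid" value is a
# hypothetical per-request token.
page_a = '<html>\n<body>\n<input name="sid" value="abc123">\n</body>\n</html>'
page_b = '<html>\n<body>\n<input name="sid" value="abc123456">\n</body>\n</html>'

print("size difference: %d" % (len(page_b) - len(page_a)))
for line in changed_lines(page_a, page_b):
    print(line)
```

If the only lines that show up are things like tokens or IDs, the size variation is coming from the server and not from an incomplete read.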