>>>>> jitu <nair.jiten...@gmail.com> (j) wrote:

>j> Hi,
>j> An HTML page contains 'anchor' elements whose 'href' attribute has
>j> a semicolon in the URL. When fetching the page using
>j> urllib2.urlopen, all such hrefs containing semicolons are
>j> truncated.


>j> For example, the href
>j> http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL
>j> gets truncated to
>j> http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i

>j> The page I am talking about can be fetched from
>j> http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_ylc=X3oDMTFka28zOGNuBF9TAzI3NjY2NzkEX3MDOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--

It's not Python that causes this. The server itself sends you the URLs
without these parameters (the part after the semicolon is a URL path
parameter).
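
To see what "parameters" means here, the standard library's URL parser
can split the example href from the question. This is only a small
sketch, written for Python 3 (where the old urlparse module lives in
urllib.parse); the choice of Python 3 and the variable names are my
assumptions, not part of the original question.

```python
from urllib.parse import urlparse

# Example href taken from the question above.
href = ("http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i"
        ";_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL")

parts = urlparse(href)

# The parser separates the path from the ';'-delimited parameters
# attached to the last path segment.
print("path:  ", parts.path)
print("params:", parts.params)
```

The text after the semicolon ends up in parts.params rather than in
parts.path, which is why the server can drop it and still serve the
same page.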

To get them, you have to tell the server that you are a respectable
browser by sending a browser-like User-Agent header, e.g.:

import urllib2

# Either URL from the question can be used; the second assignment
# overrides the first.
url = 'http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL'

url = 'http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_ylc=X3oDMTFka28zOGNuBF9TAzI3NjY2NzkEX3MDOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--'

# Present ourselves to the server as a regular browser.
hdrs = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; '
                      'rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13',
        'Accept': 'image/*'}

request = urllib2.Request(url=url, headers=hdrs)
page = urllib2.urlopen(request).read()
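
For readers on Python 3, where urllib2 was merged into urllib.request,
the same idea might look roughly like the sketch below. The helper
names build_request and fetch are hypothetical, and whether the server
still varies its output by User-Agent is something I have not checked.

```python
import urllib.request

# Browser-like User-Agent string, same value as in the example above.
BROWSER_UA = ("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; "
              "rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13")


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request for url that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download url and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


# Building the request does not touch the network; only fetch() does.
request = build_request(
    "http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i"
    ";_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL")
print(request.full_url)
print(request.header_items())
```

Calling fetch(url) would then play the role of the
urllib2.urlopen(request).read() line above.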

-- 
Piet van Oostrum <p...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
-- 
http://mail.python.org/mailman/listinfo/python-list
