On 16 July 2017 at 11:26, Javier Bezos <jbezos.du...@gmail.com> wrote:
> Google News used to fail with the high level functions provided by httplib
> and the like. However, I found this piece of code somewhere:
>
> def gopen():
>     http = httplib.HTTPSConnection('news.google.com')
>     http.request("GET", "/news?ned=es_MX",
>         headers={
>             "User-Agent": "Mozilla/5.0 (X11; U; Linux i686; es-MX) "
>                           "AppleWebKit/532.8 (KHTML, like Gecko) "
>                           "Chrome/4.0.277.0 Safari/532.8",
>             "Host": 'news.google.com',
>             "Accept": "*/*"})
>     return http.getresponse()
>
> A few days ago, Google News has been revamped and it doesn't work any more
> (2.6/Win7, 2.7/OSX and, with minimal changes, 3.6/Win7), because the page
> contents is empty. The code itself doesn't raise any errors. Which is the
> proper way to do it now? I must stick to the standard libraries.

Why? The Python standard library doesn’t have anything good for HTTP.

* httplib is fairly low-level, and it does not support something as
  basic as redirects;
* urllib.request (urllib2 in Python 2) is slightly better;
* but even the official docs for both redirect to requests:
  http://docs.python-requests.org/en/master/ for a high level interface.

(Also, please upgrade your Windows box to run Python 2.7.)

> The returned headers are:
>
> ----------------------
> [('Content-Type', 'application/binary'),
>  ('Cache-Control', 'no-cache, no-store, max-age=0, must-revalidate'),
>  ('Pragma', 'no-cache'),
>  ('Expires', 'Mon, 01 Jan 1990 00:00:00 GMT'),
>  ('Date', 'Thu, 13 Jul 2017 16:37:48 GMT'),
>  ('Location', 'https://news.google.com/news/?ned=es_mx&hl=es'),
>  ('Strict-Transport-Security', 'max-age=10886400'),
>  ('P3P',
>   'CP="This is not a P3P policy! See '
>   'https://support.google.com/accounts/answer/151657?hl=en for more info."'),
>  ('Server', 'ESF'),
>  ('Content-Length', '0'),
>  ('X-XSS-Protection', '1; mode=block'),
>  ('X-Frame-Options', 'SAMEORIGIN'),
>  ('X-Content-Type-Options', 'nosniff'),
>  ('Set-Cookie',
>   'NID=107=qwH7N2hB12zVGfFzrAC2CZZNhrnNAVLEmTvDvuSzzw6mSlta9D2RDZVP9t5gEcq_WJjZQjDSWklJ7LElSnAZnHsiF4CXOwvGDs2tjrXfP41LE-6LafdA86GO3sWYnfWs;Domain=.google.com;Path=/;Expires=Fri, '
>   '12-Jan-2018 16:37:48 GMT;HttpOnly'),
>  ('Alt-Svc', 'quic=":443"; ma=2592000; v="39,38,37,36,35"')]
> -----------------------
>
> `read()` is empty string ('' or b''). `status` is 302. `reason` is `Found`.

https://en.wikipedia.org/wiki/HTTP_302

See that Location header? The web server wants to redirect you
somewhere. Your low-level HTTP library does not handle redirects
automatically, so you’d need to take care of that yourself.

-- 
Chris Warrick <https://chriswarrick.com/>
PGP: 5EAAEA16
-- 
https://mail.python.org/mailman/listinfo/python-list
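
If you really do have to stay within the standard library, following the
redirect by hand could look roughly like the sketch below. It is written for
Python 3 (http.client and urllib.parse; on Python 2 those are httplib and
urlparse). The names fetch, REDIRECT_CODES, max_redirects and the connect
parameter are made up for this illustration, and the sketch assumes every hop
is HTTPS and ignores cookies, so treat it as a starting point rather than a
complete client.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Status codes that carry a Location header we are willing to follow.
REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch(url, headers=None, max_redirects=5,
          connect=http.client.HTTPSConnection):
    """GET `url`, following redirects manually.

    Returns (final_url, status, body). `connect` is a hypothetical hook
    that builds a connection from a host name; it defaults to
    HTTPSConnection, so only https:// URLs are handled here.
    """
    for _ in range(max_redirects + 1):
        parts = urlsplit(url)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query

        conn = connect(parts.netloc)
        conn.request("GET", path, headers=headers or {})
        resp = conn.getresponse()
        body = resp.read()
        location = resp.getheader("Location")
        conn.close()

        if resp.status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return url, resp.status, body

    raise RuntimeError("too many redirects")
```

Calling fetch("https://news.google.com/news?ned=es_MX", headers={...}) with
the headers from the original gopen() would then issue a second request to
whatever the Location header names instead of stopping at the empty 302
response. The cap on max_redirects is there so a redirect loop raises an
error instead of spinning forever.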