Re: [Haskell-cafe] Downloading web page in Haskell
2010/11/20 José Romildo Malaquias <j.romi...@gmail.com>:
> In order to download a given web page, I wrote the attached program. The
> problem is that the page is not being fully downloaded. It is being
> somehow interrupted. Any clues on how to solve this problem?

My guess is that there's a character encoding issue. Another approach would
be using the http-enumerator package [1]. The equivalent program is:

    module Main where

    import Network.HTTP.Enumerator (simpleHttp)
    import qualified Data.ByteString.Lazy as L

    main = do
        src <- simpleHttp "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
        L.writeFile "test.html" src
        L.putStrLn src

Michael

[1] http://hackage.haskell.org/package/http-enumerator

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Downloading web page in Haskell
michael:
> 2010/11/20 José Romildo Malaquias <j.romi...@gmail.com>:
>> In order to download a given web page, I wrote the attached program.
>> The problem is that the page is not being fully downloaded. It is
>> being somehow interrupted. Any clues on how to solve this problem?
>
> My guess is that there's a character encoding issue. Another approach
> would be using the http-enumerator package [1].

FWIW, with this url, I get the same problem using the Curl package (via
download-curl):

    import Network.Curl.Download
    import qualified Data.ByteString as B

    main = do
        edoc <- openURI "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
        case edoc of
            Left err  -> print err
            Right doc -> B.writeFile "test.html" doc

Not a problem on e.g. http://haskell.org

-- Don
Re: [Haskell-cafe] Downloading web page in Haskell
On Saturday 20 November 2010 21:47:52, Don Stewart wrote:
> 2010/11/20 José Romildo Malaquias <j.romi...@gmail.com>:
>> In order to download a given web page, I wrote the attached program.
>> The problem is that the page is not being fully downloaded. It is
>> being somehow interrupted. Any clues on how to solve this problem?
>
> FWIW, with this url, I get the same problem using the Curl package
>
> Not a problem on e.g. http://haskell.org
>
> -- Don

Just for the record, wget also gets a truncated (at the same point) file,
so it's not a Haskell problem.
Re: [Haskell-cafe] Downloading web page in Haskell
On Sat, Nov 20, 2010 at 10:26:49PM +0100, Daniel Fischer wrote:
> On Saturday 20 November 2010 21:47:52, Don Stewart wrote:
>> FWIW, with this url, I get the same problem using the Curl package
>
> Just for the record, wget also gets a truncated (at the same point)
> file, so it's not a Haskell problem.

Web browsers like Firefox and Opera do not seem to have the same problem
with this web page. I would like to be able to download this page from
Haskell.

Romildo
Re: [Haskell-cafe] Downloading web page in Haskell
José Romildo Malaquias wrote:
> Web browsers like Firefox and Opera do not seem to have the same
> problem with this web page. I would like to be able to download this
> page from Haskell.

Hi Romildo,

This web page serves the head, including a lot of JavaScript, and the
first few hundred bytes of the body, then pauses. That causes web browsers
to begin loading and executing the JavaScript. Apparently, the site only
continues serving the rest of the page if the JavaScript is actually
loaded and executed. If not, it aborts. Either intentionally or
unintentionally, that effectively prevents naive scripts from accessing
the page. Cute technique.

So if you don't want to honor the site author's intention not to allow
scripts to load the page, try looking through the JavaScript and find out
what you need to do to get the page to continue loading. However, if the
site author is very determined to stop you, the JavaScript will be
obfuscated or encrypted, which would make this an annoying task.

Good luck,
Yitz
Re: [Haskell-cafe] Downloading web page in Haskell
On Nov 20, 2010, at 5:10 PM, Yitzchak Gale wrote:
> This web page serves the head, including a lot of JavaScript, and the
> first few hundred bytes of the body, then pauses. That causes web
> browsers to begin loading and executing the JavaScript. Apparently, the
> site only continues serving the rest of the page if the JavaScript is
> actually loaded and executed. If not, it aborts.

Actually, I think it's just a misconfigured proxy. The curl executable
fails at the same point, but a `curl --compressed` call succeeds. The curl
bindings don't allow you to automatically get and decompress gzip data, so
you could either set the Accept-Encoding: gzip header yourself and then
pipe the output through the appropriate decompression routine, or, more
simply, just get the page by using System.Process to drive the curl
binary directly.

Cheers,
Sterl
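[Editor's note: the System.Process route Sterl describes can be sketched as below. This is a minimal sketch, not code from the thread; it assumes a curl binary on the PATH, and error handling is omitted.]

```haskell
-- Drive the curl binary via System.Process, as Sterl suggests.
-- The --compressed flag makes curl send an Accept-Encoding header and
-- decompress the response itself, sidestepping the truncation.
import System.Process (readProcess)

url :: String
url = "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"

main :: IO ()
main = do
    -- readProcess runs curl and captures its standard output as a String
    doc <- readProcess "curl" ["--silent", "--compressed", url] ""
    writeFile "test.html" doc
```

Note that readProcess returns a String, so this is not binary-safe; for arbitrary content you would instead have curl write to a file with -o.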
Re: [Haskell-cafe] Downloading web page in Haskell
On 10-11-20 02:54 PM, José Romildo Malaquias wrote:
> In order to download a given web page, I wrote the attached program. The
> problem is that the page is not being fully downloaded. It is being
> somehow interrupted.

The specific website and url

    http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne

truncates when the web server chooses the identity encoding (i.e., as
opposed to compressed ones such as gzip). The server chooses identity when
your request's Accept-Encoding field specifies identity, or simply when
your request has no Accept-Encoding field, such as when you use
simpleHTTP (getRequest url), curl, wget, or elinks.

When the server chooses gzip (its favourite), which is when your
Accept-Encoding field includes gzip, the received data is complete (but
then you have to gunzip it yourself). This happens with mainstream
browsers and with W3C's validator at validator.w3.org (which destroys the
"you need javascript" hypothesis). I haven't tested other compressed
encodings.

Methodology:

My methodology of discovering and confirming this is a great lesson in the
triumph of the scientific methodology (over the prevailing opinionative
methodology, for example).

The first step is to confirm or deny a Network.HTTP problem. For a
maximally controlled experiment, I enter HTTP by hand using nc:

    $ nc www.adorocinema.com 80
    GET /common/search/search_by_film/?criteria=Bourne HTTP/1.1
    Host: www.adorocinema.com
    <blank line>

It still truncates, so at least Network.HTTP is not alone. I also try
elinks. Other people try curl and wget for the same reason and with the
same result.

The second step is to confirm or deny javascript magic. Actually, the
truncation strongly suggests that javascript is not involved: the
truncation ends with an incomplete end-tag "</". This is abnormal even for
very buggy javascript-heavy web pages.
To conclusively deny javascript magic, I first try Firefox with javascript
off (also java off, flash off, even css off), and then I also ask
validator.w3.org to validate the page. Both receive complete data. Of
course the validator is going to report many errors, but the point is that
if the validator reports errors at locations way beyond our truncation
point, then the validator sees data we don't see, and the validator
doesn't even care about javascript. The validator may be very
sophisticated in parsing html, but in sending an HTTP request it ought to
be very simple-minded.

The third step is to find out what extra thing the validator does to
deserve complete data. So I try diagonalization: I give this CGI script to
the validator:

    #! /bin/sh
    echo 'Content-Type: text/html'
    echo ''
    e=`env`
    cat <<EOF
    <html><head><title>title</title></head><body><pre>
    $e
    </pre></body></html>
    EOF

If I also tell the validator to show source, which means display what html
code it sees, then I see the validator's request reflected back (barring
web server support, but modern web servers probably support much more than
the original minimal CGI specification). There are indeed quite a few
extra header fields the validator sends, and I can try to mimic each of
them. Eventually I find this crucial field:

    Accept-Encoding: gzip, x-gzip, deflate

Further tests confirm that we just need gzip.

Finally, to confirm the finding with a maximally controlled experiment, I
enter the improved request by hand using nc, but this time I save the
output in a file (so later I can decompress it):

    $ nc -q 10 www.adorocinema.com 80 > save.gz
    GET /common/search/search_by_film/?criteria=Bourne HTTP/1.1
    Host: www.adorocinema.com
    Accept-Encoding: gzip
    <blank line>
    <wait a while>

Now save.gz contains both header and body, and it only makes sense to
uncompress the body. So edit save.gz to delete the header part. Applying
gunzip to the body will give some "unexpected end of file" error. Don't
despair.
Do this instead:

    $ zcat save.gz > save.html

It still reports an error, but save.html has meaningful and complete
content. You can examine it. You can load it in a web browser and see. At
least, it is much longer, and it ends with </html> rather than </.
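[Editor's note: the zcat step has a direct analogue in the zlib bindings. A hedged sketch, not code from the thread; it assumes the HTTP headers have already been deleted, leaving just the gzip body in a file named save.body (a name chosen here for illustration).]

```haskell
-- Roughly the zcat step in Haskell: lazily decompress the saved gzip body.
-- Assumes save.body holds only the body, headers already stripped.
-- Like zcat, lazy decompression emits as much output as it can before the
-- truncated tail of the stream raises an error.
import qualified Data.ByteString.Lazy as LB
import Codec.Compression.GZip (decompress)

main :: IO ()
main = do
    body <- LB.readFile "save.body"
    LB.writeFile "save.html" (decompress body)
```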
Re: [Haskell-cafe] Downloading web page in Haskell
Albert Y. C. Lai wrote:
> ...truncates when the web server chooses the identity encoding
> The server chooses identity when your request's Accept-Encoding field
> specifies identity, or simply when your request has no Accept-Encoding
> field

Excellent work!

> My methodology of discovering and confirming this is a great lesson in
> the triumph of the scientific methodology (over the prevailing
> opinionative methodology, for example).

Haha, indeed!

> Actually, the truncation strongly suggests that javascript is not
> involved: the truncation ends with an incomplete end-tag "</". This is
> abnormal even for very buggy javascript-heavy web pages.

Well, no, the theory was that the server sends some random number of bytes
from the body to ensure that the browser starts loading the scripts in the
head. So it could stop anywhere.

In the end, I think you didn't really need the W3C validator. You also
could have triangulated on the headers sent by your own browser.

So, there you have it, folks. The Haskell community debugs a broken web
server, without being asked, and without access to the server.
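[Editor's note: the triangulation Yitz mentions is easy to do by hand: listen on a local port, point the browser at it, and read off the headers. A sketch under stated assumptions, using the network package's simple Network module, which was current at the time of this thread; the port number 8080 is arbitrary.]

```haskell
-- Print the request headers your own browser sends: accept one connection
-- on localhost:8080 and echo lines until the blank line that ends the
-- header section. Start this, then visit http://localhost:8080/ in a
-- browser; the Accept-Encoding field will be among the output.
import Network (PortID (PortNumber), accept, listenOn)
import System.IO (Handle, hClose, hGetLine)

printHeaders :: Handle -> IO ()
printHeaders h = do
    l <- hGetLine h
    if l == "\r" || l == ""
        then return ()              -- blank line ends the headers
        else putStrLn l >> printHeaders h

main :: IO ()
main = do
    sock <- listenOn (PortNumber 8080)
    (h, _, _) <- accept sock
    printHeaders h
    hClose h
```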
Re: [Haskell-cafe] Downloading web page in Haskell
Most likely you also have the zlib package (cabal-install needs it), so
let's use it. Attached: therefore.hs

    import qualified Data.ByteString.Lazy as LB
    import Codec.Compression.GZip (decompress)
    import Network.URI (parseURI)
    import Network.HTTP

    url = "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"

    -- steal from getRequest but change the type;
    -- it's a lazy bytestring because the GZip library specifies it
    myGetRequest :: String -> Request LB.ByteString
    myGetRequest s = case parseURI s of
        Nothing  -> error "url syntax error"
        Just uri -> mkRequest GET uri

    main = do
        result <- simpleHTTP (insertHeader HdrAcceptEncoding "gzip"
                                           (myGetRequest url))
        case result of
            Left e    -> print e
            Right rsp -> do
                let src = case findHeader HdrContentEncoding rsp of
                            Nothing     -> rspBody rsp
                            Just "gzip" -> decompress (rspBody rsp)
                            Just _      -> error "TODO: other decompressions"
                LB.writeFile "test.html" src
                LB.putStrLn src