On 10-11-20 02:54 PM, José Romildo Malaquias wrote:
In order to download a given web page, I wrote the attached program. The
problem is that the page is not being fully downloaded. It is being
somehow interrupted.

The specific URL
http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne

truncates when the web server chooses the identity encoding (that is, no compression, as opposed to compressed encodings such as gzip). The server chooses identity when your request's Accept-Encoding field specifies identity, or when your request has no Accept-Encoding field at all, which is the case when you use simpleHTTP (getRequest url), curl, wget, or elinks.
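
For concreteness, here is a minimal sketch of the kind of program that runs into this (my reconstruction, assuming the HTTP package; not necessarily the attached program):

import Network.HTTP (simpleHTTP, getRequest, getResponseBody)

-- A reconstruction for illustration: getRequest builds a request with
-- no Accept-Encoding field, so the server picks identity and the body
-- arrives truncated.
main :: IO ()
main = do
    let url = "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
    body <- simpleHTTP (getRequest url) >>= getResponseBody
    putStrLn body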

When the server chooses gzip (its favourite), which is when your Accept-Encoding field includes gzip, the received data is complete (but then you have to gunzip it yourself). This happens with mainstream browsers and W3C's validator at validator.w3.org (which destroys the "you need javascript" hypothesis). I haven't tested other compressed encodings.
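
One fix on the Haskell side, then, is to ask for gzip explicitly and gunzip the answer yourself. A minimal sketch, assuming the HTTP and zlib packages (building the request record by hand is my choice, so that the Accept-Encoding field can be added and the lazy ByteString body can feed the decompressor):

import Network.HTTP
import Network.URI (parseURI)
import qualified Data.ByteString.Lazy as L
import Codec.Compression.GZip (decompress)

main :: IO ()
main = do
    let url = "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
    uri <- maybe (fail "bad URI") return (parseURI url)
    -- Build the request by hand so it carries Accept-Encoding: gzip.
    let req = Request { rqURI     = uri
                      , rqMethod  = GET
                      , rqHeaders = [mkHeader HdrAcceptEncoding "gzip"]
                      , rqBody    = L.empty
                      }
    body <- simpleHTTP req >>= getResponseBody
    -- The body arrives gzip-compressed; decompress it ourselves.
    L.putStr (decompress body)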

Methodology

My process of discovering and confirming this is a great lesson in the triumph of the scientific method (over the prevailing opinion-driven method, for example).

The first step is to confirm or deny a Network.HTTP problem. For a maximally controlled experiment, I enter HTTP by hand using nc:

$ nc www.adorocinema.com 80
GET /common/search/search_by_film/?criteria=Bourne HTTP/1.1
Host: www.adorocinema.com
<blank line>

It still truncates, so at least Network.HTTP is not alone to blame. I also try elinks. Other people try curl and wget for the same reason, with the same result.

The second step is to confirm or deny javascript magic. Actually, the truncation itself strongly suggests that javascript is not involved: the truncation ends with an incomplete end-tag "</". That is abnormal even for very buggy, javascript-heavy web pages. To deny javascript magic for certain, I first try Firefox with javascript off (also java off, flash off, even css off), and then I also ask validator.w3.org to validate the page. Both receive complete data. Of course the validator is going to say "many errors", but the point is that if the validator reports errors at locations way beyond our truncation point, then the validator sees data we don't see, and the validator doesn't even care about javascript.

The validator may be very sophisticated in parsing html, but in sending an HTTP request it ought to be very simple-minded. The third step is to find out what extra thing the validator does to deserve complete data. So I try diagonalization: I give this CGI script to the validator:

#! /bin/sh

# Emit the mandatory CGI header, then echo the CGI environment
# (which contains the request's header fields) back as an html page.
echo 'Content-Type: text/html'
echo ''
e=`env`
cat <<EOF
<html><head><title>title</title></head><body><pre>
$e
</pre></body></html>
EOF

If I also tell the validator to "show source", i.e., to display the html code it sees, then I get to see the validator's request: CGI maps each request header field to an HTTP_* environment variable (provided the web server passes them through, but modern web servers support much more than the original minimal CGI specification). There are indeed quite a few extra header fields the validator sends, and I can try to mimic each of them. Eventually I find this crucial field:

Accept-Encoding: gzip, x-gzip, deflate

Further tests confirm that we just need gzip.

Finally, to confirm the finding with a maximally controlled experiment, I enter the improved request by hand using nc, but this time I save the output in a file (so later I can decompress it):

$ nc -q 10 www.adorocinema.com 80 > save.gz
GET /common/search/search_by_film/?criteria=Bourne HTTP/1.1
Host: www.adorocinema.com
Accept-Encoding: gzip
<blank line>
<wait a while>

Now save.gz contains both header and body, and it only makes sense to uncompress the body. So edit save.gz to delete the header part.
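
If you would rather not edit by hand, here is a small sketch of the same chore in Haskell, assuming the bytestring package (the output name body.gz is just my choice):

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C

-- Drop everything up to and including the blank line ("\r\n\r\n") that
-- separates the HTTP header from the body, keeping only the gzip body.
dropHeader :: B.ByteString -> B.ByteString
dropHeader s = B.drop 4 rest
  where (_, rest) = B.breakSubstring (C.pack "\r\n\r\n") s

main :: IO ()
main = B.readFile "save.gz" >>= B.writeFile "body.gz" . dropHeader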

Applying gunzip to the body will give some "unexpected end of file" error. Don't despair. Do this instead:

$ zcat save.gz > save.html

zcat still reports an error, but save.html receives meaningful and complete content. You can examine it. You can load it in a web browser and see for yourself. At the very least, it is much longer, and it ends with "</html>" rather than "</".
