Re: [Haskell-cafe] Downloading web page in Haskell

2010-11-20 Thread Michael Snoyman
2010/11/20 José Romildo Malaquias j.romi...@gmail.com:
 In order to download a given web page, I wrote the attached program. The
 problem is that the page is not being fully downloaded. It is being
 somehow interrupted.

 Any clues on how to solve this problem?

My guess is that there's a character encoding issue. Another approach
would be using the http-enumerator package[1]. The equivalent program
is:

module Main where

import Network.HTTP.Enumerator (simpleHttp)
import qualified Data.ByteString.Lazy as L

main =
  do src <- simpleHttp "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
     L.writeFile "test.html" src
     L.putStrLn src

Michael

[1] http://hackage.haskell.org/package/http-enumerator


Re: [Haskell-cafe] Downloading web page in Haskell

2010-11-20 Thread Don Stewart
michael:
 2010/11/20 José Romildo Malaquias j.romi...@gmail.com:
  In order to download a given web page, I wrote the attached program. The
  problem is that the page is not being fully downloaded. It is being
  somehow interrupted.
 
  Any clues on how to solve this problem?
 
 My guess is that there's a character encoding issue. Another approach
 would be using the http-enumerator package[1]. The equivalent program
 is:
 
 module Main where
 
 import Network.HTTP.Enumerator (simpleHttp)
 import qualified Data.ByteString.Lazy as L
 
 main =
   do src <- simpleHttp "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
      L.writeFile "test.html" src
      L.putStrLn src
 


FWIW, with this url, I get the same problem using the Curl package (via
download-curl):

import Network.Curl.Download
import qualified Data.ByteString as B

main = do
    edoc <- openURI "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"
    case edoc of
        Left err  -> print err
        Right doc -> B.writeFile "test.html" doc
 

Not a problem on e.g. http://haskell.org

-- Don


Re: [Haskell-cafe] Downloading web page in Haskell

2010-11-20 Thread Daniel Fischer
On Saturday 20 November 2010 21:47:52, Don Stewart wrote:
  2010/11/20 José Romildo Malaquias j.romi...@gmail.com:
   In order to download a given web page, I wrote the attached program.
   The problem is that the page is not being fully downloaded. It is
   being somehow interrupted.
  
   Any clues on how to solve this problem?

 FWIW, with this url, I get the same problem using the Curl package

Just for the record, wget also gets a file truncated at the same point,
so it's not a Haskell problem.


 Not a problem on e.g. http://haskell.org

 -- Don



Re: [Haskell-cafe] Downloading web page in Haskell

2010-11-20 Thread José Romildo Malaquias
On Sat, Nov 20, 2010 at 10:26:49PM +0100, Daniel Fischer wrote:
 On Saturday 20 November 2010 21:47:52, Don Stewart wrote:
   2010/11/20 José Romildo Malaquias j.romi...@gmail.com:
In order to download a given web page, I wrote the attached program.
The problem is that the page is not being fully downloaded. It is
being somehow interrupted.
   
Any clues on how to solve this problem?
 
  FWIW, with this url, I get the same problem using the Curl package
 
 Just for the record, wget also gets a file truncated at the same point,
 so it's not a Haskell problem.

Web browsers like Firefox and Opera do not seem to have the same
problem with this web page.

I would like to be able to download this page from Haskell.

Romildo


Re: [Haskell-cafe] Downloading web page in Haskell

2010-11-20 Thread Yitzchak Gale
José Romildo Malaquias wrote:
 Web browsers like Firefox and Opera do not seem to have the same
 problem with this web page.
 I would like to be able to download this page from Haskell.

Hi Romildo,

This web page serves the head, including a lot of JavaScript,
and the first few hundred bytes of the body, then pauses.
That causes web browsers to begin loading and executing
the JavaScript. Apparently, the site only continues serving
the rest of the page if the JavaScript is actually loaded and
executed. If not, it aborts.

Either intentionally or unintentionally, that effectively prevents
naive scripts from accessing the page. Cute technique.

So if you don't want to honor the site author's intention not
to allow scripts to load the page, try looking through the
JavaScript to find out what you need to do to get the page to
continue loading. However, if the site author is very determined
to stop you, the JavaScript will be obfuscated or encrypted,
which would make this an annoying task.

Good luck,
Yitz


Re: [Haskell-cafe] Downloading web page in Haskell

2010-11-20 Thread Sterling Clover

On Nov 20, 2010, at 5:10 PM, Yitzchak Gale wrote:

 José Romildo Malaquias wrote:
 Web browsers like Firefox and Opera do not seem to have the same
 problem with this web page.
 I would like to be able to download this page from Haskell.
 
 Hi Romildo,
 
 This web page serves the head, including a lot of JavaScript,
 and the first few hundred bytes of the body, then pauses.
 That causes web browsers to begin loading and executing
 the JavaScript. Apparently, the site only continues serving
 the rest of the page if the JavaScript is actually loaded and
 executed. If not, it aborts.

Actually, I think it's just a misconfigured proxy. The curl executable fails 
at the same point, but a curl --compressed call succeeds. The curl bindings 
don't allow you to automatically get and decompress gzip data, so you could 
either set the Accept-Encoding: gzip header yourself and then pipe the output 
through the appropriate decompression routine, or, more simply, just get the 
page by using System.Process to drive the curl binary directly.
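
Here is a minimal sketch of that System.Process route (assuming a curl
executable is on the PATH; error handling is omitted):

import System.Process (readProcess)

url :: String
url = "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"

main :: IO ()
main = do
  -- with --compressed, curl asks for gzip and decompresses the body itself
  body <- readProcess "curl" ["--silent", "--compressed", url] ""
  writeFile "test.html" body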

Cheers,
Sterl


Re: [Haskell-cafe] Downloading web page in Haskell

2010-11-20 Thread Albert Y. C. Lai

On 10-11-20 02:54 PM, José Romildo Malaquias wrote:

In order to download a given web page, I wrote the attached program. The
problem is that the page is not being fully downloaded. It is being
somehow interrupted.


The specific website and url
http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne

truncates when the web server chooses the identity encoding (i.e., as 
opposed to compressed ones such as gzip). The server chooses identity 
when your request's Accept-Encoding field specifies identity or simply 
when your request has no Accept-Encoding field, such as when you use 
simpleHTTP (getRequest url), curl, wget, or elinks.


When the server chooses gzip (its favourite), which is when your 
Accept-Encoding field includes gzip, the received data is complete (but 
then you have to gunzip it yourself). This happens with mainstream 
browsers and W3C's validator at validator.w3.org (which destroys the 
"you need javascript" hypothesis). I haven't tested other compressed 
encodings.


Methodology

My methodology of discovering and confirming this is a great lesson in 
the triumph of the scientific methodology (over the prevailing 
opinionative methodology, for example).


The first step is to confirm or deny a Network.HTTP problem. For a 
maximally controlled experiment, I enter HTTP by hand using nc:


$ nc www.adorocinema.com 80
GET /common/search/search_by_film/?criteria=Bourne HTTP/1.1
Host: www.adorocinema.com
(blank line)

It still truncates, so at least Network.HTTP is not alone. I also try 
elinks. Other people try curl and wget for the same reason, with the same 
result.


The second step is to confirm or deny javascript magic. Actually the 
truncation strongly suggests that javascript is not involved: the 
truncation ends with an incomplete end-tag "</". This is abnormal even 
for very buggy javascript-heavy web pages. To deny javascript magic for 
certain, I first try Firefox with javascript off (also java off, flash 
off, even css off), and then I also ask validator.w3.org to validate the 
page. Both receive complete data. Of course the validator is going to 
report many errors, but the point is that if the validator reports errors 
at locations way beyond our truncation point, then the validator sees 
data we don't see, and the validator doesn't even care about javascript.


The validator may be very sophisticated in parsing html, but in sending 
an HTTP request it ought to be very simple-minded. The third step is to 
find out what extra thing the validator does to deserve complete data. 
So I try diagonalization: I give this CGI script to the validator:


#! /bin/sh

echo 'Content-Type: text/html'
echo ''
e=`env`
cat <<EOF
<html><head><title>title</title></head><body><pre>
$e
</pre></body></html>
EOF

If I also tell the validator to "show source", which means to display what 
html code it sees, then I see the validator's request (barring web 
server support, but modern web servers probably support much more than 
the original minimal CGI specification). There are indeed quite a few 
extra header fields the validator sends, and I can try to mimic each of 
them. Eventually I find this crucial field:


Accept-Encoding: gzip, x-gzip, deflate

Further tests confirm that we just need gzip.

Finally, to confirm the finding with a maximally controlled experiment, 
I enter the improved request by hand using nc, but this time I save the 
output in a file (so later I can decompress it):


$ nc -q 10 www.adorocinema.com 80 > save.gz
GET /common/search/search_by_film/?criteria=Bourne HTTP/1.1
Host: www.adorocinema.com
Accept-Encoding: gzip
(blank line)
(wait a while)

Now save.gz contains both header and body, and it only makes sense to 
uncompress the body. So edit save.gz to delete the header part.


Applying gunzip to the body will give an "unexpected end of file" 
error. Don't despair. Do this instead:


$ zcat save.gz > save.html

It still reports an error, but save.html has meaningful and complete 
content. You can examine it. You can load it in a web browser and see. At 
least, it is much longer, and it ends with "</html>" rather than "</".




Re: [Haskell-cafe] Downloading web page in Haskell

2010-11-20 Thread Yitzchak Gale
Albert Y. C. Lai wrote:
 ...truncates when the web server chooses the identity encoding
 The server chooses identity when
 your request's Accept-Encoding field specifies identity or simply your
 request has no Accept-Encoding field

Excellent work!

 My methodology of discovering and confirming this is a great lesson in the
 triumph of the scientific methodology (over the prevailing opinionative
 methodology, for example).

Haha, indeed!

 Actually the
 truncation strongly suggests that javascript is not involved: the truncation
 ends with an incomplete end-tag "</". This is abnormal even for very buggy
 javascript-heavy web pages.

Well, no, the theory was that the server sends some random
number of bytes from the body to ensure that the browser
starts loading the scripts in the head. So it could stop anywhere.

In the end, I think you didn't really need the W3C validator.
You also could have triangulated on the headers sent by your
own browser.

So, there you have it, folks. The Haskell community debugs
a broken web server, without being asked, and without access
to the server.


Re: [Haskell-cafe] Downloading web page in Haskell

2010-11-20 Thread Albert Y. C. Lai
Most likely you also have the zlib package (cabal-install needs it), so 
let's use it. Attached: therefore.hs


import qualified Data.ByteString.Lazy as LB
import Codec.Compression.GZip(decompress)
import Network.URI(parseURI)
import Network.HTTP

url = "http://www.adorocinema.com/common/search/search_by_film/?criteria=Bourne"

-- steal from getRequest but change type
-- it's lazy bytestring because the GZip library specifies it
myGetRequest :: String -> Request LB.ByteString
myGetRequest s = case parseURI s of
  Nothing  -> error "url syntax error"
  Just uri -> mkRequest GET uri

main = do
  result <- simpleHTTP (insertHeader HdrAcceptEncoding "gzip" (myGetRequest url))
  case result of
    Left e -> print e
    Right rsp -> do
      let src = case findHeader HdrContentEncoding rsp of
            Nothing     -> rspBody rsp
            Just "gzip" -> decompress (rspBody rsp)
            Just _      -> error "TODO: other decompressions"
      LB.writeFile "test.html" src
      LB.putStrLn src