Re: [Haskell-cafe] Converting wiki pages into pdf

2012-05-24 Thread Dirk Hünniger

I invested an enormous amount of time in this problem and, as a result,
have a solution that works very well.


http://de.wikibooks.org/wiki/Benutzer:Dirk_Huenniger/wb2pdf
http://en.wikibooks.org/wiki/File:Haskell.pdf

I am happy if you find it useful.
Yours Dirk Hünniger


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Converting wiki pages into pdf

2011-09-09 Thread Michael Snoyman

On Fri, Sep 9, 2011 at 3:16 PM, mukesh tiwari
 wrote:
>
> Thank you all for replying. I managed to write a Python script. It depends
> on PyQt4. I am curious whether we have anything like PyQt4 in Haskell.
>
> import sys
> from PyQt4.QtCore import *
> from PyQt4.QtGui import *
> from PyQt4.QtWebKit import *
>
> #http://www.rkblog.rk.edu.pl/w/p/webkit-pyqt-rendering-web-pages/
> #http://pastebin.com/xunfQ959
> #http://bharatikunal.wordpress.com/2010/01/31/converting-html-to-pdf-with-python-and-qt/
> #http://www.riverbankcomputing.com/pipermail/pyqt/2009-January/021592.html
>
> def convertFile( ):
>     web.print_( printer )
>     print "done"
>     QApplication.exit()
>
>
> if __name__=="__main__":
>     url = raw_input("enter url:")
>     filename = raw_input("enter file name:")
>     app = QApplication( sys.argv )
>     web = QWebView()
>     web.load(QUrl( url ))
>     #web.show()
>     printer = QPrinter( QPrinter.HighResolution )
>     printer.setPageSize( QPrinter.A4 )
>     printer.setOutputFormat( QPrinter.PdfFormat )
>     printer.setOutputFileName(  filename + ".pdf" )
>     QObject.connect( web ,  SIGNAL("loadFinished(bool)"), convertFile  )
>     sys.exit(app.exec_())
>
>

Re: [Haskell-cafe] Converting wiki pages into pdf

2011-09-09 Thread mukesh tiwari

Thank you all for replying. I managed to write a Python script. It depends
on PyQt4. I am curious whether we have anything like PyQt4 in Haskell.

import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

#http://www.rkblog.rk.edu.pl/w/p/webkit-pyqt-rendering-web-pages/
#http://pastebin.com/xunfQ959
#http://bharatikunal.wordpress.com/2010/01/31/converting-html-to-pdf-with-python-and-qt/
#http://www.riverbankcomputing.com/pipermail/pyqt/2009-January/021592.html

def convertFile( ):
    web.print_( printer )
    print "done"
    QApplication.exit()


if __name__=="__main__":
    url = raw_input("enter url:")
    filename = raw_input("enter file name:")
    app = QApplication( sys.argv )
    web = QWebView()
    web.load(QUrl( url ))
    #web.show()
    printer = QPrinter( QPrinter.HighResolution )
    printer.setPageSize( QPrinter.A4 )
    printer.setOutputFormat( QPrinter.PdfFormat )
    printer.setOutputFileName(  filename + ".pdf" )
    QObject.connect( web ,  SIGNAL("loadFinished(bool)"), convertFile  )
    sys.exit(app.exec_())



Re: [Haskell-cafe] Converting wiki pages into pdf

2011-09-08 Thread Matti Oinas

The whole Wikipedia database can also be downloaded, if that is any help.

http://en.wikipedia.org/wiki/Wikipedia:Database_download

That page also says: "Please do not use a web
crawler to download large numbers of articles. Aggressive crawling of
the server can cause a dramatic slow-down of Wikipedia."

Matti


Re: [Haskell-cafe] Converting wiki pages into pdf

2011-09-08 Thread Kyle Murphy

It's worth pointing out at this point (as alluded to by Conrad) that what
you're attempting might be considered somewhat rude, and possibly slightly
illegal (depending on the insanity of the legal system in question).
Automated site scraping (what you're essentially doing) is generally frowned
upon by most hosts unless it follows some very specific guidelines, usually
at a minimum respecting the restrictions specified in the robots.txt file
contained in the domain's root. Furthermore, depending on the type of data in
question, and on whether an EULA was agreed to if the site requires an account,
any kind of automated processing might be disallowed. Now, I think Wikipedia
has a fairly lenient policy, or at least I hope it does considering it's
community driven, but depending on how much of Wikipedia you're planning on
crawling you might at the very least consider severely throttling the process
to keep from sucking up too much bandwidth.
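
The throttling mentioned above could be sketched roughly as follows. This is
only a minimal illustration using `base`; the helper name `throttled` and the
choice of pause length are assumptions made for the example, not anything
prescribed by Wikipedia or by a library.

```haskell
import Control.Concurrent (threadDelay)

-- | Run the given actions one after another, sleeping for the given
-- number of microseconds between consecutive actions (there is no
-- pause after the last one). 'throttled' is a hypothetical helper,
-- not part of any library.
throttled :: Int -> [IO a] -> IO [a]
throttled _ [] = return []
throttled _ [act] = fmap (: []) act
throttled pauseMicros (act : rest) = do
  result <- act
  threadDelay pauseMicros          -- wait before issuing the next request
  others <- throttled pauseMicros rest
  return (result : others)
```

For example, `throttled 1000000 (map fetch urls)` would leave one second
between requests, where `fetch` stands for whatever download action is in use.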

On the topic of how to actually perform that crawl, you should probably
check out the format of the link provided in the download PDF element. After
looking at an article (note, I'm basing this off a quick glance at a single
page) it looks like you should be able to modify the URL provided in the
"Permanent link" element to generate the PDF link by changing the title
argument to arttitle, adding a new title argument with the value
"Special:Book", and adding the new arguments "bookcmd=render_article" and
"writer=rl". For example, if the permanent link to the article is:

http://en.wikipedia.org/w/index.php?title=Shapinsay&oldid=449266269

Then the PDF URL is:

http://en.wikipedia.org/w/index.php?arttitle=Shapinsay&oldid=449266269&title=Special:Book&bookcmd=render_article&writer=rl

This is all rather hacky as well, and none of it has been tested so it might
not actually work, although I see no reason why it shouldn't. It's also
fragile: if Wikipedia changes just about anything it could all break, but
that's the risk you run anytime you resort to site scraping.
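
The URL rewrite described above might be sketched like this. It is an
untested-against-Wikipedia illustration only: the names `splitOn`, `rename`
and `pdfUrlFromPermalink` are made up for the example, the query string is
assumed to be non-empty and already URL-encoded, and the last argument is
written as "writer=rl", as named in the description.

```haskell
import Data.List (intercalate)

-- | Split a string on a separator character (hypothetical helper).
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOn sep rest

-- | Rename the "title" argument to "arttitle"; leave other arguments alone.
rename :: String -> String
rename param = case break (== '=') param of
  ("title", value) -> "arttitle" ++ value   -- value still starts with '='
  _                -> param

-- | Turn a "Permanent link" URL into the PDF rendering URL, or Nothing
-- if the link has no query string at all.
pdfUrlFromPermalink :: String -> Maybe String
pdfUrlFromPermalink link = case break (== '?') link of
  (base, '?' : query) ->
    let args = map rename (splitOn '&' query)
               ++ ["title=Special:Book", "bookcmd=render_article", "writer=rl"]
    in  Just (base ++ "?" ++ intercalate "&" args)
  _ -> Nothing
```

Applied to the Shapinsay permanent link above, this produces the PDF URL
shown, with the arguments in the same order.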

-R. Kyle Murphy
--
Curiosity was framed, Ignorance killed the cat.


Re: [Haskell-cafe] Converting wiki pages into pdf

2011-09-08 Thread Conrad Parker

On Sep 9, 2011 7:33 AM, "mukesh tiwari" wrote:
>
> Thank you for the reply, Daniel. Considering my limited knowledge of web
> programming and JavaScript, first I need to simulate some sort of
> browser in my program, which will run the JavaScript and generate the
> PDF. After that I can download the PDF. Is this what you mean? Is
> Network.Browser of any help for this purpose? Is there a way to solve
> this problem?
> Sorry for the many questions, but this is my first web application
> program and I am trying hard to finish it.
>

Have you tried finding out if simple URLs exist for this, that don't require
Javascript? Does Wikipedia have a policy on this?

Conrad.


Re: [Haskell-cafe] Converting wiki pages into pdf

2011-09-08 Thread mukesh tiwari

Thank you for the reply, Daniel. Considering my limited knowledge of web
programming and JavaScript, first I need to simulate some sort of
browser in my program, which will run the JavaScript and generate the
PDF. After that I can download the PDF. Is this what you mean? Is
Network.Browser of any help for this purpose? Is there a way to solve
this problem?
Sorry for the many questions, but this is my first web application program
and I am trying hard to finish it.

Thank you
Mukesh Tiwari


Re: [Haskell-cafe] Converting wiki pages into pdf

2011-09-08 Thread Daniel Patterson

It looks to me as though the link is generated by JavaScript, so unless you
can script an actual browser into the loop, it may not be a viable approach.


Re: [Haskell-cafe] Converting wiki pages into pdf

2011-09-08 Thread mukesh tiwari

I tried to use the PDF-generation facilities. I wrote a script which
generates the rendering URL. When I paste the rendering URL into a
browser it generates the download file, but when I try to get
the tags, the result is empty. Could someone please tell me what is
wrong with the code?
Thank You
Mukesh Tiwari

import Network.HTTP
import Text.HTML.TagSoup
import Data.Maybe

-- If any attribute of the tag has the "Download a PDF version" text as its
-- value, build an absolute URL from the value of the tag's first attribute.
parseHelp :: Tag String -> Maybe String
parseHelp ( TagOpen _ y ) =
    if any ( \( _ , b ) -> b == "Download a PDF version of this wiki page" ) y
        then Just $ "http://en.wikipedia.org" ++ ( snd $ y !! 0 )
        else Nothing
parseHelp _ = Nothing

-- Return the first rendering URL found in the list of tags, if any.
parse :: [ Tag String ] -> Maybe String
parse [] = Nothing
parse ( x : xs )
   | isTagOpen x = case parseHelp x of
                        Just s -> Just s
                        Nothing -> parse xs
   | otherwise = parse xs

main :: IO ()
main = do
    x <- getLine
    tags_1 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest x ) --open url
    let lst = head . sections ( ~== "" ) $ tags_1
        url = fromJust . parse $ lst  --rendering url
    putStrLn url
    tags_2 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest url )
    print tags_2





Re: [Haskell-cafe] Converting wiki pages into pdf

2011-09-08 Thread mukesh tiwari

Is it possible to automate this process using Haskell, rather than
clicking and downloading manually?

Thank You
Mukesh Tiwari


Re: [Haskell-cafe] Converting wiki pages into pdf

2011-09-08 Thread Max Rabkin

This doesn't answer your Haskell question, but Wikipedia has
PDF-generation facilities ("Books"). Take a look at
http://en.wikipedia.org/wiki/Help:Book (for single articles, just use
the "download PDF" option in the sidebar).

--Max


[Haskell-cafe] Converting wiki pages into pdf

2011-09-08 Thread mukesh tiwari

Hello all,
I am trying to write a Haskell program which downloads HTML pages from
Wikipedia, including images, and converts them into PDF. I wrote a
small script:

import Network.HTTP
import Data.Maybe
import Data.List

main = do
       x <- getLine
       htmlpage <-  getResponseBody =<< simpleHTTP ( getRequest x ) -- open url
       --print.words $ htmlpage
       let ind_1 = fromJust . ( \n -> findIndex ( n `isPrefixOf`) . tails $ htmlpage ) $ ""
           ind_2 = fromJust . ( \n -> findIndex ( n `isPrefixOf`) . tails $ htmlpage ) $ ""
           tmphtml = drop ind_1 $ take ind_2  htmlpage
       writeFile "down.html" tmphtml

It works fine, except that some symbols are not rendered as they
should be. Could someone please suggest how to accomplish this
task?

Thank you
Mukesh Tiwari
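
The cut-between-two-markers step in the script above can be sketched as a
pure function. The two marker strings appear as empty strings in this copy of
the message, so the sketch takes them as parameters; the names `indexOf` and
`between` are hypothetical, and missing markers yield Nothing instead of the
crash that `fromJust` would give.

```haskell
import Data.List (findIndex, isPrefixOf, tails)

-- | Offset of the first occurrence of the marker in the text, if any.
indexOf :: String -> String -> Maybe Int
indexOf marker = findIndex (marker `isPrefixOf`) . tails

-- | Text from the first start marker (inclusive) up to the first end
-- marker (exclusive). As in the script, both offsets are measured from
-- the beginning of the text.
between :: String -> String -> String -> Maybe String
between start end text = do
  from <- indexOf start text
  to   <- indexOf end text
  return (drop from (take to text))
```

With the purely illustrative markers "<body" and "</body>", the input
"<html><body>hi</body></html>" would give Just "<body>hi".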

