Thank you all for replying. I managed to write a Python script; it depends on PyQt4. I am curious whether we have anything like PyQt4 in Haskell.
import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

# http://www.rkblog.rk.edu.pl/w/p/webkit-pyqt-rendering-web-pages/
# http://pastebin.com/xunfQ959
# http://bharatikunal.wordpress.com/2010/01/31/converting-html-to-pdf-with-python-and-qt/
# http://www.riverbankcomputing.com/pipermail/pyqt/2009-January/021592.html

def convertFile():
    web.print_(printer)
    print "done"
    QApplication.exit()

if __name__ == "__main__":
    url = raw_input("enter url:")
    filename = raw_input("enter file name:")
    app = QApplication(sys.argv)
    web = QWebView()
    web.load(QUrl(url))
    # web.show()
    printer = QPrinter(QPrinter.HighResolution)
    printer.setPageSize(QPrinter.A4)
    printer.setOutputFormat(QPrinter.PdfFormat)
    printer.setOutputFileName(filename + ".pdf")
    QObject.connect(web, SIGNAL("loadFinished(bool)"), convertFile)
    sys.exit(app.exec_())

On Fri, Sep 9, 2011 at 11:03 AM, Matti Oinas <matti.oi...@gmail.com> wrote:
> The whole Wikipedia database can also be downloaded if that is any help.
>
> http://en.wikipedia.org/wiki/Wikipedia:Database_download
>
> There is also text on that site saying "Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia."
>
> Matti
>
> 2011/9/9 Kyle Murphy <orc...@gmail.com>:
> > It's worth pointing out at this point (as alluded to by Conrad) that what you're attempting might be considered somewhat rude, and possibly slightly illegal (depending on the insanity of the legal system in question). Automated site scraping (which is essentially what you're doing) is generally frowned upon by most hosts unless it follows some very specific guidelines, usually at a minimum respecting the restrictions specified in the robots.txt file contained in the domain's root. Furthermore, depending on the type of data in question, and whether a EULA was agreed to if the site requires an account, any kind of automated processing might be disallowed. Now, I think Wikipedia has a fairly lenient policy, or at least I hope it does considering it's community-driven, but depending on how much of Wikipedia you're planning on crawling, you might at the very least consider severely throttling the process to keep from sucking up too much bandwidth.
> >
> > On the topic of how to actually perform that crawl, you should probably check out the format of the link provided in the download PDF element. After looking at an article (note, I'm basing this off a quick glance at a single page) it looks like you should be able to modify the URL provided in the "Permanent link" element to generate the PDF link by changing the title argument to arttitle, adding a new title argument with the value "Special:Book", and adding the new arguments "bookcmd=render_article" and "writer=rl". For example, if the permanent link to the article is:
> >
> > http://en.wikipedia.org/w/index.php?title=Shapinsay&oldid=449266269
> >
> > then the PDF URL is:
> >
> > http://en.wikipedia.org/w/index.php?arttitle=Shapinsay&oldid=449266269&title=Special:Book&bookcmd=render_article&writer=rl
> >
> > This is all rather hacky as well, and none of it has been tested, so it might not actually work, although I see no reason why it shouldn't. It's also fragile: if Wikipedia changes just about anything it could all break, but that's the risk you run any time you resort to site scraping.
> >
> > -R. Kyle Murphy
> > --
> > Curiosity was framed, Ignorance killed the cat.
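
[Interleaving a note here: following Kyle's URL recipe above, this is a rough, untested Haskell sketch of how that rendering URL could be built from a "Permanent link" URL. The helper names (renderUrl, splitOn) are mine, and the parameter names (arttitle, title=Special:Book, bookcmd=render_article, writer=rl) are taken straight from Kyle's example, so it will break if Wikipedia changes its PDF/Collection setup.]

import Data.List (intercalate)

-- Untested sketch: rewrite a "Permanent link" URL of the form
--   http://en.wikipedia.org/w/index.php?title=<Title>&oldid=<Revision>
-- into the Special:Book rendering URL described above.
renderUrl :: String -> Maybe String
renderUrl permalink =
    case break (== '?') permalink of
      (base, '?':query) ->
        let params  = map (break (== '=')) (splitOn '&' query)   -- [("title","=Shapinsay"), ...]
            rewrite ("title", v) = ("arttitle", v)                -- title -> arttitle
            rewrite p            = p
            params' = map rewrite params
                        ++ [ ("title",   "=Special:Book")
                           , ("bookcmd", "=render_article")
                           , ("writer",  "=rl") ]
        in Just $ base ++ "?" ++ intercalate "&" [ k ++ v | (k, v) <- params' ]
      _ -> Nothing
  where
    splitOn c s = case break (== c) s of
                    (chunk, [])     -> [chunk]
                    (chunk, _:rest) -> chunk : splitOn c rest

main :: IO ()
main = print (renderUrl "http://en.wikipedia.org/w/index.php?title=Shapinsay&oldid=449266269")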
> > On Thu, Sep 8, 2011 at 23:40, Conrad Parker <con...@metadecks.org> wrote:
> >>
> >> On Sep 9, 2011 7:33 AM, "mukesh tiwari" <mukeshtiwari.ii...@gmail.com> wrote:
> >> >
> >> > Thank you for the reply, Daniel. Considering my limited knowledge of web programming and JavaScript, first I need to simulate some sort of browser in my program which will run the JavaScript and generate the PDF. After that I can download the PDF. Is this what you mean? Is Network.Browser helpful for this purpose? Is there a way to solve this problem?
> >> > Sorry for the many questions, but this is my first web application program and I am trying hard to finish it.
> >> >
> >>
> >> Have you tried finding out if simple URLs exist for this, that don't require Javascript? Does Wikipedia have a policy on this?
> >>
> >> Conrad.
> >>
> >> >
> >> > On Fri, Sep 9, 2011 at 4:17 AM, Daniel Patterson <lists.hask...@dbp.mm.st> wrote:
> >> >>
> >> >> It looks to me that the link is generated by javascript, so unless you can script an actual browser into the loop, it may not be a viable approach.
> >> >>
> >> >> On Sep 8, 2011, at 3:57 PM, mukesh tiwari wrote:
> >> >>
> >> >> > I tried to use the PDF-generation facilities. I wrote a script which generates the rendering URL. When I paste the rendering URL in a browser it generates the download file, but when I try to get the tags, the result is empty. Could someone please tell me what is wrong with the code?
> >> >> > Thank You
> >> >> > Mukesh Tiwari
> >> >> >
> >> >> > import Network.HTTP
> >> >> > import Text.HTML.TagSoup
> >> >> > import Data.Maybe
> >> >> >
> >> >> > parseHelp :: Tag String -> Maybe String
> >> >> > parseHelp ( TagOpen _ y ) =
> >> >> >   if filter ( \( a , b ) -> b == "Download a PDF version of this wiki page" ) y /= []
> >> >> >     then Just $ "http://en.wikipedia.org" ++ ( snd $ y !! 0 )
> >> >> >     else Nothing
> >> >> >
> >> >> > parse :: [ Tag String ] -> Maybe String
> >> >> > parse [] = Nothing
> >> >> > parse ( x : xs )
> >> >> >   | isTagOpen x = case parseHelp x of
> >> >> >                     Just s  -> Just s
> >> >> >                     Nothing -> parse xs
> >> >> >   | otherwise   = parse xs
> >> >> >
> >> >> > main = do
> >> >> >   x <- getLine
> >> >> >   tags_1 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest x )    -- open url
> >> >> >   let lst = head . sections ( ~== "<div class=portal id=p-coll-print_export>" ) $ tags_1
> >> >> >       url = fromJust . parse $ lst                                              -- rendering url
> >> >> >   putStrLn url
> >> >> >   tags_2 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest url )
> >> >> >   print tags_2
>
> --
> /*******************************************************************/
> try {
>     log.trace("Id=" + request.getUser().getId() + " accesses " +
>         manager.getPage().getUrl().toString())
> } catch(NullPointerException e) {}
> /*******************************************************************/
>
> This is real code, but please make the world a bit better place and don't do it, ever.
>
> http://www.javacodegeeks.com/2011/01/10-tips-proper-application-logging.html
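
PS: since, as noted above, pasting the rendering URL in a browser already triggers the PDF download itself, my guess (untested) is that the last step should write the response body to disk rather than run it through parseTags. A minimal sketch under that assumption, using the HTTP package's ByteString support; the output name "article.pdf" is just a placeholder for illustration:

import Network.HTTP (simpleHTTP, mkRequest, getResponseBody, RequestMethod(GET))
import Network.URI  (parseURI)
import qualified Data.ByteString as B

-- Untested sketch: fetch the rendering URL and save the raw bytes to a file,
-- instead of feeding the (binary) PDF response to parseTags.
savePdf :: String -> FilePath -> IO ()
savePdf url out =
    case parseURI url of
      Nothing  -> putStrLn ("bad URL: " ++ url)
      Just uri -> do
        body <- getResponseBody =<< simpleHTTP (mkRequest GET uri)  -- body :: B.ByteString
        B.writeFile out body
        putStrLn ("wrote " ++ out)

main :: IO ()
main = do
    url <- getLine                 -- the rendering URL printed by the script above
    savePdf url "article.pdf"      -- placeholder output name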
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe