The whole Wikipedia database can also be downloaded, if that is any help:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
There is also text on that page saying: "Please do not use a web crawler
to download large numbers of articles. Aggressive crawling of the server
can cause a dramatic slow-down of Wikipedia."

Matti

2011/9/9 Kyle Murphy <orc...@gmail.com>:
> It's worth pointing out at this point (as alluded to by Conrad) that
> what you're attempting might be considered somewhat rude, and possibly
> slightly illegal (depending on the insanity of the legal system in
> question). Automated site scraping (what you're essentially doing) is
> generally frowned upon by most hosts unless it follows some very
> specific guidelines, usually at a minimum respecting the restrictions
> specified in the robots.txt file contained in the domain's root.
> Furthermore, depending on the type of data in question, and on any EULA
> you agreed to if the site requires an account, any kind of automated
> processing might be disallowed. Now, I think Wikipedia has a fairly
> lenient policy, or at least I hope it does considering it's
> community-driven, but depending on how much of Wikipedia you're
> planning on crawling you might at the very least consider severely
> throttling the process to keep from sucking up too much bandwidth.
>
> On the topic of how to actually perform that crawl, you should probably
> check out the format of the link provided in the "download PDF"
> element. After looking at an article (note, I'm basing this off a quick
> glance at a single page) it looks like you should be able to modify the
> URL provided in the "Permanent link" element to generate the PDF link
> by changing the title argument to arttitle, adding a new title argument
> with the value "Special:Book", and adding the new arguments
> "bookcmd=render_article" and "writer=rl". For example, if the permanent
> link to the article is:
>
> http://en.wikipedia.org/w/index.php?title=Shapinsay&oldid=449266269
>
> then the PDF URL is:
>
> http://en.wikipedia.org/w/index.php?arttitle=Shapinsay&oldid=449266269&title=Special:Book&bookcmd=render_article&writer=rl
>
> This is all rather hacky as well, and none of it has been tested, so it
> might not actually work, although I see no reason why it shouldn't.
> It's also fragile: if Wikipedia changes just about anything it could
> all break, but that's the risk you run any time you resort to site
> scraping.
>
> -R. Kyle Murphy
> --
> Curiosity was framed, Ignorance killed the cat.
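If Kyle's recipe holds up, the whole thing can be scripted directly, with
no JavaScript involved. A minimal, untested sketch in Haskell: renderUrl
and fetchPolitely are names I made up, the query parameters are exactly
the ones Kyle lists, and the five-second pause is an arbitrary stab at
the throttling he suggests.

import Control.Concurrent ( threadDelay )
import Network.HTTP ( getRequest , getResponseBody , simpleHTTP )

-- Build the rendering URL from an article title and revision id.
-- Assumes the title is already URL-encoded (underscores for spaces).
renderUrl :: String -> String -> String
renderUrl title oldid =
    "http://en.wikipedia.org/w/index.php?arttitle=" ++ title
        ++ "&oldid=" ++ oldid
        ++ "&title=Special:Book&bookcmd=render_article&writer=rl"

-- Fetch one rendered article, then sleep five seconds so repeated
-- calls stay gentle on the servers.
fetchPolitely :: String -> String -> IO String
fetchPolitely title oldid = do
    body <- getResponseBody =<< simpleHTTP ( getRequest ( renderUrl title oldid ) )
    threadDelay ( 5 * 1000000 ) -- argument is in microseconds
    return body

fetchPolitely "Shapinsay" "449266269" would then fetch the same document
as the example URL above.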
> On Thu, Sep 8, 2011 at 23:40, Conrad Parker <con...@metadecks.org> wrote:
>>
>> On Sep 9, 2011 7:33 AM, "mukesh tiwari" <mukeshtiwari.ii...@gmail.com>
>> wrote:
>> >
>> > Thank you for the reply, Daniel. Given my limited knowledge of web
>> > programming and JavaScript, I would first need to simulate some sort
>> > of browser in my program, which would run the JavaScript and
>> > generate the PDF, and after that I could download the PDF. Is that
>> > what you mean? Is Network.Browser any help for this purpose? Is
>> > there a way to solve this problem?
>> > Sorry for the many questions, but this is my first web application
>> > program and I am trying hard to finish it.
>> >
>>
>> Have you tried finding out if simple URLs exist for this, that don't
>> require JavaScript? Does Wikipedia have a policy on this?
>>
>> Conrad.
>>
>> > On Fri, Sep 9, 2011 at 4:17 AM, Daniel Patterson
>> > <lists.hask...@dbp.mm.st> wrote:
>> >>
>> >> It looks to me like the link is generated by JavaScript, so unless
>> >> you can script an actual browser into the loop, it may not be a
>> >> viable approach.
>> >>
>> >> On Sep 8, 2011, at 3:57 PM, mukesh tiwari wrote:
>> >>
>> >> > I tried to use the PDF-generation facilities. I wrote a script
>> >> > which generates the rendering URL. When I paste the rendering
>> >> > URL into a browser it generates the download file, but when I
>> >> > try to get the tags, the result is empty. Could someone please
>> >> > tell me what is wrong with the code?
>> >> > Thank you
>> >> > Mukesh Tiwari
>> >> >
>> >> > import Network.HTTP
>> >> > import Text.HTML.TagSoup
>> >> > import Data.Maybe
>> >> >
>> >> > -- if a tag carries the PDF-link tooltip, assume its first
>> >> > -- attribute is the href and build an absolute URL from it
>> >> > parseHelp :: Tag String -> Maybe String
>> >> > parseHelp ( TagOpen _ y ) =
>> >> >   if filter ( \( _ , b ) -> b == "Download a PDF version of this wiki page" ) y /= []
>> >> >     then Just $ "http://en.wikipedia.org" ++ ( snd $ y !! 0 )
>> >> >     else Nothing
>> >> >
>> >> > -- first open tag that parseHelp accepts
>> >> > parse :: [ Tag String ] -> Maybe String
>> >> > parse [] = Nothing
>> >> > parse ( x : xs )
>> >> >   | isTagOpen x = case parseHelp x of
>> >> >       Just s  -> Just s
>> >> >       Nothing -> parse xs
>> >> >   | otherwise = parse xs
>> >> >
>> >> > main = do
>> >> >   x <- getLine
>> >> >   -- fetch and parse the article page itself
>> >> >   tags_1 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest x )
>> >> >   let lst = head . sections ( ~== "<div class=portal id=p-coll-print_export>" ) $ tags_1
>> >> >       url = fromJust . parse $ lst -- the rendering URL
>> >> >   putStrLn url
>> >> >   -- fetch the rendering URL and try to parse that response too
>> >> >   tags_2 <- fmap parseTags $ getResponseBody =<< simpleHTTP ( getRequest url )
>> >> >   print tags_2
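A guess, by the way, about the empty tags_2 in the code above: if the
rendering URL does its job, the response body is the PDF document itself
(or a redirect to it), not an HTML page, so TagSoup finds nothing
tag-shaped to return. That body wants to be written to disk, not parsed.
Another untested sketch: savePdf is my own name for it, and Network.HTTP's
String interface is not binary-clean, so real code should fetch a
ByteString instead.

import Network.HTTP ( getRequest , getResponseBody , simpleHTTP )

-- Fetch the rendering URL and save the raw response body as a file,
-- instead of handing it to parseTags.
savePdf :: String -> FilePath -> IO ()
savePdf url path = do
    body <- getResponseBody =<< simpleHTTP ( getRequest url )
    writeFile path body -- the body should already be the PDF data

The last two lines of main above would then become
savePdf url "Shapinsay.pdf".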