Re: [Discuss] What's the best site-crawler utility?

2014-01-08 Thread Richard Pieri
Daniel Barrett wrote:
> Well, a script doesn't need human-readability. :-) Trust me, this is not hard. I did it a few years ago with minimal difficulty (using a couple of Emacs macros, if memory serves).

If you recall, the decision is that a novice has volunteered to take over as a way to learn

Re: [Discuss] What's the best site-crawler utility?

2014-01-08 Thread Bill Horne
On 1/8/2014 10:10 AM, Richard Pieri wrote:
> Daniel Barrett wrote:
>> Well, a script doesn't need human-readability. :-) Trust me, this is not hard. I did it a few years ago with minimal difficulty (using a couple of Emacs macros, if memory serves).
>
> If you recall, the decision is that a novice has

Re: [Discuss] What's the best site-crawler utility?

2014-01-08 Thread Richard Pieri
Bill Horne wrote:
> the result of mirroring a site would be a lot of separate HTML files, one for each link on the site. Is this not correct?

You'll get a lot of separate files, yes. What's in those files is something you need to see for yourself.

-- Rich P.

[Discuss] What's the best site-crawler utility?

2014-01-07 Thread Bill Horne
I need to copy the contents of a wiki into static pages, so please recommend a good web-crawler that can download an existing site into static content pages. It needs to run on Debian 6.0.

Bill
--
Bill Horne
339-364-8487

Re: [Discuss] What's the best site-crawler utility?

2014-01-07 Thread Richard Pieri
Bill Horne wrote:
> I need to copy the contents of a wiki into static pages, so please recommend a good web-crawler that can download an existing site into static content pages. It needs to run on Debian 6.0.

Remember that I wrote how wikis have a spate of problems? This is the biggest one.

Re: [Discuss] What's the best site-crawler utility?

2014-01-07 Thread Matthew Gillen
On 1/7/2014 6:49 PM, Bill Horne wrote:
> I need to copy the contents of a wiki into static pages, so please recommend a good web-crawler that can download an existing site into static content pages. It needs to run on Debian 6.0.

wget -k -m -np http://mysite is what I used to use. -k
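[Editor's note: for reference, a breakdown of those wget flags; these are standard GNU wget options, and http://mysite is a placeholder URL.]

  wget -k \    # --convert-links: rewrite links in the saved pages to point at the local copies
       -m \    # --mirror: shorthand for -r -N -l inf --no-remove-listing
       -np \   # --no-parent: never ascend above the starting directory when recursing
       http://mysite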

Re: [Discuss] What's the best site-crawler utility?

2014-01-07 Thread Matthew Gillen
On 1/7/2014 7:28 PM, Matthew Gillen wrote:
> On 1/7/2014 6:49 PM, Bill Horne wrote:
>> I need to copy the contents of a wiki into static pages, so please recommend a good web-crawler that can download an existing site into static content pages. It needs to run on Debian 6.0.
>
> wget -k -m -np

Re: [Discuss] What's the best site-crawler utility?

2014-01-07 Thread Richard Pieri
Matthew Gillen wrote:
> wget -k -m -np http://mysite

I've tried this. It's messy at best. Wiki pages aren't static HTML. They're dynamically generated and they come with all sorts of style sheets and embedded scripts. Yes, you can get the text, but it'll be text as rendered by a wiki. It

Re: [Discuss] What's the best site-crawler utility?

2014-01-07 Thread Richard Pieri
Daniel Barrett wrote:
> For instance, you can write a simple script to hit Special:AllPages (which links to every article on the wiki), and dump each page to HTML with curl or wget. (Special:AllPages displays only N links at a time,

Yes, but that's not human-readable. It's a dynamically
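[Editor's note: a minimal sketch of the approach Daniel describes, assuming a stock MediaWiki URL layout; wiki.example.com and the index.php entry point are placeholders, and since Special:AllPages paginates, a real script would also have to follow its "next page" links.]

  # Fetch the article index, extract the per-article links,
  # then save each article as a standalone HTML page.
  curl -s 'http://wiki.example.com/index.php?title=Special:AllPages' \
    | grep -o 'href="/index.php?title=[^"]*"' \
    | sed -e 's/^href="//' -e 's/"$//' \
    | grep -v 'Special:' \
    | while read -r path; do
        wget -q -p -k "http://wiki.example.com${path}"
      done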

Re: [Discuss] What's the best site-crawler utility?

2014-01-07 Thread Tom Metro
Matthew Gillen wrote:
> wget -k -m -np http://mysite

I create an emergency backup static version of dynamic sites using:

wget -q -N -r -l inf -p -k --adjust-extension http://mysite

The option -m is equivalent to -r -N -l inf --no-remove-listing, but I didn't want --no-remove-listing (I don't
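[Editor's note: what each of those options does; these are standard GNU wget flags, and http://mysite is a placeholder.]

  wget -q \                  # quiet: suppress progress output
       -N \                  # timestamping: skip files that haven't changed since the last run
       -r -l inf \           # recurse with no depth limit
       -p \                  # --page-requisites: also fetch the images, CSS, and scripts each page needs
       -k \                  # --convert-links: rewrite links to point at the local copies
       --adjust-extension \  # append .html to saved files that lack a matching extension
       http://mysite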

Re: [Discuss] What's the best site-crawler utility?

2014-01-07 Thread Greg Rundlett (freephile)
Hi Bill,

The GPL-licensed HTTrack Website Copier works well (http://www.httrack.com/). I have not tried it on a MediaWiki site, but it's pretty adept at copying websites, including dynamically generated ones. They say: It allows you to download a World Wide Web site from the Internet to a
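[Editor's note: in case it helps, a typical HTTrack invocation; the URL, output directory, and filter are placeholders. -O sets the mirror destination and the "+" scan rule keeps the crawl on the target host.]

  httrack "http://wiki.example.com/" -O ./wiki-mirror "+wiki.example.com/*" -v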

Re: [Discuss] What's the best site-crawler utility?

2014-01-07 Thread Greg Rundlett (freephile)
Also, I just discovered a MediaWiki extension written by Tim Starling that may suit your needs. As the name implies, it's for dumping to HTML: http://www.mediawiki.org/wiki/Extension:DumpHTML

As for processing the XML produced by export or MediaWiki dump tools, here is info on that XML schema
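[Editor's note: if the DumpHTML extension fits, it is driven by a maintenance script; a sketch assuming a standard MediaWiki install, where the paths are placeholders and the exact flags may vary by extension version.]

  cd /path/to/mediawiki/extensions/DumpHTML
  php dumpHTML.php -d /var/www/static-wiki   # -d names the directory that receives the rendered HTML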

Re: [Discuss] What's the best site-crawler utility?

2014-01-07 Thread Eric Chadbourne
Plus one for HTTrack. I used it a couple of months ago to convert a terrible hacked Joomla site to HTML. It was a pain to use at first, like having to use Firefox, but it worked as advertised. Hope that helps.

On Tue, Jan 7, 2014 at 10:34 PM, Greg Rundlett (freephile) g...@freephile.com wrote: