Thanks for the advice, all. I'm trying httrack now, but the other wget options are good to know about, especially Alex's point about saving a WARC file.
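For reference, the httrack command I'm starting from is roughly the one below; the URL, output directory, and filter are just placeholders, and I'm still adjusting the options:

httrack "http://www.example.org/" -O ./site-mirror "+*.example.org/*" -v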
One clarification: I definitely don't want to deal with the database, nor can I. We don't have admin or server access. Even if we did, I don't think preserving the db would be wise or necessary.

Best,
Eric

On Mon, Oct 6, 2014 at 9:24 AM, Alexander Duryee <alexanderdur...@gmail.com> wrote:
> I was dealing with a lot of sites that would shunt the user around based on
> their user agent (e.g. very old sites that had completely different pages
> for Netscape and IE), so I needed something neutral that wouldn't get
> caught in a browser-specific branch. Suffice it to say, nothing ever
> checks for Amiga browsers :)
>
> On Mon, Oct 6, 2014 at 12:08 PM, Little, James Clarence IV <j.lit...@miami.edu> wrote:
> > I love that user agent.
> >
> > This is the wget command I've used to back up sites that have pretty URLs:
> >
> > wget -v --mirror -p --html-extension -e robots=off --base=./ -k -P ./ <URL>
> >
> > – Jamie
> > ________________________________________
> > From: Code for Libraries <CODE4LIB@LISTSERV.ND.EDU> on behalf of Alexander Duryee <alexanderdur...@gmail.com>
> > Sent: Monday, October 06, 2014 11:51 AM
> > To: CODE4LIB@LISTSERV.ND.EDU
> > Subject: Re: [CODE4LIB] wget archiving for dummies
> >
> > I've used wget extensively for web preservation. It's a remarkably
> > powerful tool, but there are some notable features/caveats to be aware of:
> >
> > 1) You absolutely should use the --warc-file=<NAME> and
> > --warc-header=<STRING> options. These will create a WARC file alongside
> > the usual wget filedump, which captures essential information (process
> > provenance, server requests/responses, raw data before wget adjusts it)
> > for preservation. The warc-header option includes user-added metadata,
> > such as the name, purpose, etc. of the capture. It's likely that you
> > won't use the WARC for access, but keeping it as a preservation copy of
> > the site is invaluable.
> >
> > 2) JavaScript, AJAX queries, links in rich media, and such are completely
> > opaque to wget. As such, you'll need to QC aggressively to ensure that
> > you captured everything you intended to. My method was to run a generic
> > wget capture[1], QC it, and manually download missing objects. I'd then
> > pass everything back into wget to create a complete WARC file containing
> > the full capture. It's janky, but it gets the job done.
> >
> > 3) Do be careful of commenting options, which often turn into spider
> > traps. The latest versions of wget have regex support, so you can
> > blacklist certain URLs that you know will trap the crawler.
> >
> > If the site is proving stubborn, I can take a look off-list.
> >
> > Best of luck,
> > Alex
> >
> > [1] I've used the following successfully:
> > wget --user-agent="AmigaVoyager/3.2 (AmigaOS/MC680x0)" --warc-file=<FILENAME>
> > --warc-header="<STRING>" --page-requisites -e robots=off --random-wait
> > --wait=5 --recursive --level=0 --no-parent --convert-links <URL>
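P.S. Re: Alex's point 3 above, the regex support he mentions is the --reject-regex option in newer wget (1.14 and later). Something along these lines should keep the crawler out of comment/reply links, though the pattern here is only an example and would need to be adapted to the site in question:

wget --mirror --page-requisites --convert-links -e robots=off --reject-regex ".*(replytocom|action=comment).*" <URL>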