I've used wget extensively for web preservation. It's a remarkably powerful tool, but there are a few notable features and caveats to be aware of:
1) You should absolutely use the --warc-file=<NAME> and --warc-header=<STRING> options. These create a WARC file alongside the usual wget file dump, capturing essential information for preservation: process provenance, the raw server requests and responses, and the payload data before wget rewrites it. The --warc-header option embeds user-supplied metadata (the name, purpose, etc. of the capture) in the WARC's warcinfo record. You likely won't use the WARC for access, but keeping it as the preservation copy of the site is invaluable.

2) JavaScript, AJAX queries, links embedded in rich media, and the like are completely opaque to wget. As such, you'll need to QC aggressively to ensure that you captured everything you intended to. My method was to run a generic wget capture[1], QC it, and manually download any missing objects. I'd then pass everything back through wget to create a complete WARC file containing the full capture. It's janky, but it gets the job done.

3) Do be careful of commenting options (sort, reply, and permalink links), which often turn into spider traps. Recent versions of wget (1.14 and later) have regex support, so you can blacklist URL patterns that you know will trap the crawler.

If the site is proving stubborn, I can take a look off-list.

Best of luck,
Alex

[1] I've used the following successfully:

wget --user-agent="AmigaVoyager/3.2 (AmigaOS/MC680x0)" --warc-file=<FILENAME> --warc-header="<STRING>" --page-requisites -e robots=off --random-wait --wait=5 --recursive --level=0 --no-parent --convert-links <URL>
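For the spider traps in point 3, wget's --reject-regex option (available since 1.14) excludes matching URLs from the crawl. A sketch, assuming a hypothetical site whose comment-sort and calendar links are the culprits; the pattern and URL are illustrative, not from any real capture:

```shell
# Hypothetical reject pattern for comment-sort links and calendar pages,
# two common spider traps; adapt it to the URLs you see looping in your
# own crawl log.
REJECT='(\?|&)(sort|replytocom)=|/calendar/[0-9]{4}/'

# wget's --reject-regex uses POSIX ERE by default (--regex-type=posix),
# so the pattern can be sanity-checked locally with grep -E before
# committing to a long crawl:
echo 'https://example.org/post?sort=newest' | grep -Eq "$REJECT" && echo trapped

# The crawl itself, with the trap URLs excluded:
wget --warc-file=site-capture --recursive --level=0 --no-parent \
     --page-requisites --reject-regex "$REJECT" https://example.org/
```

Checking the pattern with grep first is cheap insurance: a too-broad regex silently drops pages you wanted, and you only find out at QC time.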