A number of others have suggested other approaches, but since you started with 
wget, here are the two wget commands I recently used to archive a 
WordPress-behind-EZproxy site. The first logs into EZproxy and saves the login 
as a cookie. The second uses that cookie to access the site through EZproxy.

wget --no-check-certificate --keep-session-cookies --save-cookies cookies.txt \
  --post-data 'user=yeatesst&pass=PASSWORD&auth=d1&url' \
  https://login.EZPROXYMACHINE/login

wget --restrict-file-names=windows --default-page=index.php -e robots=off \
  --mirror --user-agent="" --ignore-length --keep-session-cookies \
  --save-cookies cookies.txt --load-cookies cookies.txt --recursive \
  --page-requisites --convert-links --backup-converted \
  "http://WORDPRESSMACHINE.EZPROXYMACHINE/BLOGNAME"
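
If you end up re-running this, the two steps can be wrapped in a small shell 
script. This is only a sketch using the same placeholders as above 
(EZPROXYMACHINE, WORDPRESSMACHINE, BLOGNAME; USERNAME and PASSWORD stand in 
for the real credentials), with a quick sanity check that the login actually 
produced a session cookie before the mirror starts:

#!/bin/sh
# Sketch: log in to EZproxy, then mirror the blog through it.
# EZPROXYMACHINE, WORDPRESSMACHINE, BLOGNAME, USERNAME and PASSWORD are placeholders.

# Step 1: authenticate and save the EZproxy session cookie.
wget --no-check-certificate --keep-session-cookies --save-cookies cookies.txt \
  --post-data 'user=USERNAME&pass=PASSWORD&auth=d1&url' \
  https://login.EZPROXYMACHINE/login

# Bail out if cookies.txt holds nothing but comments/blank lines
# (wrong password, changed login form, etc.).
if ! grep -qv -e '^#' -e '^$' cookies.txt; then
  echo "EZproxy login failed: no session cookie in cookies.txt" >&2
  exit 1
fi

# Step 2: mirror the blog through EZproxy, reusing the saved cookie.
wget --restrict-file-names=windows --default-page=index.php -e robots=off \
  --mirror --user-agent="" --ignore-length --keep-session-cookies \
  --save-cookies cookies.txt --load-cookies cookies.txt --recursive \
  --page-requisites --convert-links --backup-converted \
  "http://WORDPRESSMACHINE.EZPROXYMACHINE/BLOGNAME"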

cheers
stuart


-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric 
Phetteplace
Sent: Monday, 6 October 2014 7:44 p.m.
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] wget archiving for dummies

Hey C4L,

If I wanted to archive a Wordpress site, how would I do so?

More elaborate: our library recently got a "donation" of a remote Wordpress 
site, sitting one directory below the root of a domain. I can tell from a 
cursory look it's a Wordpress site. We've never archived a website before and I 
don't need to do anything fancy, just download a workable copy as it presently 
exists. I've heard this can be as simple as:

wget -m $PATH_TO_SITE_ROOT

but that's not working as planned. Wget's convert links feature doesn't seem to 
be quite so simple; if I download the site, disable my network connection, then 
host locally, some 20 resources aren't available. Mostly images which are under 
the same directory. Possibly loaded via AJAX. Advice?

(Anticipated) pertinent advice: I shouldn't be doing this at all, we should 
outsource to Archive-It or similar, who actually know what they're doing.
Yes/no?

Best,
Eric
