On 06/05/2018 11:53 AM, CryHard wrote:
> Hey there,
>
> I've used the following:
>
> wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
> --user=myuser --ask-password --no-check-certificate --recursive
> --page-requisites --adjust-extension --span-hosts
> --restrict-file-names=windows --domains wiki.com --no-parent wiki.com
> --no-clobber --convert-links --wait=0 --quota=inf -P /home/W
>
> To download a wiki. The problem is that this will follow "button" links, e.g
> the links that allow a user to put a page on a watchlist for further
> modifications. This has led to me watching hundreds of pages. Not only that,
> but apparently it also follows the links that lead to reverting changes made
> by others on a page.
>
> Is there a way to avoid this behavior?
Hi,

that depends on how these "button links" are implemented.

A button may be part of an HTML FORM tag/structure, where the URL is the value of the 'action' attribute. Wget doesn't download such URLs, precisely because of the problem you describe.

A dynamic web page can also implement "button links" as plain links. Wget doesn't know about their hidden semantics, so it downloads these URLs, and that may trigger changes in a database.

If this is your issue, you have to look into the HTML files and exclude those URLs from being downloaded. Alternatively, create a whitelist. Look at the options -A/-R, --accept-regex and --reject-regex.

> I'm using the following version:
>
>> wget --version
> GNU Wget 1.12 built on linux-gnu.

OK, you should update wget if possible. The latest version is 1.19.5.

Regards, Tim
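
A minimal sketch of the --reject-regex approach, assuming the wiki exposes its watch and revert links as MediaWiki-style "action=" query URLs. The mail doesn't say which wiki software is in use, so the action names, the https scheme and the sample URLs below are hypothetical. The snippet previews a candidate pattern with grep -E before it is handed to wget. grep's extended regex only approximates wget's default POSIX matching, so treat the output as a sanity check, not a guarantee. As far as I know, --accept-regex/--reject-regex are not available in 1.12 (they appeared in 1.14), which is one more reason to update.

```shell
#!/bin/sh
# Hypothetical reject pattern: assumes MediaWiki-style "action=" query URLs.
# Adjust it after inspecting the links in the downloaded HTML files.
REJECT_RE='[?&]action=(watch|unwatch|rollback|edit|delete)'

# Print only the URLs that do NOT match the reject pattern,
# i.e. the ones a recursive download would still fetch.
filter_urls() {
    grep -Ev "$REJECT_RE"
}

# Sample URLs, made up for illustration.
kept=$(printf '%s\n' \
    'https://wiki.com/index.php?title=Main_Page' \
    'https://wiki.com/index.php?title=Main_Page&action=watch' \
    'https://wiki.com/index.php?title=Foo&action=rollback&from=Bar' \
    'https://wiki.com/index.php?title=Foo&action=history' \
    | filter_urls)
printf '%s\n' "$kept"

# Once the pattern looks right, add it to the existing wget command line:
#   wget --recursive --no-parent --reject-regex "$REJECT_RE" wiki.com
```

Only the plain page view and the history link survive the filter here. A whitelist with --accept-regex works the same way in reverse and is the safer choice when the set of state-changing links is not fully known.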