Hey Tim,

Thanks for the info. The wiki software we use (XWiki) appends a query string to wiki page URLs to trigger a certain behavior. For example, pressing the "watch" button on a page redirects you to "www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument"
where the only thing that changes is the "WIKI-PAGE-NAME" part.

Also, for actions such as "deleting" or "reverting" a wiki page, the URL gains a /remove/ or /delete/ "sub-folder". These usually sit in the middle of the URL, before the actual page name. For example: www.wiki.com/delete/WIKI-PAGE-NAME. So in this case the "offending" part is in the middle of the actual wiki page URL.

What I would need is to prevent wget from visiting any www.wiki.com/delete/ or www.wiki.com/remove/ pages. I'd also need to exclude links that end with "xpage=watch&do=adddocument", which trigger watching that page.

I am using v1.12 because the most recent versions no longer allow --no-clobber and --convert-links to work together. I need --no-clobber because, if the download stops, I have to be able to resume without re-downloading all the files. And I need --convert-links because this needs to work as a local copy. From my understanding, the options you mention were added after v1.12. Is there any way to achieve this?

BTW, -N (timestamping) doesn't work: the server on which the wiki is hosted doesn't seem to support it, so wget keeps re-downloading the same files.

Thanks a lot!

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On June 5, 2018 1:57 PM, Tim Rühsen <tim.rueh...@gmx.de> wrote:

> On 06/05/2018 11:53 AM, CryHard wrote:
>
> > Hey there,
> >
> > I've used the following:
> >
> > wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36" --user=myuser --ask-password --no-check-certificate --recursive --page-requisites --adjust-extension --span-hosts --restrict-file-names=windows --domains wiki.com --no-parent wiki.com --no-clobber --convert-links --wait=0 --quota=inf -P /home/W
> >
> > To download a wiki. The problem is that this will follow "button" links, e.g. the links that allow a user to put a page on a watchlist for further modifications.
> > This has led to me watching hundreds of pages. Not only that, but apparently it also follows the links that lead to reverting changes made by others on a page.
> >
> > Is there a way to avoid this behavior?
>
> Hi,
>
> that depends on how these "button links" are realized.
>
> A button may be part of a HTML FORM tag/structure where the URL is the value of the 'action' attribute. Wget doesn't download such URLs because of the problem you describe.
>
> A dynamic web page can realize "button links" by using simple links. Wget doesn't know about hidden semantics and so downloads these URLs - and maybe they trigger some changes in a database.
>
> If this is your issue, you have to look into the HTML files and exclude those URLs from being downloaded. Or you create a whitelist. Look at options -A/-R and --accept-regex and --reject-regex.
>
> > I'm using the following version:
> >
> > > wget --version
> > >
> > > GNU Wget 1.12 built on linux-gnu.
>
> Ok, you should update wget if possible. Latest version is 1.19.5.
>
> Regards, Tim
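P.S. For reference, if upgrading were an option, I assume the two exclusions I described could be expressed as a single --reject-regex pattern on wget 1.14 or newer. A rough sketch (untested against the real wiki; the www.wiki.com URLs are the placeholder ones from this thread, and since wget's default --regex-type is posix ERE, the same pattern can be sanity-checked with grep -E first):

```shell
# One POSIX ERE covering both cases:
#   1. a /delete/ or /remove/ "sub-folder" anywhere in the path
#   2. URLs ending in the watch-page query string
REJECT='/(delete|remove)/|xpage=watch&do=adddocument$'

# Dry-run the pattern against sample URLs before pointing wget at the wiki;
# only the plain page URL should fail to match.
printf '%s\n' \
  'http://www.wiki.com/delete/WIKI-PAGE-NAME' \
  'http://www.wiki.com/remove/WIKI-PAGE-NAME' \
  'http://www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument' \
  'http://www.wiki.com/WIKI-PAGE-NAME' \
  | grep -cE "$REJECT"   # counts the matching (i.e. to-be-rejected) URLs: 3

# The actual crawl (wget >= 1.14) would then add to the command quoted above:
#   --reject-regex "$REJECT"
```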