Hi, in this case you could try it with -X / --exclude-directories.
E.g. wget -X /delete,/remove

That wouldn't help with "xpage=watch..." though. And I can't tell you if
and how well -X works with wget 1.12.

Why (or since when) doesn't --no-clobber plus --convert-links work any
more? Please feel free to open a bug report at
https://savannah.gnu.org/bugs/?func=additem&group=wget with a detailed
description. Because it works for me :-)

Regards, Tim

On 06/05/2018 03:11 PM, CryHard wrote:
> Hey Tim,
>
> Thanks for the info. The wiki software we use (xwiki) appends something to
> wiki page URLs to express a certain behavior. For example, to "watch" a
> page, the button, once pressed, redirects you to
> "www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument"
>
> where the only thing that changes is the "WIKI-PAGE-NAME" part.
>
> Also, for actions such as "deleting" or "reverting" a wiki page, the URL
> changes by adding /remove/ or /delete/ "sub-folders" in the URL. These are
> usually in the middle, before the actual page name. For example:
> www.wiki.com/delete/WIKI-PAGE-NAME. So in this case the "offending URL" is
> in the middle of the actual wiki page URL.
>
> What I would need to do is exclude wget from visiting any
> www.wiki.com/delete or www.wiki.com/remove/ pages. I'd also need to exclude
> links that end with "xpage=watch&do=adddocument", which triggers me to
> watch that page.
>
> I am using v1.12 because the most recent versions have disabled
> --no-clobber and --convert-links from working together. I need --no-clobber
> because if the download stops, I need to be able to resume without
> re-downloading all the files. And I need --convert-links because this needs
> to work as a local copy.
>
> From my understanding, the options you mention were added after v1.12. Is
> there any way to achieve this?
>
> BTW, -N (timestamps) doesn't work, as the server on which the wiki is
> hosted doesn't seem to support it, hence wget keeps redownloading the same
> files.
>
> Thanks a lot!
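Putting the pieces of the advice above together, an invocation might look like the sketch below. This is only an illustration: the host and page names are the placeholders from the thread, and --reject-regex does not exist in wget 1.12 (it was added later, so it requires a newer wget).

```shell
# Hypothetical crawl excluding the "action" URLs (placeholder host; requires
# a wget new enough to support --reject-regex):
#
#   wget --recursive --no-parent \
#        -X /delete,/remove \
#        --reject-regex 'xpage=watch&do=adddocument' \
#        https://www.wiki.com/
#
# The reject pattern can be sanity-checked locally before crawling, e.g. by
# filtering a list of candidate URLs with grep (grep -vE drops every line
# matching the extended regex, mimicking what --reject-regex would skip):
printf '%s\n' \
  'https://www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument' \
  'https://www.wiki.com/WIKI-PAGE-NAME' \
| grep -vE 'xpage=watch&do=adddocument'
```

Only the plain page URL survives the filter; the "watch" URL is dropped.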
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>
> On June 5, 2018 1:57 PM, Tim Rühsen <tim.rueh...@gmx.de> wrote:
>
>> On 06/05/2018 11:53 AM, CryHard wrote:
>>
>>> Hey there,
>>>
>>> I've used the following:
>>>
>>> wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6)
>>> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
>>> --user=myuser --ask-password --no-check-certificate --recursive
>>> --page-requisites --adjust-extension --span-hosts
>>> --restrict-file-names=windows --domains wiki.com --no-parent wiki.com
>>> --no-clobber --convert-links --wait=0 --quota=inf -P /home/W
>>>
>>> to download a wiki. The problem is that this will follow "button" links,
>>> e.g. the links that allow a user to put a page on a watchlist for further
>>> modifications. This has led to me watching hundreds of pages. Not only
>>> that, but apparently it also follows the links that lead to reverting
>>> changes made by others on a page.
>>>
>>> Is there a way to avoid this behavior?
>>
>> Hi,
>>
>> that depends on how these "button links" are realized.
>>
>> A button may be part of an HTML FORM tag/structure where the URL is the
>> value of the 'action' attribute. Wget doesn't download such URLs because
>> of the problem you describe.
>>
>> A dynamic web page can realize "button links" by using simple links.
>> Wget doesn't know about hidden semantics and so downloads these URLs -
>> and maybe they trigger some changes in a database.
>>
>> If this is your issue, you have to look into the HTML files and exclude
>> those URLs from being downloaded. Or you create a whitelist. Look at
>> options -A/-R and --accept-regex and --reject-regex.
>>
>>> I'm using the following version:
>>>
>>>> wget --version
>>>>
>>>> GNU Wget 1.12 built on linux-gnu.
>>
>> OK, you should update wget if possible. The latest version is 1.19.5.
>>
>> Regards, Tim
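The blacklist idea from the quoted reply (--reject-regex rather than -A/-R) can also cover the /delete/ and /remove/ path segments mentioned earlier in the thread. Again a hedged sketch: the URLs are the thread's placeholders, and the option is only available in wget versions newer than 1.12.

```shell
# Hypothetical: reject any URL whose path contains /delete/ or /remove/,
# as --reject-regex '/(delete|remove)/' would do on a new-enough wget.
# Simulated here with grep against the example URLs from the thread:
printf '%s\n' \
  'https://www.wiki.com/delete/WIKI-PAGE-NAME' \
  'https://www.wiki.com/remove/WIKI-PAGE-NAME' \
  'https://www.wiki.com/WIKI-PAGE-NAME' \
| grep -vE '/(delete|remove)/'
```

Only the ordinary page URL passes; both destructive-action URLs match the pattern and are filtered out.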