Hey Tim,

Thanks for the info. The wiki software we use (XWiki) appends a query string to wiki page URLs to trigger a certain behavior. For example, pressing the "watch" button on a page redirects you to "www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument"
where the only thing that changes is the "WIKI-PAGE-NAME" part.

Also, for actions such as "deleting" or "reverting" a wiki page, the URL gains a /remove/ or /delete/ "sub-folder". These usually sit in the middle of the URL, before the actual page name. For example: www.wiki.com/delete/WIKI-PAGE-NAME. So in this case the "offending" part is in the middle of the actual wiki page URL.

What I would need is to prevent wget from visiting any www.wiki.com/delete/ or www.wiki.com/remove/ pages. I'd also need to exclude links that end with "xpage=watch&do=adddocument", which trigger watching that page.

I am using v1.12 because the most recent versions no longer allow --no-clobber and --convert-links to work together. I need --no-clobber because, if the download stops, I have to be able to resume without re-downloading all the files. And I need --convert-links because this needs to work as a local copy. From my understanding, the options you mention were added after v1.12. Is there any way to achieve this?

BTW, -N (timestamping) doesn't work: the server on which the wiki is hosted doesn't seem to support it, so wget keeps re-downloading the same files.

Thanks a lot!

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On June 5, 2018 1:57 PM, Tim Rühsen <tim.rueh...@gmx.de> wrote:

> On 06/05/2018 11:53 AM, CryHard wrote:
>
> > Hey there,
> >
> > I've used the following:
> >
> > wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36" --user=myuser --ask-password --no-check-certificate --recursive --page-requisites --adjust-extension --span-hosts --restrict-file-names=windows --domains wiki.com --no-parent wiki.com --no-clobber --convert-links --wait=0 --quota=inf -P /home/W
> >
> > To download a wiki. The problem is that this will follow "button" links, e.g. the links that allow a user to put a page on a watchlist for further modifications.
> > This has led to me watching hundreds of pages. Not only that, but apparently it also follows the links that lead to reverting changes made by others on a page.
> >
> > Is there a way to avoid this behavior?
>
> Hi,
>
> that depends on how these "button links" are realized.
>
> A button may be part of a HTML FORM tag/structure where the URL is the value of the 'action' attribute. Wget doesn't download such URLs because of the problem you describe.
>
> A dynamic web page can realize "button links" by using simple links. Wget doesn't know about hidden semantics and so downloads these URLs - and maybe they trigger some changes in a database.
>
> If this is your issue, you have to look into the HTML files and exclude those URLs from being downloaded. Or you create a whitelist. Look at options -A/-R and --accept-regex and --reject-regex.
>
> > I'm using the following version:
> >
> > > wget --version
> > >
> > > GNU Wget 1.12 built on linux-gnu.
>
> Ok, you should update wget if possible. Latest version is 1.19.5.
>
> Regards, Tim
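P.S. For reference, if upgrading were an option, I assume the two exclusions I described could be expressed as a single --reject-regex pattern on wget 1.14 or newer. A rough sketch (untested against the real wiki; the www.wiki.com URLs are the placeholder ones from this thread, and since wget's default --regex-type is posix ERE, the same pattern can be sanity-checked with grep -E first):

```shell
# One POSIX ERE covering both cases:
#   1. a /delete/ or /remove/ "sub-folder" anywhere in the path
#   2. URLs ending in the watch-page query string
REJECT='/(delete|remove)/|xpage=watch&do=adddocument$'

# Dry-run the pattern against sample URLs before pointing wget at the wiki;
# only the plain page URL should fail to match.
printf '%s\n' \
  'http://www.wiki.com/delete/WIKI-PAGE-NAME' \
  'http://www.wiki.com/remove/WIKI-PAGE-NAME' \
  'http://www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument' \
  'http://www.wiki.com/WIKI-PAGE-NAME' \
  | grep -cE "$REJECT"   # counts the matching (i.e. to-be-rejected) URLs: 3

# The actual crawl (wget >= 1.14) would then add to the command quoted above:
#   --reject-regex "$REJECT"
```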