Re: [Bug-wget] Wget follows "button" links
On 06/05/2018 11:53 AM, CryHard wrote:
> Hey there,
>
> I've used the following:
>
> wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36" --user=myuser --ask-password --no-check-certificate --recursive --page-requisites --adjust-extension --span-hosts --restrict-file-names=windows --domains wiki.com --no-parent wiki.com --no-clobber --convert-links --wait=0 --quota=inf -P /home/W
>
> To download a wiki. The problem is that this will follow "button" links, e.g. the links that allow a user to put a page on a watchlist for further modifications. This has led to me watching hundreds of pages. Not only that, but apparently it also follows the links that lead to reverting changes made by others on a page.
>
> Is there a way to avoid this behavior?

Hi,

that depends on how these "button links" are realized.

A button may be part of an HTML FORM tag/structure where the URL is the value of the 'action' attribute. Wget doesn't download such URLs, precisely because of the problem you describe.

But a dynamic web page can also realize "button links" as plain links. Wget doesn't know about their hidden semantics and so downloads these URLs - and maybe they trigger some changes in a database.

If this is your issue, you have to look into the HTML files and exclude those URLs from being downloaded, or create a whitelist. Look at the options -A/-R and --accept-regex / --reject-regex.

> I'm using the following version:
>
>> wget --version
>> GNU Wget 1.12 built on linux-gnu.

OK, you should update wget if possible. The latest version is 1.19.5.

Regards, Tim
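The -A/-R and --accept-regex/--reject-regex idea can be sketched like this (the host name and the "action=watch" pattern are placeholders, not taken from this thread, and the wget invocations are commented out because they would need a live site):

```shell
# Blacklist approach (sketch): skip URLs matching a POSIX regex.
# wget --recursive --reject-regex 'action=watch' https://wiki.example.com/

# Whitelist approach (sketch): only accept wanted file suffixes.
# wget --recursive -A '.html,.css,.png' https://wiki.example.com/

# Local check of how such a reject regex classifies sample URLs:
for url in 'https://wiki.example.com/Page' \
           'https://wiki.example.com/Page?action=watch'; do
  if printf '%s\n' "$url" | grep -qE 'action=watch'; then
    echo "reject: $url"
  else
    echo "accept: $url"
  fi
done
# → accept: https://wiki.example.com/Page
# → reject: https://wiki.example.com/Page?action=watch
```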
Re: [Bug-wget] Wget follows "button" links
Hey Tim,

Thanks for the info. The wiki software we use (xwiki) appends something to wiki page URLs to express a certain behavior. For example, to "watch" a page, the button, once pressed, redirects you to "www.wiki.com/WIKI-PAGE-NAME?xpage=watch=adddocument", where the only thing that changes is the "WIKI-PAGE-NAME" part.

Also, for actions such as "deleting" or "reverting" a wiki page, the URL changes by adding /remove/ or /delete/ "sub-folders" in the URL. These are usually in the middle, before the actual page name. For example: www.wiki.com/delete/WIKI-PAGE-NAME. So in this case the "offending URL" is in the middle of the actual wiki page URL.

What I would need to do is exclude wget from visiting any www.wiki.com/delete or www.wiki.com/remove/ pages. I'd also need to exclude links that end with "xpage=watch=adddocument", which triggers me to watch that page.

I am using v1.12 because the most recent versions have disabled --no-clobber and --convert-links from working together. I need --no-clobber because, if the download stops, I need to be able to resume without re-downloading all the files. And I need --convert-links because this needs to work as a local copy.

From my understanding, the options you mention were added after v1.12. Is there any way to achieve this?

BTW, -N (timestamps) doesn't work, as the server on which the wiki is hosted doesn't seem to support it, hence wget keeps redownloading the same files.

Thanks a lot!
Re: [Bug-wget] Wget follows "button" links
Hi,

> "Both --no-clobber and --convert-links were specified, only --convert-links will be used."

Right, I missed that. The combination of both flags was buggy by design (also in 1.12) and suffered from several flaws (not to say bugs).

The regex would be more like '.*/xpage=watch.*'. The exact syntax depends on --regex-type=TYPE (regex type: posix|pcre).

What else can you do... try wget2. It allows the combination of --no-clobber and --convert-links. And if you find bugs, they can be fixed (unlike with wget 1.x, where we would have to redesign a whole lot of things). See https://gitlab.com/gnuwget/wget2

If you don't like building from git, you can download a pretty recent tarball from https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz. Signature at https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.sig

Regards, Tim

On 06/05/2018 03:52 PM, CryHard wrote:
> Anyway, I might make do without -nc if I can use the regex argument. Could you give an example of how that argument would work in my case? Can I just use www.mywiki.com/delete/* as an argument, for example? Or .*/xpage=watch.* ?
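Whether a candidate pattern actually hits the URLs described in this thread can be checked locally with grep -E before handing it to --reject-regex. The pattern below is a variant of the suggestion above that also matches the "?xpage=watch" form and the /delete/ and /remove/ path segments; it is a guess at the site's URL layout, not something confirmed by the thread:

```shell
# Sample URLs shaped like the ones described in this thread:
urls='http://www.wiki.com/Page?xpage=watch=adddocument
http://www.wiki.com/delete/Page
http://www.wiki.com/remove/Page
http://www.wiki.com/Page'

pattern='xpage=watch|/(delete|remove)/'

# Count how many of the sample URLs the pattern would reject;
# only the three "action" URLs should match:
printf '%s\n' "$urls" | grep -cE "$pattern"
# → 3
```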
Re: [Bug-wget] Wget follows "button" links
Hey Tim,

Please see http://savannah.gnu.org/bugs/?31781 where it was implemented, since version 1.12.1.

On my personal Mac I have 1.19.5, and when I run the command with both arguments I get:

"Both --no-clobber and --convert-links were specified, only --convert-links will be used."

as a response.

Anyway, I might make do without -nc if I can use the regex argument. Could you give an example of how that argument would work in my case? Can I just use www.mywiki.com/delete/* as an argument, for example? Or .*/xpage=watch.* ?

Thanks!

Sent with ProtonMail Secure Email.
Re: [Bug-wget] Wget follows "button" links
Hi,

in this case you could try it with -X / --exclude-directories.

E.g. wget -X /delete,/remove

That wouldn't help with "xpage=watch..." though.

And I can't tell you if and how well -X works with wget 1.12.

Why (or since when) doesn't --no-clobber plus --convert-links work any more? Please feel free to open a bug report at https://savannah.gnu.org/bugs/?func=additem=wget with a detailed description. Because it works for me :-)

Regards, Tim

On 06/05/2018 03:11 PM, CryHard wrote:
> What I would need to do is exclude from wget visiting any www.wiki.com/delete or www.wiki.com/remove/ pages. I'd also need to exclude links that end with "xpage=watch=adddocument" which triggers me to watch that page.
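Putting the thread's two suggestions together, a full invocation might look like the commented sketch below (host and paths are the ones the original poster described; the combination is untested against a real xwiki). The comma-splitting of the -X list can at least be checked locally:

```shell
# Hedged sketch, not a verified command line:
# wget --recursive --no-parent --page-requisites --convert-links \
#      -X /delete,/remove \
#      --reject-regex 'xpage=watch' \
#      https://www.wiki.com/

# -X takes a comma-separated list of directory prefixes:
dirs='/delete,/remove'
IFS=','
set -- $dirs
unset IFS
echo "excluding $# directories: $1 $2"
# → excluding 2 directories: /delete /remove
```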
[Bug-wget] Wget on Windows handling of wildcards
First time poster.

I have a wget command with a -A flag that contains a wildcard: '*.DAT'. That works fine on Linux. I am trying to get the same thing to run on Windows, but *.DAT keeps getting expanded by wget (cmd does no expansion itself). I found no way of suppressing that. I think I tried everything: single quotes, double quotes, escaping * with ^ (the cmd escape char), etc.

The end effect is that the first time I run the command, it works, because wget tries expanding *.DAT and fails, so it sends -A as *.DAT. If I run the command from the folder that contains the *.DAT files, it expands them into arguments. I did not read the wget source, but I suspect that the problem is there.

For reference, here's the whole command:

wget -rNndp -A "*.DAT" "https://foia-vista.osehra.org:443/Patches_By_Application/PSN-NATIONAL DRUG FILE (NDF)/PPS_DATS/" -P .

Run it twice on Windows to see the problem.

--Sam
Re: [Bug-wget] Wget on Windows handling of wildcards
> From: Sam Habiel
> Date: Tue, 5 Jun 2018 14:16:27 -0400
>
> I have a wget command that has a -A flag that contains a wildcard. It's '*.DAT'. That works fine on Linux. I am trying to get the same thing to run on Windows, but *.DAT keeps getting expanded by wget (cmd does no expansion itself). There is no way that I found of suppressing that. I think I tried everything: single quotes, double quotes, escape * with ^ (cmd escape char), etc.

What version of Windows is that?

> For reference, here's the whole command:
>
> wget -rNndp -A "*.DAT" "https://foia-vista.osehra.org:443/Patches_By_Application/PSN-NATIONAL DRUG FILE (NDF)/PPS_DATS/" -P .
>
> Run it twice on Windows to see the problem.

Did you try using "*.[D]AT"?

The problem, AFAIK, is that the C runtime on modern versions of Windows expands wildcards even when quoted. So either you need to build wget with wildcard expansion disabled (using the appropriate global variable, whose details depend on whether you use MSVC or MinGW, and on which version of MinGW), or you use the above trick (assuming that wget can expand such wildcards).

Disabling expansion altogether is usually not a good option in this case, since you probably need it in other use cases.

HTH
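The "*.[D]AT" workaround is easy to sanity-check on any POSIX shell: the bracket expression selects exactly the same file names as "*.DAT", so nothing is lost by using it (whether it actually keeps the Windows C runtime from expanding the argument has to be verified on Windows itself):

```shell
# Confirm "*.[D]AT" matches the same set of files as "*.DAT".
dir=$(mktemp -d)
cd "$dir"
touch A.DAT B.DAT C.TXT
echo *.DAT     # → A.DAT B.DAT
echo *.[D]AT   # → A.DAT B.DAT
```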