Re: [Bug-wget] Wget follows "button" links

2018-06-05 Thread Tim Rühsen
On 06/05/2018 11:53 AM, CryHard wrote:
> Hey there,
> 
> I've used the following:
> 
> wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) 
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36" 
> --user=myuser --ask-password --no-check-certificate --recursive 
> --page-requisites --adjust-extension --span-hosts 
> --restrict-file-names=windows --domains wiki.com --no-parent wiki.com 
> --no-clobber --convert-links --wait=0 --quota=inf -P /home/W
> 
> To download a wiki. The problem is that this will follow "button" links, e.g 
> the links that allow a user to put a page on a watchlist for further 
> modifications. This has led to me watching hundreds of pages. Not only that, 
> but apparently it also follows the links that lead to reverting changes made 
> by others on a page.
> 
> Is there a way to avoid this behavior?

Hi,

That depends on how these "button links" are realized.

A button may be part of an HTML FORM tag/structure, where the URL is the
value of the 'action' attribute. Wget doesn't download such URLs,
precisely because of the problem you describe.

A dynamic web page can also realize "button links" as plain links.
Wget doesn't know about such hidden semantics and so downloads these URLs -
and maybe they trigger some changes in a database.
If this is your issue, you have to look into the HTML files and exclude
those URLs from being downloaded, or create a whitelist. Look at the
options -A/-R and --accept-regex / --reject-regex.
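For instance, a sketch of the reject approach (the URLs and the exact pattern below are made up for illustration; --reject-regex was added after 1.12 - in 1.14, if I remember right). The same ERE you would hand to --reject-regex can be sanity-checked locally with grep -E before starting the crawl:

```shell
# Hypothetical reject pattern for "action" URLs; with a newer wget you
# would pass it as:
#   wget --recursive --reject-regex '/delete/|/remove/|xpage=watch' ...
# Check locally which URLs would survive the filter:
printf '%s\n' \
  'http://wiki.example.com/Page' \
  'http://wiki.example.com/delete/Page' \
  'http://wiki.example.com/Page?xpage=watch' \
| grep -Ev '/delete/|/remove/|xpage=watch'
# -> http://wiki.example.com/Page
```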

> I'm using the following version:
> 
>> wget --version
> GNU Wget 1.12 built on linux-gnu.

OK, you should update wget if possible. The latest version is 1.19.5.

Regards, Tim





Re: [Bug-wget] Wget follows "button" links

2018-06-05 Thread CryHard
Hey Tim,

Thanks for the info. The wiki software we use (XWiki) appends something to wiki 
page URLs to express a certain behavior. For example, to "watch" a page, the 
button, once pressed, redirects you to 
"www.wiki.com/WIKI-PAGE-NAME?xpage=watch=adddocument"

Where the only thing that changes is the "WIKI-PAGE-NAME" part.

Also, for actions such as "deleting" or "reverting" a wiki page, the URL 
changes by adding /remove/ or /delete/ "sub-folders" to the URL. These are 
usually in the middle, before the actual page name. For example: 
www.wiki.com/delete/WIKI-PAGE-NAME. So in this case the "offending" part is in 
the middle of the actual wiki page URL.

What I would need is to prevent wget from visiting any www.wiki.com/delete/ 
or www.wiki.com/remove/ pages. I'd also need to exclude links that end with 
"xpage=watch=adddocument", which would add me as a watcher of that page.

I am using v1.12 because the most recent versions have stopped --no-clobber 
and --convert-links from working together. I need --no-clobber because, if the 
download stops, I need to be able to resume without re-downloading all the 
files. And I need --convert-links because this needs to work as a local copy. 

From my understanding, the options you mention were added after v1.12. Is 
there any way to achieve this?

BTW, -N (timestamping) doesn't work, as the server on which the wiki is hosted 
doesn't seem to support it, so wget keeps re-downloading the same files.

Thanks a lot!



Re: [Bug-wget] Wget follows "button" links

2018-06-05 Thread Tim Rühsen
Hi,

> "Both --no-clobber and --convert-links were specified, only
--convert-links will be used."

Right, I missed that. The combination of both flags was buggy by design
(also in 1.12) and suffered from several flaws (not to say bugs).

The regex would be more like '.*xpage=watch.*'. The exact syntax depends on
  --regex-type=TYPE   regex type (posix|pcre)
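For a quick sanity check before the crawl, a pattern like this can be tried locally with grep -E, which speaks POSIX ERE (the default --regex-type); the sample URLs below are made up:

```shell
url='http://wiki.example.com/WIKI-PAGE-NAME?xpage=watch=adddocument'

# The core of the pattern is the 'xpage=watch' part; note that a leading
# '/' (as in '.*/xpage=watch.*') would NOT match here, because the live
# URLs have '?xpage=', not '/xpage='.
echo "$url" | grep -Eq '.*xpage=watch.*' && echo rejected

echo 'http://wiki.example.com/WIKI-PAGE-NAME' \
  | grep -Eq '.*xpage=watch.*' || echo kept
```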

What else can you do... try wget2. It allows combining
--no-clobber and --convert-links. And if you find bugs, they can be fixed
(unlike in wget 1.x, where we would have to redesign a whole lot of things).

See https://gitlab.com/gnuwget/wget2

If you don't want to build from git, you can download a fairly recent
tarball from https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.

Signature at https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.sig

Regards, Tim


Re: [Bug-wget] Wget follows "button" links

2018-06-05 Thread CryHard
Hey Tim,

Please see http://savannah.gnu.org/bugs/?31781 where it was implemented, since 
version 1.12.1.

On my personal Mac I have 1.19.5, and when I run the command with both 
arguments I get 

"Both --no-clobber and --convert-links were specified, only --convert-links 
will be used."

as a response. 

Anyway, I might make do without -nc if I can use the regex argument. Could you 
give an example of how that argument would work in my case? Can I just use 
www.mywiki.com/delete/* as an argument, for example? Or .*/xpage=watch.* ?

Thanks!







Re: [Bug-wget] Wget follows "button" links

2018-06-05 Thread Tim Rühsen
Hi,

In this case you could try -X / --exclude-directories.

E.g. wget -X /delete,/remove

That wouldn't help with "xpage=watch..." though.

And I can't tell you whether, or how well, -X works with wget 1.12.
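As a rough illustration of what -X /delete,/remove does (my own toy model, not wget's actual code): any URL whose path begins with one of the listed directories is skipped. URLs below are made up:

```shell
# Toy model of -X /delete,/remove in plain POSIX sh:
is_excluded() {
  path="/${1#*://*/}"                  # strip scheme and host, keep path
  for dir in /delete /remove; do
    case $path in "$dir"/*|"$dir") return 0 ;; esac
  done
  return 1
}
is_excluded 'http://wiki.example.com/delete/Some-Page' && echo skipped
is_excluded 'http://wiki.example.com/Some-Page'        || echo crawled
```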

Why (or since when) doesn't --no-clobber plus --convert-links work any
more?
Please feel free to open a bug report at
https://savannah.gnu.org/bugs/?func=additem&group=wget with a detailed
description.
Because here it works for me :-)

Regards, Tim






[Bug-wget] Wget on Windows handling of wildcards

2018-06-05 Thread Sam Habiel
First time poster.

I have a wget command with a -A flag that contains a wildcard:
'*.DAT'. That works fine on Linux. I am trying to get the same
thing to run on Windows, but *.DAT keeps getting expanded by wget (cmd
does no expansion itself). I have found no way of suppressing
that. I think I tried everything: single quotes, double quotes, escaping
* with ^ (cmd's escape char), etc.

The net effect is that the first time I run the command, it works,
because the expansion of *.DAT finds no matching files, so -A is passed
as *.DAT. If I run the command from a folder that contains *.DAT files,
they get expanded into separate arguments.

I did not read the wget source, but I suspect that there is a problem there.

For reference, here's the whole command:

wget -rNndp -A "*.DAT"
"https://foia-vista.osehra.org:443/Patches_By_Application/PSN-NATIONAL
DRUG FILE (NDF)/PPS_DATS/" -P .

Run it twice on Windows to see the problem.

--Sam



Re: [Bug-wget] Wget on Windows handling of wildcards

2018-06-05 Thread Eli Zaretskii
> From: Sam Habiel 
> Date: Tue, 5 Jun 2018 14:16:27 -0400
> 
> I have a wget command that has a -A flag that contains a wildcard.
> It's '*.DAT'. That works fine on Linux. I am trying to get the same
> thing to run on Windows, but *.DAT keeps getting expanded by wget (cmd
> does no expansion itself). There is no way that I found of suppressing
> that. I think I tried everything: single quotes, double quotes, escape
> * with ^ (cmd escape char), etc.

What version of Windows is that?

> For reference, here's the whole command:
> 
> wget -rNndp -A "*.DAT"
> "https://foia-vista.osehra.org:443/Patches_By_Application/PSN-NATIONAL
> DRUG FILE (NDF)/PPS_DATS/" -P .
> 
> Run it twice on Windows to see the problem.

Did you try using "*.[D]AT"?

The problem, AFAIK, is that the C runtime on modern versions of Windows
expands wildcards even when they are quoted.  So either you build wget
with wildcard expansion disabled (using the appropriate global
variable, whose details depend on whether you use MSVC or MinGW and
which version of MinGW), or you use the above trick (assuming that
wget can expand such wildcards).  Disabling expansion altogether is
usually not a good option, since you probably need it in other use
cases.
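One way to see why the "*.[D]AT" trick can work (my reading, not verified against Microsoft's runtime sources): the runtime's command-line globbing reportedly does not understand [...] character classes, so the argument reaches wget unexpanded, while the fnmatch-style matching behind -A does understand them and still accepts the same files. The pattern side can be checked locally with shell glob matching (filenames below are made up):

```shell
# fnmatch/glob-style matching, as -A uses: "*.[D]AT" accepts the same
# files as "*.DAT".
for f in PSN_1.DAT PSN_2.DAT readme.txt; do
  case $f in
    *.[D]AT) echo "accept $f" ;;
    *)       echo "reject $f" ;;
  esac
done
# -> accept PSN_1.DAT
#    accept PSN_2.DAT
#    reject readme.txt
```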

HTH