[Bug-wget] subscribe

2018-06-29 Thread Zoe Blade
Thanks!



Re: [Bug-wget] Feature request: option to not download rejected files

2018-06-29 Thread Zoe Blade
> ...it would be more useful to avoid downloading rejected files altogether...

Hmm, after a bit more digging, I see this isn't a new request: 
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=217243  Is anyone working on 
this?


[Bug-wget] Feature request: option to not download rejected files

2018-06-29 Thread Zoe Blade
Hi!

First of all, I find wget very useful, so thank you to everyone who has 
contributed to it!

I gather that the rejection list (--reject and --reject-regex) is used to 
determine which downloaded files to permanently save or not.  While that's 
sometimes useful, there are other times it would be more useful to avoid 
downloading rejected files altogether.

For example, rejecting any file with a question mark in it, to avoid 
duplication due to endless combinations of parameters.  It would put far less 
strain on the server to be able to just download the main version of each page 
and not its various iterations.
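
As a rough illustration of the idea above (deciding from the URL alone, before any request is made), here is a minimal sketch using POSIX regular expressions. The function name `url_is_rejected` is hypothetical and is not part of wget; the pattern `\?` simply stands for "any URL containing a question mark".

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not wget code: returns true when `url` matches the
   POSIX extended regex `pattern`, meaning the URL should be skipped before
   it is ever requested. An invalid pattern rejects nothing. */
bool url_is_rejected(const char *url, const char *pattern)
{
    regex_t re;

    /* REG_NOSUB: we only need a yes/no answer, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;

    bool rejected = (regexec(&re, url, 0, NULL, 0) == 0);
    regfree(&re);
    return rejected;
}
```

With the pattern `"\\?"` (a C string for the regex `\?`), a URL such as `http://example.com/page?sort=asc` would be skipped, while `http://example.com/page` would still be fetched.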

Someone even went as far as to write a quick hack to add this functionality for 
themselves: 
https://stackoverflow.com/questions/12704197/wget-reject-still-downloads-file  
It would be much nicer if it was built in, in a more robust and extensible 
manner.

Thanks,
Zoë.


Re: [Bug-wget] Feature request: option to not download rejected files

2018-06-29 Thread Zoe Blade
For anyone else who needs to do this, I adapted Sergey Svishchev's 1.8-era 
patch for 1.19.1 (one of the few versions I managed to get to compile on OS X; 
I'm on a Mac, and not the best programmer):

recur.c:578
-  if (blacklist_contains (blacklist, url))
+  if (blacklist_contains (blacklist, url) || !acceptable (url))

It's not ideal, but it seems to solve the problem as a temporary fix.  
Hopefully it might help someone else who needs this functionality.
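
To make the effect of that one-line change easier to see in isolation, here is a self-contained sketch of the same condition. Everything in it is an assumption for illustration: `visited_set`, `visited_contains`, `url_acceptable` and `should_skip` are hypothetical stand-ins, not wget's actual data structures or functions, and the accept rule shown (no query string) is just the example from earlier in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the crawler's blacklist: an array of URLs
   that have already been queued or fetched. */
struct visited_set {
    const char **urls;
    size_t count;
};

bool visited_contains(const struct visited_set *set, const char *url)
{
    for (size_t i = 0; i < set->count; i++)
        if (strcmp(set->urls[i], url) == 0)
            return true;
    return false;
}

/* Hypothetical accept rule: refuse any URL that carries a query string. */
bool url_acceptable(const char *url)
{
    return strchr(url, '?') == NULL;
}

/* Same shape as the patched condition: skip a URL when it was already
   seen OR when the accept/reject rules refuse it, so it is never fetched. */
bool should_skip(const struct visited_set *set, const char *url)
{
    return visited_contains(set, url) || !url_acceptable(url);
}
```

The point of the design is that the rejection test runs at the moment a link is considered for the download queue, rather than after the file has been retrieved.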

Cheers,
Zoë.


Re: [Bug-wget] Feature request: option to not download rejected files

2018-06-29 Thread Tim Rühsen
On 06/29/2018 03:20 PM, Zoe Blade wrote:
> For anyone else who needs to do this, I adapted Sergey Svishchev's 1.8-era 
> patch for 1.19.1 (one of the few versions I managed to get to compile on OS X; 
> I'm on a Mac, and not the best programmer):
> 
> recur.c:578
> -  if (blacklist_contains (blacklist, url))
> +  if (blacklist_contains (blacklist, url) || !acceptable (url))
> 
> It's not ideal, but it seems to solve the problem as a temporary fix.  
> Hopefully it might help someone else who needs this functionality.

Hi Zoë,

We recently had a discussion (20.6.2018, "Why does -A not work") where I
confirmed that --reject-regex works like a filter for detected URLs.

BTW, the OP wanted --reject-regex to download and parse HTML (and delete it
afterwards if it matches the rejected regex), so the opposite of your
request.

In Wget2 there is an extra option for this, --filter-urls. Maybe
--filter-mime-type is also worth a look.

It would be best if you could provide a small example / reproducer. It can
also be a hand-crafted HTML file.

Regards, Tim


