Re: [Bug-wget] Why does -A not work?

2018-06-21 Thread Tim Rühsen
Just try

wget2 -nd -l2 -r -A "*little-nemo*s.jpeg"
'http://comicstriplibrary.org/search?search=little+nemo'

and you only get
little-nemo-19051015-s.jpeg
little-nemo-19051022-s.jpeg
little-nemo-19051029-s.jpeg
little-nemo-19051105-s.jpeg
little-nemo-19051112-s.jpeg
little-nemo-19051119-s.jpeg
little-nemo-19051126-s.jpeg
little-nemo-19051203-s.jpeg
little-nemo-19051210-s.jpeg
little-nemo-19051217-s.jpeg
little-nemo-19051224-s.jpeg
little-nemo-19051231-s.jpeg
little-nemo-19060107-s.jpeg
little-nemo-19060114-s.jpeg
little-nemo-19060121-s.jpeg
little-nemo-19060128-s.jpeg
little-nemo-19060204-s.jpeg
little-nemo-19060211-s.jpeg
little-nemo-19060218-s.jpeg
little-nemo-19060225-s.jpeg
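
(If the full-size images follow the same naming with an 'l' suffix - the
*l.jpgs you mentioned - then swapping the pattern, e.g. -A "*little-nemo*l.jpeg",
and possibly raising -l, should fetch those as well. Untested.)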

Regards, Tim


Re: [Bug-wget] Why does -A not work?

2018-06-20 Thread Tim Rühsen
On 20.06.2018 18:20, Nils Gerlach wrote:
> It does not delete any HTML file or anything else. Either it is accepted
> and kept, or it is saved forever.
> With the tip about --accept and --accept-regex I can get wget to traverse
> the links, but it does not go deep enough to get the *l.jpgs. I tried to
> increase -l, but to no avail. It seems like it only goes one link deep.
> And it does not delete anything.

Yes, my mistake. Looking at the code, the regex options are applied
without taking --recursive or --level into account. They are dumb URL
filters.
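
Conceptually (this is just an illustration, not the actual wget source),
each candidate URL is simply matched against the regex, with no notion of
recursion depth:

  url='http://comicstriplibrary.org/comics/little-nemo-19051015-s.jpeg'   # made-up URL, for illustration only
  echo "$url" | grep -qE 'little-nemo.*s\.jpeg' && echo accept || echo reject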

We are back at

wget -d -olog -r -Dcomicstriplibrary.org -A "*little-nemo*s.jpeg"
'http://comicstriplibrary.org/search?search=little+nemo'

which doesn't work as expected. Somehow it doesn't follow certain links,
so the little-nemo*s.jpeg files aren't found.
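
The debug log (the -olog file from the command above) should at least show
whether those URLs are seen at all. Something along these lines narrows it
down - the exact debug wording varies between wget versions, so skim the
surrounding lines too:

  grep -n -i little-nemo log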

Interestingly, the same options with wget2 do find and download those
files. At first glance: those files are linked from an RSS/Atom feed.
Such feeds aren't supported by wget, but wget2 does parse them for
URLs.
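
Until then, one possible workaround with wget 1.x is to fetch the feed
yourself and pipe the extracted image URLs back into wget. Untested sketch -
the feed URL is only a placeholder, I haven't checked what the site really
exposes:

  FEED_URL='http://comicstriplibrary.org/feed'   # placeholder, not verified
  wget -qO- "$FEED_URL" \
    | grep -oE 'https?://[^"<[:space:]]*little-nemo[^"<[:space:]]*\.jpeg' \
    | sort -u \
    | wget -nd -i -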

Want to give it a try? https://gitlab.com/gnuwget/wget2

Regards, Tim



Re: [Bug-wget] Why does -A not work?

2018-06-20 Thread Nils Gerlach
It does not delete any HTML file or anything else. Either it is accepted
and kept, or it is saved forever.
With the tip about --accept and --accept-regex I can get wget to traverse
the links, but it does not go deep enough to get the *l.jpgs. I tried to
increase -l, but to no avail. It seems like it only goes one link deep.
And it does not delete anything.



Re: [Bug-wget] Why does -A not work?

2018-06-20 Thread Nils Gerlach
Hi Tim,

I am sorry, but your command does not work. It only downloads the thumbnails
from the first page and follows none of the links. Open the link in a
browser and click on one of the pictures to get a larger picture. There is a
link "high quality picture"; the pictures behind those links are the ones I
want to download. The regex would be ".*little-nemo.*n\l.jpeg" - and not
only from the first page, but from the other search result pages, too.
Can you work that one out? Does this work with wget? The best result would
be if the visited HTML pages were deleted by wget. If they stay I can delete
them afterwards, but automation would be better - that's why I am trying to
use wget ;)

Thanks for the information on the filename and path, though.

Greetings



Re: [Bug-wget] Why does -A not work?

2018-06-20 Thread Tim Rühsen
Hi Nils,

please always reply to the mailing list (no problem if you CC me, but
it's not needed).

It was just an example of a POSIX regex - it's up to you to work out
the details ;-) Or maybe there is a volunteer reading this.

The implicitly downloaded HTML pages should be removed after parsing
when you use --accept-regex, except for the 'starting' page given
explicitly on your command line.
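
If they are not removed in your case, a cleanup pass afterwards should do -
untested, and you may need to adjust the name pattern to whatever actually
lands on disk:

  find . -type f -name '*.htm*' -print    # review what would be removed first
  find . -type f -name '*.htm*' -delete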

Regards, Tim



Re: [Bug-wget] Why does -A not work?

2018-06-20 Thread Tim Rühsen
Hi Nils,

On 06/20/2018 06:16 AM, Nils Gerlach wrote:
> Hi there,
> 
> in #wget on freenode it was suggested that I write to you:
> I tried using wget to get some images:
> wget -nd -rH -Dcomicstriplibrary.org -A
> "little-nemo*s.jpeg","*html*","*.html.*","*.tmp","*page*","*display*" -p -e
> robots=off 'http://comicstriplibrary.org/search?search=little+nemo'
> I wanted to download only the images, but wget was not following any of the
> links, so I added more and more patterns to -A. But it still does not
> follow the links.
> Page numbers of the search results contain "page" in the link; links to the
> big pictures I want wget to download contain "display". Both are given in
> -A and both appear in the HTML document wget gets, yet neither is followed
> by wget.
> 
> Why does this not work at all? The website is public - anybody is free to test.
> But this is not my website!

-A / -R works only on the filename, not on the path. The docs (man page)
are not very explicit about that.

Instead, try --accept-regex / --reject-regex, which act on the complete
URL - but shell wildcards won't work there.

For your example this means replacing '.' with '\.' and '*' with '.*'.
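
For instance, the wildcard pattern "*little-nemo*s.jpeg" becomes the regex
".*little-nemo.*s\.jpeg".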

To download those nemo jpegs:
wget -d -rH -Dcomicstriplibrary.org --accept-regex
".*little-nemo.*n\.jpeg" -p -e robots=off
'http://comicstriplibrary.org/search?search=little+nemo' --regex-type=posix

Regards, Tim


