Re: [Bug-wget] Why does -A not work?
Just try

  wget2 -nd -l2 -r -A "*little-nemo*s.jpeg" 'http://comicstriplibrary.org/search?search=little+nemo'

and you only get:

little-nemo-19051015-s.jpeg
little-nemo-19051022-s.jpeg
little-nemo-19051029-s.jpeg
little-nemo-19051105-s.jpeg
little-nemo-19051112-s.jpeg
little-nemo-19051119-s.jpeg
little-nemo-19051126-s.jpeg
little-nemo-19051203-s.jpeg
little-nemo-19051210-s.jpeg
little-nemo-19051217-s.jpeg
little-nemo-19051224-s.jpeg
little-nemo-19051231-s.jpeg
little-nemo-19060107-s.jpeg
little-nemo-19060114-s.jpeg
little-nemo-19060121-s.jpeg
little-nemo-19060128-s.jpeg
little-nemo-19060204-s.jpeg
little-nemo-19060211-s.jpeg
little-nemo-19060218-s.jpeg
little-nemo-19060225-s.jpeg

Regards, Tim

On 06/20/2018 09:59 PM, Tim Rühsen wrote:
[...]
Re: [Bug-wget] Why does -A not work?
On 20.06.2018 18:20, Nils Gerlach wrote:
> It does not delete any html-file or anything else. Either it is accepted
> and kept or it is saved forever.
> With the tip about --accept and --accept-regex I can get wget to traverse
> the links, but it does not go deep enough to get the *l.jpgs. I tried to
> increase -l, but to no avail. It seems like it is going only 1 link deep.
> And nothing gets deleted.

Yes, my failure. Looking at the code, the regex options are applied
without taking --recursive or --level into account. They are dumb URL
filters.

We are back at

  wget -d -olog -r -Dcomicstriplibrary.org -A "*little-nemo*s.jpeg" 'http://comicstriplibrary.org/search?search=little+nemo'

which doesn't work as expected. Somehow it doesn't follow certain links,
so the little-nemo*s.jpeg files aren't found.

Interestingly, the same options with wget2 find and download those
files. From a first glimpse: those files are linked from an RSS/Atom
feed. Feeds aren't supported by wget, but wget2 does parse them for
URLs.

Want to give it a try? https://gitlab.com/gnuwget/wget2

Regards, Tim

[...]
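Until wget 1.x learns to parse RSS/Atom feeds, one workaround is to extract the image URLs from the feed yourself and hand them back to wget via -i. A minimal sketch; the feed snippet and image URL below are made up for illustration, and the grep pattern is only a crude stand-in for real XML parsing:

```shell
# Stand-in for a fetched feed, e.g. obtained with: wget -qO- <feed-url>
feed='<entry><link href="http://comicstriplibrary.org/images/little-nemo-19051015-s.jpeg"/></entry>'

# Pull out the .jpeg URLs; in a real run, pipe the result into: wget -nd -i -
printf '%s\n' "$feed" | grep -oE 'https?://[^"<> ]+\.jpeg'
```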
Re: [Bug-wget] Why does -A not work?
It does not delete any html-file or anything else. Either it is accepted
and kept or it is saved forever.
With the tip about --accept and --accept-regex I can get wget to traverse
the links, but it does not go deep enough to get the *l.jpgs. I tried to
increase -l, but to no avail. It seems like it is going only 1 link deep.
And nothing gets deleted.

2018-06-20 16:58 GMT+02:00 Tim Rühsen:
[...]
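When wget keeps the traversed HTML pages around, as described above, they can at least be cleaned up after the crawl. A small sketch; the directory layout and file names are invented for illustration:

```shell
# Fake a wget output tree: one traversed HTML page, one wanted image
mkdir -p crawl
touch crawl/search.html crawl/little-nemo-19051015-s.jpeg

# Remove leftover HTML files, keeping the downloaded images
find crawl -type f -name '*.html' -delete
ls crawl
```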
Re: [Bug-wget] Why does -A not work?
Hi Tim,

I am sorry but your command does not work. It only downloads the
thumbnails from the first page and follows none of the links. Open the
link in a browser. Click on the pictures to get a larger picture. There
is a link "high quality picture"; the pictures behind those links are
the ones I want to download.
Regex being ".*little-nemo.*n\l.jpeg". And not only the first page but
from the other search result pages, too.
Can you work that one out? Does this work with wget? Best result would
be if the visited html-pages were deleted by wget. But if they stay I
can delete them afterwards. But automatism would be better, that's why
I am trying to use wget ;)

Thanks for the information on the filename and path, though.

Greetings

2018-06-20 16:13 GMT+02:00 Tim Rühsen:
[...]
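The filename-vs-path point can be illustrated in the shell: -A/-R style wildcard matching looks only at the last path component, never at the directories before it. The URL below is hypothetical, and the `case` glob is only a stand-in for wget's matcher:

```shell
url='http://comicstriplibrary.org/display/little-nemo-19051015'
file=${url##*/}   # keep only the part after the last '/'

# "*display*" occurs in the path, but not in the filename component,
# so a filename-only matcher rejects this URL
case $file in
  *display*) echo "accepted: $file" ;;
  *)         echo "rejected: $file" ;;
esac
```

This is why adding "*page*" and "*display*" to -A had no effect: those strings only ever appear in the path part of the links.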
Re: [Bug-wget] Why does -A not work?
Hi Niels,

please always answer to the mailing list (no problem if you CC me, but
not needed).

It was just an example for POSIX regexes - it's up to you to work out
the details ;-) Or maybe there is a volunteer reading this.

The implicitly downloaded HTML pages should be removed after parsing
when you use --accept-regex. Except the explicitly 'starting' page from
your command line.

Regards, Tim

On 06/20/2018 04:28 PM, Nils Gerlach wrote:
[...]
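One way to sanity-check an --accept-regex pattern before starting a crawl is to try it against a sample URL with grep -E, which also uses POSIX extended regexes. The sample URL is made up, and that grep -E behaves identically to wget's --regex-type=posix matcher is an assumption:

```shell
pattern='.*little-nemo.*n\.jpeg'

# Hypothetical URL of one of the wanted pictures
url='http://comicstriplibrary.org/images/little-nemo-19051015-n.jpeg'

printf '%s\n' "$url" | grep -Eq "$pattern" && echo accepted || echo rejected
```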
Re: [Bug-wget] Why does -A not work?
Hi Nils,

On 06/20/2018 06:16 AM, Nils Gerlach wrote:
> Hi there,
>
> in #wget on freenode I was suggested to write this to you:
> I tried using wget to get some images:
>
>   wget -nd -rH -Dcomicstriplibrary.org -A "little-nemo*s.jpeg","*html*","*.html.*","*.tmp","*page*","*display*" -p -e robots=off 'http://comicstriplibrary.org/search?search=little+nemo'
>
> I wanted to download the images only but wget was not following any of
> the links, so I got that much more into -A. But it still does not
> follow the links.
> Page numbers of the search result contain "page" in the link; links to
> the big pictures I want wget to download contain "display". Both are
> given in -A and are seen in the html-document wget gets. Neither is
> followed by wget.
>
> Why does this not work at all? Website is public, anybody is free to
> test. But this is not my website!

-A / -R works only on the filename, not on the path. The docs (man page)
are not very explicit about it.

Instead try --accept-regex / --reject-regex, which act on the complete
URL - but shell wildcards won't work.

For your example this means replacing '.' with '\.' and '*' with '.*'.

To download those nemo jpegs:

  wget -d -rH -Dcomicstriplibrary.org --accept-regex ".*little-nemo.*n\.jpeg" -p -e robots=off 'http://comicstriplibrary.org/search?search=little+nemo' --regex-type=posix

Regards, Tim
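The wildcard-to-regex translation described above ('.' becomes '\.', '*' becomes '.*') can be done mechanically, for example with sed. A small sketch, not part of wget itself:

```shell
glob='*little-nemo*s.jpeg'

# Escape literal dots first, then widen each '*' into '.*'
regex=$(printf '%s' "$glob" | sed -e 's/\./\\./g' -e 's/\*/.*/g')
echo "$regex"   # -> .*little-nemo.*s\.jpeg
```

The result can then be passed directly to --accept-regex.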