After I got home I thought I could improve on the script. The following script pulls down the URLs and passes them through a while loop that reduces each URL to the name of the .jpg sitting just in front of the query string. There are a lot of things that could be refactored to clean it up, but it works:

#!/usr/bin/env bash

# Crawl the site, build the url list, and store it in the variable url.
url=$(wget --spider --force-html -r -l2 "http://sites.google.com/site/thebookofgimp/home/chapter-2-photograph-retouching/" 2>&1 | grep '^--' | awk '{ print $3 }')

# Set how many characters in front of ".jpg" to use when building the name string.
front_int=6

# The \n matters: without it, read would drop the last url.
printf '%s\n' "$url" | while IFS= read -r raw_url
do
    # Cut the string at the characters ".jpg".
    pos=${raw_url%%.jpg?*}
    # Determine the character position of the cut.
    pos_int=${#pos}
    # Back up by the value of $front_int.
    (( pos_int -= front_int ))
    # Build a new string from the range given by pos_int and front_int.
    temp_name=${raw_url:pos_int:front_int}
    # Clean up the image name (strip everything through the last slash).
    image_name=$(echo "$temp_name" | sed 's#.*/\(.*\)$#\1#g')
    # Get the image.
    wget -O "${image_name}.jpg" "$raw_url"
done
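One refactor I might make later: bash parameter expansion can pull the file name out directly, with no character counting. A rough sketch, untested; the URL here is a made-up stand-in, and it assumes every link ends in <name>.jpg followed by a query string:

#!/usr/bin/env bash

# Hypothetical URL, just for illustration -- not one of the real attachment links.
raw_url='http://example.com/site/chapter-2/2.100.jpg?attachauth=ABC123'

name=${raw_url%%\?*}   # strip the query string -> http://example.com/site/chapter-2/2.100.jpg
name=${name##*/}       # strip everything through the last slash -> 2.100.jpg

echo "$name"

That would also sidestep the fixed front_int=6, which quietly truncates any image name longer than five characters.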
On Tue, Jan 27, 2015 at 2:48 PM, Todd Millecam <tyg...@gmail.com> wrote:

> alright, you got the 20 second, the 2 minute, and now the 20 minute help
> solution:
>
> Open a terminal and do the following:
>
> cd /tmp
> mkdir images
> cd images
> wget --spider -r -l2 -A jpg http://sites.google.com/site/thebookofgimp/home/chapter-2-photograph-retouching/ 2>&1 | grep '^--' | awk '{print $3}' > imagesList.txt
> for url in `cat imagesList.txt` ; do wget -O `date +%s%N`.jpg $url ; done
>
> That'll download all the images . . . and some crap. Look through it and
> delete the crap out of /tmp/images and you're done.
>
> Explanation for those who want to improve their linux terminal fu:
>
> Make a temporary directory inside of /tmp for cleanliness.
>
> Now, the wget command is confusing. --spider and -r mean "just grab urls,
> don't download anything, but look through and get every url you can find
> from the following location." -l2 means to only look two levels deep, and
> -A jpg means to only print out stuff that includes "jpg" in the url.
>
> From there, it's the big long URL, and then some fun little unix-specific
> stuff.
>
> 2>&1 is an output redirect. 0 is standard in, 1 is standard out, and 2 is
> standard error. wget is a weird program and prints everything to standard
> error by default, so this makes it move all the data from standard error
> to standard out. We want it on standard out so we can send the output of
> wget (all our URLs) to a different program to filter through them. You
> see, wget isn't giving us clean urls, it's giving us some crap output
> lines, and the useful output lines come in a string like:
>
> --2015-01-27 14:38:33--
> https://e572cad7-a-62cb3a1a-s-sites.googlegroups.com/site/thebookofgimp/home/chapter-2-photograph-retouching/2.100.jpg?attachauth=ANoY7cqIGpuUDdagEljYRFF7WMX2G3rAxez0XLIAOW9cXpAnjqilN4X2HyaRWIblk29ORjgMg28jrQuQmBisXSw0d3gYh912nr4DtRyT5Jqk0KVEfJRqC2u92vG7TlxK75odZ1uWVaUrpEvUw1A52TZbuU7Dju7DIPQzou3dskyDSRrh0VAPHrI-znqeKeJ7NuzJqEc8WcLl4MnUpO-dgUZB7i8Eq_z3FFstaXyhjQGcbht8xZ0cBPFvBgw2gWYhuDQ4lqDHJSru&attredirects=0
>
> We don't care about the --<date>-- part and want to get rid of it, so we
> have to filter down. That's why we do the output redirect first, so we
> can use some Linux filter programs, specifically grep and awk.
>
> grep is a regular expression tool, which means it's a very powerful way
> to find text. The regular expression I wanted to pass in was '^--' which
> means: find all lines that start with the characters --. awk will take
> regular expressions too, so I could change the command to look like:
>
> . . . 2>&1 | awk '$0 ~ /^--/ {print $3}' > imagesList.txt
>
> and that would work too.
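A quick way to sanity-check that filter is to feed it a fake log line; the URL below is just a placeholder:

echo '--2015-01-27 14:38:33--  http://example.com/2.100.jpg' | awk '$0 ~ /^--/ {print $3}'
# prints: http://example.com/2.100.jpg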
> The coolest description of awk I ever got was "basically excel with no
> gui". awk splits all your text up into fields -- the default dividing
> character is a space, so if I want the first thing in the line, I use $1
> to say "give me everything up to the first space". There's a space in the
> date here, so the actual URL is in field 3, which is why I tell awk to
> execute the command {print $3}. I could also say "get the last field in
> the line", since that's the URL too, by using $NF.
>
> The last bit, the > imagesList.txt, says to make or overwrite the file
> named imagesList.txt with whatever awk outputs (which is our filtered
> urls).
>
> The last line is:
>
> for url in `cat imagesList.txt` ; do wget -O `date +%s%N`.jpg $url ; done
>
> This is saying: give me the text on each line in imagesList.txt, store it
> in the variable $url, and then execute the command group between do and
> done until we've gone through every line in the file.
>
> The command between do and done is our regular old download-a-file-with-
> wget, with one small modification:
>
> wget -O `date +%s%N`.jpg
>
> Anything in back-ticks (the ` character right next to 1 on most
> keyboards) is an encapsulated command, and everything inside the
> back-ticks will be executed as a command. Well, the command date +%s%N
> means give me the current time in nanoseconds. So, each time wget is run,
> it'll name the downloaded file <the current time in nanoseconds>.jpg, and
> then the for loop takes over and grabs the next one.
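That back-tick trick is easy to watch on its own: with GNU date, every call
gives a new name, so the downloads never overwrite each other. The
timestamps below are invented, just to show the shape of the output:

for i in 1 2 3 ; do echo `date +%s%N`.jpg ; done
# 1422396513123456789.jpg
# 1422396513123580122.jpg
# 1422396513123671045.jpg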
> On Tue, Jan 27, 2015 at 2:17 PM, Stephen Partington <cryptwo...@gmail.com>
> wrote:
>
>> you can write a script to yank out the jpg links, or just use something
>> like https://addons.mozilla.org/en-US/firefox/addon/downthemall/
>>
>> On Tue, Jan 27, 2015 at 12:12 PM, Michael Havens <bmi...@gmail.com>
>> wrote:
>>
>>> How can I use wget to retrieve the photos here
>>> <http://the-book-of-gimp.blogspot.com/p/chapter-2-photograph-retouching.html>?
>>> I tried:
>>>
>>> wget -r http://the-book-of-gimp.blogspot.com/p/chapter-2-photograph-retouching.html
>>>
>>> but it didn't download the pictures. It downloaded a bunch of web pages.
>>> :-)~MIKE~(-:
>>
>> --
>> A mouse trap, placed on top of your alarm clock, will prevent you from
>> rolling over and going back to sleep after you hit the snooze button.
>>
>> Stephen
>
> --
> Todd Millecam

--
James
Linkedin <http://www.linkedin.com/pub/james-h-dugger/15/64b/74a/>