On 12/27/22 16:18, American Citizen wrote:
> Hi
>
> I used wget recently to try to download all 26 or 27 pages of my
> website, but it seems to miss about 40% of the pages.
>
> Does anyone have the CLI command line which captures 100% of a
> website's URLs?
>
> I tried the typical
>
>   %wget -r --tries=10 https://my.website.com/ -o logfile
>
> as suggested in the "man wget" command, but it did NOT capture all the
> webpages. I even tried a wait parameter, but that only slowed things up
> and did not remedy the missing web subpages issue.
>
> I appreciate any tips so that ALL of the website data can be captured
> by wget. Yes, I am aware of robots.txt restricting downloadable
> information.
>
> - Randall
wget can be a bit tricky - it has a lot of options for downloading
websites. For your case, how many directories deep is your website? By
default, the recursion depth is 5 levels. Try

  wget -r -l 10 --tries=10 https://my.website.com/ -o logfile

for 10 levels deep, or adjust as needed.

To make an offline copy of the website, you can use '--mirror' instead:

  wget --mirror --tries=10 https://my.website.com/ -o logfile

or

  wget --mirror \
       --convert-links \
       --html-extension \
       --wait=2 \
       -o logfile \
       https://my.website.com/

'--html-extension' is handy if some of your pages do not conform to
*.html. Use '--convert-links' for offline viewing in a browser.

Some other options that may be handy:

  -p (--page-requisites) : download all files that are necessary to
      properly display a given HTML page. This includes such things as
      inlined images, sounds, and referenced stylesheets.

  -H (--span-hosts) : enable spanning across hosts when doing recursive
      retrieving.

  --no-parent : when recursing, do not ascend to the parent directory.
      Useful for restricting the download to only a portion of the site.

Also, be aware that some Linux distros symlink wget to wget2, which
behaves a bit differently.

-Ed
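
P.S. As a rough starting point, here is one way the options above could
be combined into a single invocation. The domain is the example from
your message, and the wait and tries values are just placeholders to
adjust for your site:

  wget --mirror \
       --convert-links \
       --html-extension \
       --page-requisites \
       --no-parent \
       --wait=2 \
       --tries=10 \
       -o logfile \
       https://my.website.com/

Since you mentioned robots.txt: wget does honor robots.txt during
recursive downloads, so pages it disallows will be skipped. Adding
'-e robots=off' tells wget to ignore it, which is reasonable on your own
site. Also keep in mind that wget can only follow links that actually
appear in the downloaded HTML, so check the logfile to see whether the
missing pages are ever being requested at all.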
