forum download, cookies?
A forum has topics that are available only to members. How can I use wget to download copies of those pages? How do I obtain the proper cookies, and how do I get wget to use them correctly? I use IE on PC/Windows and wget on a Unix machine. I could use Lynx on the Unix machine if needed. (The PC/Windows machine has Firefox, but I cannot install anything new. If Firefox has a downloader plugin suitable for forum downloading, that would be fine.)

Juhana
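One approach, sketched below rather than tested: let wget itself do the login and save the session cookies. The forum URL and the user/passwd form field names are made up and must be read off the forum's real login page; this also needs a wget recent enough to have --post-data and --keep-session-cookies (around 1.9/1.10):

    # log in via the forum's login form and save the session cookies
    # (URL and the user/passwd field names are hypothetical)
    wget --save-cookies cookies.txt --keep-session-cookies \
         --post-data 'user=myname&passwd=mypass' \
         http://forum.example.com/login.php

    # reuse the cookies for the recursive download of the member pages
    wget --load-cookies cookies.txt -r -np http://forum.example.com/topics/

Alternatively, cookies exported from a browser in the Netscape cookies.txt format can be fed to --load-cookies directly.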
Bug in 1.10.2 vs 1.9.1
Hello. Wget 1.10.2 has the following bug compared to version 1.9.1. First, bin/wgetdir is defined as

    wget -p -E -k --proxy=off -e robots=off --passive-ftp \
         -o zlogwget`date +%Y%m%d%H%M%S` \
         -r -l 0 -np -U Mozilla --tries=50 --waitretry=10 "$@"

The download command is

    wgetdir http://udn.epicgames.com

With version 1.9.1 the download is OK. With version 1.10.2, only udn.epicgames.com/Main/WebHome is downloaded, and the other converted URLs are of the form http://udn.epicgames.com/../Two/WebHome

Juhana
url accept/reject? accept scripts
Hello. How do I get wget to ignore URLs containing one of the following strings? Surprisingly, --help did not reveal a suitable option.

    action=
    printable=
    redirect=
    article=
    returnto=
    title=

I would also like to remind you of the problems with the existing options: (1) I downloaded an FTP site with --accept=pdf,PDF but only PDF files were downloaded. (2) I downloaded an HTTP site with -X forums,wiki but only one of the two was excluded. Therefore, I would like an example of exactly how these LIST options are typed.

The options are good, but a more general solution would be a script callback system with access to the wget variables. Example:

    wget --accept-script=wikiurls.script [other options] http://nwn2wiki.org/Main_Page.html

wikiurls.script could be:

    accept = 1;
    if (string_has(CurrentURL, "action="))    accept = 0;
    if (string_has(CurrentURL, "printable=")) accept = 0;
    if (string_has(CurrentURL, "redirect="))  accept = 0;
    if (string_has(CurrentURL, "article="))   accept = 0;
    if (string_has(CurrentURL, "returnto=")) accept = 0;
    if (string_has(CurrentURL, "title="))     accept = 0;
    if (accept == 0) fprintf(rejectedfp, "%s\n", CurrentURL);
    return accept;

Another script, for downloading full-size images instead of thumbnails, could be:

    if (string_has(CurrentURL, "/thumbs/")) {
        newurl = strdup(CurrentURL);
        string_delete(newurl, "thumbs/");
        queue_url(newurl);
    } else if (string_has(CurrentURL, "_small")) {
        newurl = strdup(CurrentURL);
        string_replace(newurl, "_small", "_large");
        queue_url(newurl);
    }
    accept = 1;
    return accept;

Perhaps not that easy, but the idea is there. Other script types could be parser scripts, e.g., for additional parsing of OpenWindow('page.html'); and OpenImage('image.jpg'); JavaScript calls.

Juhana
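For what it is worth, later wget releases (1.14 and newer, if I recall the version right) grew a --reject-regex option that answers the original question directly; with one of those, something like this should work:

    # skip any URL whose query string contains one of the unwanted keys
    wget -r -np \
         --reject-regex 'action=|printable=|redirect=|article=|returnto=|title=' \
         http://nwn2wiki.org/Main_Page.html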
wget server?
Hello. The following problem occurred recently. I started downloading everything under the directory http://site.edu/projects/software/ and after a day I found that the subdirectory http://site.edu/projects/software/program/manual/ had a wiki with millions of files. Because I wanted the download to continue on to the other directories, I did not interrupt wget. After a week and a half, wget quit after running out of memory.

As a solution, could wget be turned into a wget server? A server doing the downloads, and a client program, wget. The command "wget url" would send the URL and the current directory to the server. The above problem would be solved by allowing the user to add rejects on the fly, e.g.,

    wget --add-reject http://site.edu/projects/software/program/manual/

The server would then start skipping the queued URLs of the manual and eventually move on to the other directories.

A client/server model would allow more useful features. I often download many individual directories from one site. Now all the downloads run in parallel as background processes, because I don't want to wait and stop what I'm doing. The server could by default queue all downloads for a given site and fetch one URL at a time; downloads from different sites would still run in parallel in the server.

Juhana
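A crude approximation of the one-at-a-time queueing is possible in the shell today with a named pipe; everything below (paths, URLs) is purely illustrative:

    # "server": forever read URLs from the pipe, fetching one at a time
    mkfifo /tmp/wgetq
    while true; do
        while read -r url; do
            wget -r -np "$url"
        done < /tmp/wgetq
    done &

    # "client": queue a URL and return to work immediately
    echo http://site.edu/projects/software/ > /tmp/wgetq

The on-the-fly reject list, though, really would need support inside wget itself.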
news protocol?
Hello. The TODO lists the following:

    * Add more protocols (e.g. gopher and news), implementing them in a modular fashion.

Do you mean the NNTP protocol? If yes: I recently wrote an NNTP downloader, http://www.funet.fi/~kouhia/nntppull20060409.tar.gz and I find it good for news archiving. I now archive nearly 700 newsgroups. But what kind of plans do you have?

What I still need is a way to download newsgroup archives from Google. I have a free project for which I would like the complete archives of two to four groups. I'm not aware of any other archive, public or private, that could help me.

Juhana
accepted and excluded?
Hello. How would I type the -A option if I want both .pdf and .PDF files from an FTP site? "-A pdf,PDF" failed: only the PDF files were downloaded. How would I type the -X option if I want multiple subdirectories excluded? "-X dir1,dir2" failed: only one of the given directories was excluded. (E.g. www.site.dom/dir1/ and www.site.dom/dir2/ should both be excluded when the whole site is downloaded.)

For now I only need the exact working options, as I'm not sure what "comma-separated list of accepted extensions" means precisely. I will investigate later whether and why the options fail. I'm using the latest wget as far as I know: version 1.9.1.

Juhana
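For reference, the syntax I would expect to work (untested against 1.9.1; the site and directory names are just examples):

    # accept both lowercase and uppercase extensions
    wget -r -A 'pdf,PDF' ftp://www.site.dom/pub/

    # exclude /dir1/ and /dir2/; note the leading slashes,
    # since -X wants directory paths rooted at the server, not bare names
    wget -r -np -X /dir1,/dir2 http://www.site.dom/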
wget with a log database?
Hello. I would like to have a database within wget. The database would let wget know what it has downloaded earlier. Wget could then download only new and changed files, and could continue a download without the old downloads being on my disk. The database would also be accessible to other programs: e.g., new downloads could later be merged into the earlier downloads with another program, and the database would let me notice when I'm trying to download something I already have. Do we have a downloader with these features available already?

Could somebody install and test the Nedlib Harvester at http://www.csc.fi/sovellus/nedlib/ ? NH was used to download all webpages in Finland (totalling 400 GB). I don't know if NH has all the features I need. E.g., I would like to associate includes and excludes with individual sites and webpage structures, so that the next time I update my copy, the downloader uses the given includes and excludes.

Juhana
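A small part of this exists already: wget's timestamping mode fetches only files that are new or changed on the server, though it compares against the old files on disk rather than against a separate database. A sketch (the URL is hypothetical):

    # re-fetch only files newer on the server than the local copies
    wget -r -N http://site.example.edu/docs/

The database idea goes further, since -N cannot work once the old downloads have been deleted or moved.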
wget problem
Hello. The following document could not be downloaded at all: http://www.greyc.ensicaen.fr/~dtschump/greycstoration/ If you succeed, please tell me how. I want all the HTML files and the images.

Juhana
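If the server happens to refuse wget's default user agent, or a robots.txt blocks the recursion, the following is worth a try (a guess, not a verified fix):

    # masquerade as a browser and ignore robots.txt
    wget -r -np -p -E -k -U Mozilla -e robots=off \
         http://www.greyc.ensicaen.fr/~dtschump/greycstoration/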
Re: Help Needed
Hello. Does wget have NNTP (Usenet newsgroups) support? For example, I might want to download all articles between numbers M and N. A date-based system could be useful too. We just need to agree on how these queries are represented to wget. I can dig out an old Usenet news downloader code if wget does not have one yet. Juhana
on tilde bug
Hello. I traced the URL given at the command line, and it looks like there is no difference whether one gives ~ or %7E. Is this true? The URLs end up in url_parse(), which switches ~ (as unsafe) to %7E. If the original URL is not used at all, as it appears, then there is no difference. But mysteriously, the URL with ~ and the URL with %7E downloaded different files! I also added new log outputs, and while testing them with the problem sites, to my surprise, there seemed to be no problems. So the fact that URLs are not downloaded could be just some code bug in wget. But why does this problem appear when ~ is in the download URL? Have I just missed the other cases? Or is the bug in the code which expands the unsafe characters? Juhana
char 5C problem
Hello. Wget could not download the images of the page http://www.fusionindustries.com/alex/combustion/index.html The image URLs have %5C (backslash, \) in them:

    http://www.fusionindustries.com/alex/combustion/small%5C0103%20edgepoint-pressure%20small.png
    http://www.fusionindustries.com/alex/combustion/big%5C0103%20edgepoint-pressure.png

The wget 1.9.1 options used included -E, -r, -k, and -np. Juhana
Tilde bug again
Hello. Has the ~ / %7E bug always been in wget? When was it added, and who wrote the code? I would like to suggest that the person who made this severe bug should immediately fix it. It does not make sense for us to waste time trying to fix this bug if that person did not spend a moment designing the feature and thinking of the consequences. It is better that the original code is restored, and that the person takes plenty of time redesigning the feature if he wishes to get it back!! Sorry about the bad tone, but if the bug is not fixed, then we must restore the original code as soon as possible. (Here the bug hits almost every time a URL has ~ in it.)

PS. I am willing to look at it myself, but: what piece of code changes %7E to ~ in the case where the given URL has %7E but ~ appears on my disk? In what format is the URL saved for the -np option, and by what routine? In what format are commands sent to the server: with ~ only, with %7E only, or both? I'm not sure if these questions help, but they are a start.

Juhana
Directory indices?
Hello. Why does wget generate the following index files? Why so many?

    ftp1.sourceforge.net/gut/index.html
    ftp1.sourceforge.net/gut/index.html?C=M;O=A
    ftp1.sourceforge.net/gut/index.html?C=M;O=D
    ftp1.sourceforge.net/gut/index.html?C=N;O=A
    ftp1.sourceforge.net/gut/index.html?C=N;O=D
    ftp1.sourceforge.net/gut/index.html?C=S;O=A
    ftp1.sourceforge.net/gut/index.html?C=S;O=D

Juhana
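Those look like the column-sort links that Apache adds to auto-generated directory listings (C selects the sort column: N name, M modification date, S size; O the order: A ascending, D descending), and wget follows each one as a distinct URL. A hedged, untested workaround is to reject them by pattern:

    # skip the sort-link variants of the auto-index pages
    # (-R treats list elements containing wildcards as patterns)
    wget -r -np -R 'index.html?C=*' http://ftp1.sourceforge.net/gut/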
img dynsrc not downloaded?
Hello. Wget could not follow dynsrc attributes; the MPEG file referenced by

    <p><img dynsrc="Collision.mpg" CONTROLS LOOP=1>

at http://www.wideopenwest.com/~nkuzmenko7225/Collision.htm was not downloaded. Regards, Juhana
xml files not processed?
Hello. When the URL http://zeus.fri.uni-lj.si/%7Ealeks/POIS/Kolaborativno%20delo.htm is downloaded with -np -r -l 0 etc., the file http://zeus.fri.uni-lj.si/~aleks/POIS/Kolaborativno delo_files/filelist.xml is downloaded correctly. However, the hrefs in the XML file are not then followed:

    <o:File HRef="slide0008.htm"/>
    <o:File HRef="slide0008_image001.png"/>
    <o:File HRef="slide0008_image002.jpg"/>
    <o:File HRef="slide0011.htm"/>

Note that the pres.xml file in the same directory has href="c:\temp\Kolaborativno delo.htm", which is apparently incorrect and should be ignored; the referred file is the first URL given in this mail. These XML files are apparently generated by a PowerPoint-to-HTML converter. Regards, Juhana
Developers here?
Hello. Recent mails have not been answered, and the CVS may be stale. Who are the developers of wget at the moment? I just posted a couple of feature-regression reports, but my intent is not to pour the tasks onto the current developers. However, without anyone giving hints on what to look at, the features may remain unimplemented by me. Regards, Juhana
wget scripting?
Hello. I have been thinking a little about how wget could possibly be made better. We would need a scripting system so that features can be programmed more easily. One way to incorporate scripting into wget would be to rewrite wget as a data-flow system, much in the same way that OpenGL (www.opengl.org) is a data flow for graphics. The scripts would be executed at specific places in the data-flow graph, much as vertex and fragment programs are executed at specific places in the OpenGL graph. So the URLs would enter the data flow, and the routines in the graph would do something to them. I don't know yet what kind of graph we would have, but here is a simple one:

    url input -> url processing -> site exclusion -> dir path exclusion -> get file -> ...

Then the graph goes deeper into parsing the HTML. Example: I could add a script just after "get file". The script would uncompress the downloaded file to a new file and change the local_filename variable to the name of the new file. The graph would make it possible to use different granularity: details can be added by splitting the large graph nodes later. Regards, Juhana
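The Unix shell already gives a rough feel for the idea; the stages below stand in for the graph nodes, and the file names and patterns are purely illustrative:

    # url input | url processing | site exclusion | path exclusion | get file
    cat urls.txt \
      | grep -v '^#' \
      | grep -v '^http://badsite\.example\.com/' \
      | grep -v '/manual/' \
      | wget -i - -x

The difference is that wget's graph would feed newly discovered URLs back into the head of the pipeline, which a one-way shell pipe cannot do.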
compressed html files?
Hello. The file http://www.cs.utah.edu/~gooch/JOT/index.html is compressed, and wget could not follow the URLs in it. What can be done? Should wget uncompress compressed *.htm and *.html files? What about *.asp and *.php? Juhana
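A manual workaround is sketched below; it assumes the file is gzip-compressed and served as-is:

    # fetch the compressed page and decompress it by hand
    wget -O index.html.gz http://www.cs.utah.edu/~gooch/JOT/index.html
    gunzip index.html.gz

    # let wget parse the local copy, resolving relative links
    # against the original base URL
    wget -r -F -B http://www.cs.utah.edu/~gooch/JOT/ -i index.html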
Character coding gives problems
Hello. The character coding of ~ causes problems in downloading. Example:

    wget -p -E -k --proxy=off -e robots=off --passive-ftp -q -r -l 0 -np \
         http://www.stanford.edu/~dattorro/

However, not everything was downloaded. The file machines.html has the hrefs

    http://www.stanford.edu/%7Edattorro/images/calloph.jpg
    http://www.stanford.edu/%7Edattorro/Lexicon.htm

instead of the correct relative URLs

    images/calloph.jpg
    Lexicon.htm

and these two files were not downloaded. In fact, 52 files were downloaded and 105 files were not!! A major bug. Simply put, ~ and %7E should be treated as the same character; otherwise there is no point in the %-encodings at all. For local filenames, only one of ~ and %7E should be used. I would prefer the %-encoded form, because shell scripts (e.g., for i in `find`) cannot handle spaces in filenames. Has this problem been fixed already? Is there any quick solution until the problem is fixed?

Juhana
wget problem: urls behind script
Hello. One wget problem this time. I downloaded everything in http://www.planetunreal.com/wod/tutorials/ but most of the files were not downloaded, because the URLs are in the file http://www.planetunreal.com/wod/tutorials/sidebar.js in the following format:

    FItem("Beginner's Guide to UnrealScript", "guide.htm");
    Item("Class Tree", "classtree.htm");
    Item("Download the MASSIVE all inclusive UScript Tutorial", "UScript Tutorial.doc");
    Item("My First Mod (Part 1)", "1stmod.html");

Could wget test, for each string inside such a function call, whether the string is a file in the directory, and then continue processing the file if it exists? In the above example, wget would additionally try to download the following files:

    http://www.planetunreal.com/wod/tutorials/Beginner's Guide to UnrealScript
    http://www.planetunreal.com/wod/tutorials/guide.htm
    http://www.planetunreal.com/wod/tutorials/classtree.htm
    http://www.planetunreal.com/wod/tutorials/Class Tree
    http://www.planetunreal.com/wod/tutorials/Download the MASSIVE all inclusive UScript Tutorial
    http://www.planetunreal.com/wod/tutorials/UScript Tutorial.doc
    http://www.planetunreal.com/wod/tutorials/My First Mod (Part 1)
    http://www.planetunreal.com/wod/tutorials/1stmod.html

It could be that the webserver reports "file not found" errors and generates an error page. How do I prevent those error pages from being saved? E.g., I should not end up with a file "My First Mod (Part 1).html" containing the errors. Regards, Juhana
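Until something like that exists, the filenames can be scraped out of sidebar.js by hand; a rough sketch, assuming the quoting shown above:

    # pull the second argument of each Item(...) call and fetch it;
    # the quoting keeps filenames with spaces intact
    grep -oE 'Item\("[^"]*", *"[^"]*"\)' sidebar.js \
      | sed -E 's/.*, *"([^"]*)"\).*/\1/' \
      | while IFS= read -r f; do
            wget "http://www.planetunreal.com/wod/tutorials/$f"
        done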
wget bug: directory overwrite
Hello. Problem: when downloading everything under http://udn.epicgames.com/Technical/MyFirstHUD wget overwrites the already-downloaded MyFirstHUD file with the MyFirstHUD directory (which comes later). GNU Wget 1.9.1:

    wget -k --proxy=off -e robots=off --passive-ftp -q -r -l 0 -np -U Mozilla "$@"

Solution: use the -E option. Regards, Juhana
Bug report
Hello. This is a report on some wget bugs. My wgetdir command looks like the following (wget 1.9.1):

    wget -k --proxy=off -e robots=off --passive-ftp -q -r -l 0 -np -U Mozilla "$@"

Bugs:

Command: wgetdir http://www.directfb.org
Problem: in the file www.directfb.org/index.html, hrefs of the type /screenshots/index.xml were not converted to relative with the -k option.

Command: wgetdir http://threedom.sourceforge.net
Problem: in the file threedom.sourceforge.net/index.html, the hrefs were not converted to relative with the -k option.

Command: wgetdir http://liarliar.sourceforge.net
Problem: files are named

    content.php?content.2
    content.php?content.3
    content.php?content.4

which are interpreted by, e.g., Nautilus as manual pages and displayed as plain text. Could the files, and the links to them, be renamed as follows?

    content.php?content.2.html
    content.php?content.3.html
    content.php?content.4.html

After all, are those pages still PHP files, or generated HTML files? If they are HTML files produced by the PHP files, it could be a good idea to add the extension to them.

Command: wgetdir http://www.newtek.com/products/lightwave/developer/lscript2.6/index.html
Problem: images are not downloaded, perhaps because the image links look like this:

    <image src="v26_2.jpg">

Regards, Juhana
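The renaming asked for in the third bug sounds like what the -E (--html-extension) option already does, if I read the manual right:

    # append .html to saved files of type text/html whose names
    # do not already end in .htm or .html
    wget -r -k -E http://liarliar.sourceforge.net/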
will mime coding make the site different?
Hello. I downloaded http://agar.csoft.org/index.html with the -k option, but the URL http://agar.csoft.org/man.cgi?query=widget&amp;sektion=3 in the file was not converted to relative. (The local filename is man.cgi?query=widget&sektion=3.) Regards, Juhana
Re: not downloading at all, help
> --16:59:21--  http://www.maqamworld.com:80/
>            => `index.html'
> Connecting to www.maqamworld.com:80... connected!

> It looks like you have http_proxy=80 in your wgetrc file.

I placed use_proxy = off into .wgetrc (a file I did not have earlier) and into ~/wget/etc/wgetrc (a file I had), and tried

    wget --proxy=off http://www.maqamworld.com

and it still does not work. Could there be some system wgetrc file somewhere? I have compiled wget on my own into my home directory, and I certainly wish that my own installation does not use the files of some other installation.

Why did you think the :80 comes from a proxy? I have always thought it comes from the target site, not from our end. Did you try the given command yourself, and did it work? Please try now if you did not. If wget puts the :80 there, how do I instruct wget not to do that, no matter what is told somewhere? What part of the source code should I edit, if that is the only thing that helps? Though you should fix this in the wget source, because something is not working now. I wonder why this non-working behaviour is set as the default in wget...

Regards, Juhana
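For tracing where a proxy setting might come from, a generic checklist (not specific to this site):

    # see whether a proxy is configured in the environment
    env | grep -i proxy

    # clear it for this shell and retry with proxies off
    unset http_proxy HTTP_PROXY
    wget --proxy=off http://www.maqamworld.com/

A self-compiled wget also reads a system wgetrc from its configured sysconfdir (typically $prefix/etc/wgetrc, if I remember the build defaults right), which is worth checking.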
not downloading at all, help
Hello. What goes wrong in the following? (I will read replies from the list archives.)

    % wget http://www.maqamworld.com/
    --16:59:21--  http://www.maqamworld.com:80/
               => `index.html'
    Connecting to www.maqamworld.com:80... connected!
    HTTP request sent, awaiting response... 503 Unknown site
    16:59:21 ERROR 503: Unknown site.

Regards, Juhana