Referrer Faking and other nifty features
Hi, I was wondering if there was any support in wget for:

1) Referrer faking (i.e., wget automatically supplies a referrer based on the, well, referring page).

2) The regex support like in the "gold" package that I can no longer find.

3) Multi-threading.

Also, I have in the past encountered a difficulty with the ~ being escaped the wrong way; has this been fixed? I know at one point one site suggested you modify url.c to "fix" this.

Finally, is there a way to utilize the persistent cookie file that lynx generates to "feed" wget?

Thanks,

Dan Mahoney

--
"I wish the Real World would just stop hassling me!" -Matchbox 20, "Real World", off the album "Yourself or Someone Like You"

Dan Mahoney
Techie, Sysadmin, WebGeek
Gushi on efnet/undernet IRC
ICQ: 13735144  AIM: LarpGM
Web: http://prime.gushi.org
finger [EMAIL PROTECTED] for pgp public key and tel#

---
Re: Referrer Faking and other nifty features
Good morning,

Please note that I am just a wget user; there may be errors.

On Tue, Apr 02, 2002 at 11:50:03PM -0500, Dan Mahoney, System Admin wrote:

> 1) referrer faking (i.e., wget automatically supplies a referrer
> based on the, well, referring page)

That is the --referer option; see (wget)HTTP Options in the Info documentation.

> 2) The regex support like in the "gold" package that I can no longer
> find.

No; however, you may use shell globs, see (wget)Accept/Reject Options.

> 3) Multi-threading.

I suppose you mean downloading several URIs in parallel. No, wget doesn't support that. Sometimes, however, one may start several wgets in parallel, thanks to the shell (the & operator in Bourne shells).

> Also, I have in the past encountered a difficulty with the ~ being
> escaped the wrong way; has this been fixed? I know at one point one
> site suggested you modify url.c to "fix" this.

AFAIK I have never had that problem; maybe it has been fixed.

> Finally, is there a way to utilize the persistent cookie file that
> lynx generates to "feed" wget?

There is the --load-cookies=FILE option; see (wget)HTTP Options.

How to read the Info documentation: type "info wget" from a shell. The "?" key may help you. Use "gHTTP Options" to go to the node "HTTP Options".

Have a nice day.

--
fabrice bauzac
Software should be free.  http://www.gnu.org/philosophy/why-free.html
Re: Referrer Faking and other nifty features
On Wed, 3 Apr 2002, fabrice bauzac wrote:

> > 1) referrer faking (i.e., wget automatically supplies a referrer
> > based on the, well, referring page)
>
> It is the --referer option, see (wget)HTTP Options, from the Info
> documentation.

Yes, that allows me to specify _A_ referrer, like www.aol.com. When I'm trying to help my users mirror their old Angelfire pages or something like that, very often the link has to come from the same directory. I'd like to see something where, when wget follows a link to another page or another image, it automatically supplies the URL of the page it followed to get there. Is there a way to do this?

> > 2) The regex support like in the "gold" package that I can no longer
> > find.
>
> No; however you may use shell globs, see (wget)Accept/Reject Options.
>
> > 3) Multi-threading.
>
> I suppose you mean downloading several URIs in parallel. No, wget
> doesn't support that. Sometimes, however, one may start several wgets
> in parallel, thanks to the shell (the & operator in Bourne shells).

No, I mean downloading multiple files from the SAME URI in parallel, instead of downloading files one-by-one-by-one (thus saving time on a fast pipe).

> > Also, I have in the past encountered a difficulty with the ~ being
> > escaped the wrong way; has this been fixed? I know at one point one
> > site suggested you modify url.c to "fix" this.
>
> AFAIK I have never had that problem; maybe it has been fixed.

I remember the problem now. I was trying to mirror homepages.go.com/~something, and for whatever reason wget would follow a link to homepages.go.com/~somethingelse and parse it out to homepages.go.com/%7esomethingelse, which for some reason the web server DIDN'T like. Because the tilde character IS passable in a URL, I was able to use the Windows version, which didn't have this behavior.

> > Finally, is there a way to utilize the persistent cookie file that
> > lynx generates to "feed" wget?
>
> There is the --load-cookies=FILE option; see (wget)HTTP Options.
>
> How to read the Info documentation: type "info wget" from a shell.
> The "?" key may help you. Use "gHTTP Options" to go to the node
> "HTTP Options".

Hrmm, one other thought: is there support for sftp in wget?

-Dan Mahoney

--
"If you aren't going to try something, then we might as well just be friends." "We can't have that now, can we?" -SK & Dan Mahoney, December 9, 1998

Dan Mahoney
Techie, Sysadmin, WebGeek
Gushi on efnet/undernet IRC
ICQ: 13735144  AIM: LarpGM
Web: http://prime.gushi.org
finger [EMAIL PROTECTED] for pgp public key and tel#

---
Re: Referrer Faking and other nifty features
On 2002-04-03 08:50 -0500, Dan Mahoney, System Admin wrote:

> > > 1) referrer faking (i.e., wget automatically supplies a referrer
> > > based on the, well, referring page)
> >
> > It is the --referer option, see (wget)HTTP Options, from the Info
> > documentation.
>
> Yes, that allows me to specify _A_ referrer, like www.aol.com. When I'm
> trying to help my users mirror their old Angelfire pages or something
> like that, very often the link has to come from the same directory. I'd
> like to see something where, when wget follows a link to another page or
> another image, it automatically supplies the URL of the page it followed
> to get there. Is there a way to do this?

Somebody already asked for this and, AFAICT, there's no way to do that.

> > > 3) Multi-threading.
> >
> > I suppose you mean downloading several URIs in parallel. No, wget
> > doesn't support that. Sometimes, however, one may start several wgets
> > in parallel, thanks to the shell (the & operator in Bourne shells).
>
> No, I mean downloading multiple files from the SAME URI in parallel,
> instead of downloading files one-by-one-by-one (thus saving time on a
> fast pipe).

This doesn't make sense to me. When downloading from a single server, the bottleneck is generally either the server or the link; in either case, there's nothing to gain by attempting several simultaneous transfers. Unless there are several servers at the same IP and the bottleneck is the server, not the link?

--
André Majorel <http://www.teaser.fr/~amajorel/>
std::disclaimer ("Not speaking for my employer");
Re: Referrer Faking and other nifty features
On Wed, 3 Apr 2002, Andre Majorel wrote:

> > No, I mean downloading multiple files from the SAME URI in parallel,
> > instead of downloading files one-by-one-by-one (thus saving time on a
> > fast pipe).
>
> This doesn't make sense to me. When downloading from a single server,
> the bottleneck is generally either the server or the link; in either
> case, there's nothing to gain by attempting several simultaneous
> transfers. Unless there are several servers at the same IP and the
> bottleneck is the server, not the link?

That would also violate the RFCs with regard to how to behave as a Good User Agent (tm): never use more than at most two connections to the same host from a single client.

What does speed things up, though, is persistent connections. One might also argue that pipelining can help a little too.

--
Daniel Stenberg - http://daniel.haxx.se - +46-705-44 31 77
ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol
Re: Referrer Faking and other nifty features
Andre Majorel wrote:

> > Yes, that allows me to specify _A_ referrer, like www.aol.com. When I'm
> > trying to help my users mirror their old Angelfire pages or something
> > like that, very often the link has to come from the same directory. I'd
> > like to see something where, when wget follows a link to another page or
> > another image, it automatically supplies the URL of the page it followed
> > to get there. Is there a way to do this?
>
> Somebody already asked for this and, AFAICT, there's no way to do
> that.

Not only is it possible, it is the behavior (at least in wget 1.8.1). If you run with -d, you will see that every GET after the first one includes the appropriate Referer. If I execute:

    wget -d -r http://www.exelana.com --referer=http://www.aol.com

the first request is reported as:

    GET / HTTP/1.0
    User-Agent: Wget/1.8.1
    Host: www.exelana.com
    Accept: */*
    Connection: Keep-Alive
    Referer: http://www.aol.com

but the third request is:

    GET /left.html HTTP/1.0
    User-Agent: Wget/1.8.1
    Host: www.exelana.com
    Accept: */*
    Connection: Keep-Alive
    Referer: http://www.exelana.com/

(The second request is for robots.txt and uses the referer from the command line.)

Tony
Re: Referrer Faking and other nifty features
On 2002-04-03 08:02 -0800, Tony Lewis wrote:

> > > Yes, that allows me to specify _A_ referrer, like www.aol.com.
> > > When I'm trying to help my users mirror their old Angelfire pages
> > > or something like that, very often the link has to come from the
> > > same directory. I'd like to see something where, when wget follows
> > > a link to another page or another image, it automatically
> > > supplies the URL of the page it followed to get there. Is there a
> > > way to do this?
> >
> > Somebody already asked for this and, AFAICT, there's no way to do
> > that.
>
> Not only is it possible, it is the behavior (at least in wget 1.8.1).
> If you run with -d, you will see that every GET after the first one
> includes the appropriate Referer.

Right, for some reason I thought it used the provided referer throughout. Thanks for the correction!

--
André Majorel <http://www.teaser.fr/~amajorel/>
std::disclaimer ("Not speaking for my employer");
Re: Referrer Faking and other nifty features
On 3 Apr 2002 at 16:10, Andre Majorel wrote:

> On 2002-04-03 08:50 -0500, Dan Mahoney, System Admin wrote:
>
> > > > 1) referrer faking (i.e., wget automatically supplies a referrer
> > > > based on the, well, referring page)
> > >
> > > It is the --referer option, see (wget)HTTP Options, from the Info
> > > documentation.
> >
> > Yes, that allows me to specify _A_ referrer, like www.aol.com. When I'm
> > trying to help my users mirror their old Angelfire pages or something
> > like that, very often the link has to come from the same directory. I'd
> > like to see something where, when wget follows a link to another page or
> > another image, it automatically supplies the URL of the page it followed
> > to get there. Is there a way to do this?
>
> Somebody already asked for this and, AFAICT, there's no way to do
> that.

Wget is supposed to pass on the referring page when following links, but Wget 1.8 had a little bug that stopped this happening. The bug is fixed in Wget 1.8.1.

ISTR that the somebody that Andre refers to wanted to be able to specify different referring URLs for each URL specified on the command line.
Re: Referrer Faking and other nifty features
"Dan Mahoney, System Admin" <[EMAIL PROTECTED]> writes:

>> It is the --referer option, see (wget)HTTP Options, from the Info
>> documentation.
>
> Yes, that allows me to specify _A_ referrer, like www.aol.com. When
> I'm trying to help my users mirror their old Angelfire pages or
> something like that, very often the link has to come from the same
> directory. I'd like to see something where, when wget follows a link
> to another page or another image, it automatically supplies the URL
> of the page it followed to get there. Is there a way to do this?

Doesn't Wget do so by default?

>> > 3) Multi-threading.
>>
>> I suppose you mean downloading several URIs in parallel. No, wget
>> doesn't support that. Sometimes, however, one may start several wgets
>> in parallel, thanks to the shell (the & operator in Bourne shells).
>
> No, I mean downloading multiple files from the SAME URI in parallel,
> instead of downloading files one-by-one-by-one (thus saving time on
> a fast pipe).

Wget will almost certainly never be multithreaded, but I might introduce options to make this kind of thing easier by using multiple processes.

>> > Also, I have in the past encountered a difficulty with the ~
>> > being escaped the wrong way; has this been fixed? I know at one
>> > point one site suggested you modify url.c to "fix" this.
>>
>> AFAIK I have never had that problem; maybe it has been fixed.
>
> I remember the problem now. I was trying to mirror
> homepages.go.com/~something, and for whatever reason wget would
> follow a link to homepages.go.com/~somethingelse and parse it out to
> homepages.go.com/%7esomethingelse, which for some reason the
> web server DIDN'T like.

That sounds like an extremely broken web server. %xx has always been a valid URL encoding. For example, the only way to request a file with spaces in its name is to encode the spaces as %20.
Re: Referrer Faking and other nifty features
> > > > 3) Multi-threading.
> > >
> > > I suppose you mean downloading several URIs in parallel. No, wget
> > > doesn't support that. Sometimes, however, one may start several
> > > wgets in parallel, thanks to the shell (the & operator in Bourne
> > > shells).
> >
> > No, I mean downloading multiple files from the SAME URI in parallel,
> > instead of downloading files one-by-one-by-one (thus saving time on
> > a fast pipe).
>
> Wget will almost certainly never be multithreaded, but I might
> introduce options to make this kind of thing easier by using multiple
> processes.

Hi,

I think this feature does not need multithreading. As long as the procedures do not rely on shared state, it should be possible to handle it with select(). That would mean a FIFO with the URLs to fetch (I think this already exists) and a list of 1-16 handles for connections, each assigned to functions that handle them:

    struct connection {
        int fd;                    /* socket descriptor, or -1 when free */
        int wait_read;             /* interested in readability          */
        int wait_write;            /* interested in writability          */
        int (*do_read)(int fd);    /* called when fd is readable         */
        int (*do_write)(int fd);   /* called when fd is writable         */
    };

So when one fd becomes -1, the "loader" would take a new URL and initiate the download, and scheduling would work with select(). What about this idea?

Cu
Thomas Lußnig
Re: Referrer Faking and other nifty features
On 12 Apr 2002 at 17:21, Thomas Lussnig wrote:

> So when one fd becomes -1, the "loader" would take a new URL and
> initiate the download, and scheduling would work with select().
> What about this idea?

It would certainly make handling the logging output a bit of a challenge, especially the progress indication.
Re: Referrer Faking and other nifty features
"Ian Abbott" <[EMAIL PROTECTED]> writes:

> It would certainly make handling the logging output a bit of a
> challenge, especially the progress indication.

It would also require a completely different sort of organization, one based on a central event loop. There are programs that work that way, such as `lftp', but Wget is not one of them and I don't think it will become one any time soon.

I would much prefer to invest time into writing better HTTP and FTP backends, and supporting more protocols.
Re: Referrer Faking and other nifty features
> It would also require a completely different sort of organization, one
> based on a central event loop. There are programs that work that way,
> such as `lftp', but Wget is not one of them and I don't think it will
> become one any time soon.
>
> I would much prefer to invest time into writing better HTTP and FTP
> backends, and supporting more protocols.

There are two protocols which are very similar, and one of them may be interesting: NNTP and IMAP. Both of them could be represented in the same URL style as http and ftp; they also have folders and "files".

What could also be nice is access to the P2P networks, but there is the "leech" problem, because wget is only intended to GET and not to share.

Or what would be interesting protocols to support?

Cu
Thomas Lußnig