Referrer Faking and other nifty features

2002-04-02 Thread Dan Mahoney, System Admin

Hi, I was wondering if there was any support in wget for:

1) referrer faking (i.e., wget automatically supplies a referrer based on
the, well, referring page)

2) The regex support like in the "gold" package that I can no longer find.

3) Multi-threading.

Also, I have in the past encountered a difficulty with the ~ being escaped
the wrong way, has this been fixed?  I know at one point one site
suggested you modify url.c to "fix" this.

Finally, is there a way to utilize the persistent cookie file that lynx
generates to "feed" wget?

Thanks,

Dan Mahoney

--

"I wish the Real World would just stop hassling me!"

-Matchbox 20, Real World, off the album "Yourself or Someone Like You"


Dan Mahoney
Techie,  Sysadmin,  WebGeek
Gushi on efnet/undernet IRC
ICQ: 13735144   AIM: LarpGM
Web: http://prime.gushi.org
finger [EMAIL PROTECTED]
for pgp public key and tel#
---





Re: Referrer Faking and other nifty features

2002-04-02 Thread fabrice bauzac

Good morning,

Please note that I am only a wget user, so there may be errors below.

On Tue, Apr 02, 2002 at 11:50:03PM -0500, Dan Mahoney, System Admin
wrote:

> 1) referrer faking (i.e., wget automatically supplies a referrer
> based on the, well, referring page)

It is the --referer option, see (wget)HTTP Options, from the Info
documentation.

> 2) The regex support like in the "gold" package that I can no longer
> find.

No; however you may use shell globs, see (wget)Accept/Reject Options.

> 3) Multi-threading.

I suppose you mean downloading several URIs in parallel.  No, wget
doesn't support that.  Sometimes, however, one may start several wget
in parallel, thanks to the shell (the & operator on Bourne shells).
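For what it's worth, the same "several wgets in parallel" trick can be sketched as a small C helper: fork one child per URL, exec the downloader in each, then reap them all, just like `cmd &` followed by `wait` in the shell. The program name and URLs are placeholders here, not anything wget itself provides:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Spawn one child per URL; each child execs "prog URL".  Returns the
   number of children that exited with status 0. */
static int fetch_parallel(const char *prog, const char **urls, size_t n)
{
    size_t i, spawned = 0;
    int ok = 0;

    for (i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0)
            continue;                    /* fork failed: skip this URL */
        if (pid == 0) {                  /* child: like "prog URL &" */
            execlp(prog, prog, urls[i], (char *)NULL);
            _exit(127);                  /* exec failed */
        }
        spawned++;
    }

    for (i = 0; i < spawned; i++) {      /* parent: like the shell's "wait" */
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

Calling fetch_parallel("wget", urls, n) would then behave much like starting each wget with & by hand.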

> Also, I have in the past encountered a difficulty with the ~ being
> escaped the wrong way, has this been fixed?  I know at one point one
> site suggested you modify url.c to "fix" this.

AFAIK, I have never had that problem; maybe it has been fixed.

> Finally, is there a way to utilize the persistent cookie file that
> lynx generates to "feed" wget?

There is the --load-cookies=FILE option, see (wget)HTTP Options.

How to read the Info documentation: type "info wget" from a shell.
The "?" key may help you.  Use "gHTTP Options" to go to node "HTTP
Options".

Have a nice day.

-- 
fabrice bauzac
Software should be free.  http://www.gnu.org/philosophy/why-free.html



Re: Referrer Faking and other nifty features

2002-04-03 Thread Dan Mahoney, System Admin

On Wed, 3 Apr 2002, fabrice bauzac wrote:

> Good morning,
>
> Please note that I am only a wget user, so there may be errors below.
>
> On Tue, Apr 02, 2002 at 11:50:03PM -0500, Dan Mahoney, System Admin
> wrote:
>
> > 1) referrer faking (i.e., wget automatically supplies a referrer
> > based on the, well, referring page)
>
> It is the --referer option, see (wget)HTTP Options, from the Info
> documentation.

Yes, that allows me to specify _A_ referrer, like www.aol.com.  When I'm
trying to help my users mirror their old angelfire pages or something like
that, very often the link has to come from the same directory.  I'd like
to see something where when wget follows a link to another page, or
another image, it automatically supplies the URL of the page it followed
to get there.  Is there a way to do this?

> > 2) The regex support like in the "gold" package that I can no longer
> > find.
>
> No; however you may use shell globs, see (wget)Accept/Reject Options.
>
> > 3) Multi-threading.
>
> I suppose you mean downloading several URIs in parallel.  No, wget
> doesn't support that.  Sometimes, however, one may start several wget
> in parallel, thanks to the shell (the & operator on Bourne shells).

No, I mean downloading multiple files from the SAME uri in parallel,
instead of downloading files one-by-one-by-one (thus saving time on a fast
pipe).

>
> > Also, I have in the past encountered a difficulty with the ~ being
> > escaped the wrong way, has this been fixed?  I know at one point one
> > site suggested you modify url.c to "fix" this.
>
> AFAIK, I have never had that problem; maybe it has been fixed.

I remember the problem now.  I was trying to mirror
homepages.go.com/~something, and for whatever reason wget would follow a
link to homepages.go.com/~somethingelse and rewrite it as
homepages.go.com/%7esomethingelse, which for some reason the webserver
DIDN'T like.  Because the tilde character IS passable in a URL as-is, I
ended up using the Windows version, which didn't have this behavior.

>
> > Finally, is there a way to utilize the persistent cookie file that
> > lynx generates to "feed" wget?
>
> There is the --load-cookies=FILE option, see (wget)HTTP Options.
>
> How to read the Info documentation: type "info wget" from a shell.
> The "?" key may help you.  Use "gHTTP Options" to go to node "HTTP
> Options".

Hrmm, one other thought, is there support for sftp in wget?

-Dan Mahoney

--

"If you aren't going to try something, then we might as well just be
friends."

"We can't have that now, can we?"

-SK & Dan Mahoney,  December 9, 1998

Dan Mahoney
Techie,  Sysadmin,  WebGeek
Gushi on efnet/undernet IRC
ICQ: 13735144   AIM: LarpGM
Web: http://prime.gushi.org
finger [EMAIL PROTECTED]
for pgp public key and tel#
---





Re: Referrer Faking and other nifty features

2002-04-03 Thread Andre Majorel

On 2002-04-03 08:50 -0500, Dan Mahoney, System Admin wrote:

> > > 1) referrer faking (i.e., wget automatically supplies a referrer
> > > based on the, well, referring page)
> >
> > It is the --referer option, see (wget)HTTP Options, from the Info
> > documentation.
> 
> Yes, that allows me to specify _A_ referrer, like www.aol.com.  When I'm
> trying to help my users mirror their old angelfire pages or something like
> that, very often the link has to come from the same directory.  I'd like
> to see something where when wget follows a link to another page, or
> another image, it automatically supplies the URL of the page it followed
> to get there.  Is there a way to do this?

Somebody already asked for this and AFAICT, there's no way to do
that.

> > > 3) Multi-threading.
> >
> > I suppose you mean downloading several URIs in parallel.  No, wget
> > doesn't support that.  Sometimes, however, one may start several wget
> > in parallel, thanks to the shell (the & operator on Bourne shells).
> 
> No, I mean downloading multiple files from the SAME uri in parallel,
> instead of downloading files one-by-one-by-one (thus saving time on a fast
> pipe).

This doesn't make sense to me. When downloading from a single
server, the bottleneck is generally either the server or the link;
in either case, there's nothing to gain by attempting several
simultaneous transfers. Unless there are several servers at the
same IP and the bottleneck is the server, not the link?

-- 
André Majorel <http://www.teaser.fr/~amajorel/>
std::disclaimer ("Not speaking for my employer");



Re: Referrer Faking and other nifty features

2002-04-03 Thread Daniel Stenberg

On Wed, 3 Apr 2002, Andre Majorel wrote:

> > No, I mean downloading multiple files from the SAME uri in parallel,
> > instead of downloading files one-by-one-by-one (thus saving time on a
> > fast pipe).
>
> This doesn't make sense to me. When downloading from a single server, the
> bottleneck is generally either the server or the link; in either case,
> there's nothing to gain by attempting several simultaneous transfers. Unless
> there are several servers at the same IP and the bottleneck is the server,
> not the link?

That would also violate the RFCs' guidance on how to behave as a Good
User Agent (tm): never use more than 2 connections to the same host
from a single client.

What does speed things up, though, is persistent connections. One might also
argue that pipelining helps a little too.

-- 
  Daniel Stenberg - http://daniel.haxx.se - +46-705-44 31 77
   ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol




Re: Referrer Faking and other nifty features

2002-04-03 Thread Tony Lewis

Andre Majorel wrote:

> > Yes, that allows me to specify _A_ referrer, like www.aol.com.  When I'm
> > trying to help my users mirror their old angelfire pages or something like
> > that, very often the link has to come from the same directory.  I'd like
> > to see something where when wget follows a link to another page, or
> > another image, it automatically supplies the URL of the page it followed
> > to get there.  Is there a way to do this?
>
> Somebody already asked for this and AFAICT, there's no way to do
> that.

Not only is it possible, it is the behavior (at least in wget 1.8.1). If you
run with -d, you will see that every GET after the first one includes the
appropriate referer.

If I execute: wget -d -r http://www.exelana.com --referer=http://www.aol.com

The first request is reported as:
GET / HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.exelana.com
Accept: */*
Connection: Keep-Alive
Referer: http://www.aol.com

But, the third request is:
GET /left.html HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.exelana.com
Accept: */*
Connection: Keep-Alive
Referer: http://www.exelana.com/

The second request is for robots.txt and uses the referer from the command
line.
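For the curious, a request like the dumps above is just a block of text. Here is a throwaway sketch of assembling one; the header values are copied from the dump above, not taken from wget's source:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format an HTTP/1.0 GET request in the style of the wget -d dumps
   shown above.  Returns the number of characters written. */
static int build_request(char *buf, size_t len, const char *path,
                         const char *host, const char *referer)
{
    return snprintf(buf, len,
                    "GET %s HTTP/1.0\r\n"
                    "User-Agent: Wget/1.8.1\r\n"
                    "Host: %s\r\n"
                    "Accept: */*\r\n"
                    "Connection: Keep-Alive\r\n"
                    "Referer: %s\r\n"
                    "\r\n",
                    path, host, referer);
}
```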

Tony




Re: Referrer Faking and other nifty features

2002-04-03 Thread Andre Majorel

On 2002-04-03 08:02 -0800, Tony Lewis wrote:

> > > Yes, that allows me to specify _A_ referrer, like www.aol.com.
> > > When I'm trying to help my users mirror their old angelfire pages
> > > or something like that, very often the link has to come from the
> > > same directory. I'd like to see something where when wget follows
> > > a link to another page, or another image, it automatically
> > > supplies the URL of the page it followed to get there.  Is there a
> > > way to do this?
> >
> > Somebody already asked for this and AFAICT, there's no way to do
> > that.
> 
> Not only is it possible, it is the behavior (at least in wget 1.8.1).
> If you run with -d, you will see that every GET after the first one
> includes the appropriate referer.

Right, for some reason I thought it used the provided referer
throughout. Thanks for the correction!

-- 
André Majorel <http://www.teaser.fr/~amajorel/>
std::disclaimer ("Not speaking for my employer");



Re: Referrer Faking and other nifty features

2002-04-03 Thread Ian Abbott

On 3 Apr 2002 at 16:10, Andre Majorel wrote:

> On 2002-04-03 08:50 -0500, Dan Mahoney, System Admin wrote:
> 
> > > > 1) referrer faking (i.e., wget automatically supplies a referrer
> > > > based on the, well, referring page)
> > >
> > > It is the --referer option, see (wget)HTTP Options, from the Info
> > > documentation.
> > 
> > Yes, that allows me to specify _A_ referrer, like www.aol.com.  When I'm
> > trying to help my users mirror their old angelfire pages or something like
> > that, very often the link has to come from the same directory.  I'd like
> > to see something where when wget follows a link to another page, or
> > another image, it automatically supplies the URL of the page it followed
> > to get there.  Is there a way to do this?
> 
> Somebody already asked for this and AFAICT, there's no way to do
> that.

Wget is supposed to pass on the referring page when following
links, but Wget 1.8 had a little bug that stopped this happening.
The bug is fixed in Wget 1.8.1.

ISTR that the somebody that Andre refers to wanted to be able to
specify different referring URLs for each URL specified on the
command-line.



Re: Referrer Faking and other nifty features

2002-04-12 Thread Hrvoje Niksic

"Dan Mahoney, System Admin" <[EMAIL PROTECTED]> writes:

>> It is the --referer option, see (wget)HTTP Options, from the Info
>> documentation.
>
> Yes, that allows me to specify _A_ referrer, like www.aol.com.  When
> I'm trying to help my users mirror their old angelfire pages or
> something like that, very often the link has to come from the same
> directory.  I'd like to see something where when wget follows a link
> to another page, or another image, it automatically supplies the URL
> of the page it followed to get there.  Is there a way to do this?

Doesn't Wget do so by default?

>> > 3) Multi-threading.
>>
>> I suppose you mean downloading several URIs in parallel.  No, wget
>> doesn't support that.  Sometimes, however, one may start several wget
>> in parallel, thanks to the shell (the & operator on Bourne shells).
>
> No, I mean downloading multiple files from the SAME uri in parallel,
> instead of downloading files one-by-one-by-one (thus saving time on
> a fast pipe).

Wget will almost certainly never be multithreaded, but I might
introduce options to make this kind of thing easier by using multiple
processes.

>> > Also, I have in the past encountered a difficulty with the ~
>> > being escaped the wrong way, has this been fixed?  I know at one
>> > point one site suggested you modify url.c to "fix" this.
>>
>> AFAIK, I have never had that problem; maybe it has been fixed.
>
> I remember the problem now.  I was trying to mirror
> homepages.go.com/~something and for whatever reason, wget would
> follow a link to homepages.go.com/~somethingelse and parse it out to
> homepages.go.com/%7esomethingelse, which for some reason the
> webserver DIDN'T like

That sounds like an extremely broken web server.  %xx has always been
a valid URL encoding.  For example, the only way to request a file
with spaces in file name is to encode spaces as %20.
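A toy illustration of that rule (this is not the actual code in wget's url.c): encode anything outside a small safe set as %XX, while deliberately leaving ~ literal, which is what the go.com server apparently wanted.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Toy percent-encoder: keeps alphanumerics and a few safe characters
   (including ~) literal, and encodes everything else as %XX. */
static void percent_encode(const char *in, char *out, size_t outlen)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    for (; *in != '\0' && o + 4 < outlen; in++) {
        unsigned char c = (unsigned char)*in;
        if (isalnum(c) || strchr("-_.~/:", c) != NULL) {
            out[o++] = (char)c;          /* safe: copy through */
        } else {
            out[o++] = '%';              /* unsafe: %XX escape */
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0F];
        }
    }
    out[o] = '\0';
}
```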



Re: Referrer Faking and other nifty features

2002-04-12 Thread Thomas Lussnig

>>>> 3) Multi-threading.
>>>
>>> I suppose you mean downloading several URIs in parallel.  No, wget
>>> doesn't support that.  Sometimes, however, one may start several wget
>>> in parallel, thanks to the shell (the & operator on Bourne shells).
>>
>> No, I mean downloading multiple files from the SAME uri in parallel,
>> instead of downloading files one-by-one-by-one (thus saving time on
>> a fast pipe).
>
> Wget will almost certainly never be multithreaded, but I might
> introduce options to make this kind of thing easier by using multiple
> processes.
Hi,

I think there is no need for multithreading for this feature, as long as
the procedures don't rely on shared global state. It should be possible
to handle it with select(). That would mean a FIFO with the URLs to
fetch (I think this already exists) and a list of 1-16 handles for
connections, each assigned to a function that handles it:

struct connection {
    int fd;                    /* socket, or -1 when the slot is free */
    int wait_read;             /* want readability notification? */
    int wait_write;            /* want writability notification? */
    int (*do_write)(int fd);   /* called when fd is writable */
    int (*do_read)(int fd);    /* called when fd is readable */
};

So whenever one fd becomes -1, the "loader" takes a new URL from the
FIFO and initiates the next download.

Scheduling would then work with select(). What about this idea?
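A minimal sketch of that select() loop, using a read-only cut of the struct above. None of this is wget's actual code; the demo handler just copies incoming bytes into a buffer, and any connection fd (socket or pipe) would do:

```c
#include <assert.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

/* One slot per connection, a read-only cut of the struct above. */
struct slot {
    int fd;                     /* -1 means the slot is free */
    int (*do_read)(int fd);     /* called when fd is readable */
};

/* Demo handler: copy whatever arrives into demo_buf. */
static char demo_buf[64];
static int demo_read(int fd)
{
    ssize_t n = read(fd, demo_buf, sizeof demo_buf - 1);
    if (n > 0)
        demo_buf[n] = '\0';
    return (int)n;
}

/* Dispatch read events with select() until every fd is closed;
   returns the number of events handled. */
static int event_loop(struct slot *slots, int nslots)
{
    int events = 0;

    for (;;) {
        fd_set rfds;
        int i, maxfd = -1, active = 0;

        FD_ZERO(&rfds);
        for (i = 0; i < nslots; i++) {
            if (slots[i].fd < 0)
                continue;
            FD_SET(slots[i].fd, &rfds);
            if (slots[i].fd > maxfd)
                maxfd = slots[i].fd;
            active++;
        }
        if (active == 0)
            break;                       /* nothing left to schedule */
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
            break;

        for (i = 0; i < nslots; i++) {
            if (slots[i].fd >= 0 && FD_ISSET(slots[i].fd, &rfds)) {
                events++;
                if (slots[i].do_read(slots[i].fd) <= 0) {
                    close(slots[i].fd);  /* EOF or error: free the slot */
                    slots[i].fd = -1;    /* the "loader" could refill here */
                }
            }
        }
    }
    return events;
}
```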

Cu Thomas Lußnig




Re: Referrer Faking and other nifty features

2002-04-12 Thread Ian Abbott

On 12 Apr 2002 at 17:21, Thomas Lussnig wrote:

> So whenever one fd becomes -1, the "loader" takes a new URL from the
> FIFO and initiates the next download.
>
> Scheduling would then work with select(). What about this idea?

It would certainly make handling the logging output a bit of a
challenge, especially the progress indication.



Re: Referrer Faking and other nifty features

2002-04-12 Thread Hrvoje Niksic

"Ian Abbott" <[EMAIL PROTECTED]> writes:

> It would certainly make handling the logging output a bit of a
> challenge, especially the progress indication.

It would also require a completely different sort of organization, one
based on a central event loop.  There are programs that work that way,
such as `lftp', but Wget is not one of them and I don't think it will
become one any time soon.

I would much prefer to invest time into writing better http and ftp
backends, and supporting more protocols.



Re: Referrer Faking and other nifty features

2002-04-12 Thread Thomas Lussnig

>It would also require a completely different sort of organization, one
>based on a central event loop.  There are programs that work that way,
>such as `lftp', but Wget is not one of them and I don't think it will
>become one any time soon.
>
>I would much prefer to invest time into writing better http and ftp
>backends, and supporting more protocols.
>
There are two protocols which are very similar and might be
interesting: NNTP and IMAP. Both of them could be represented in the
same URL style as http and ftp, since they also have folders and
"files".

Access to the p2p networks could also be nice, but there is the "leech"
problem: wget is only intended to GET, not to share.

What other protocols would be interesting to support?

Cu Thomas Lußnig


