Hello, All.

> I notice that this suggestion is closely related to Dog's Empire's problem
> posted back on 12 June....
>
> I am trying to make a local copy of a website using wget.  It's working
> well, with one big problem:  many of the links in the site are passing
> around non-essential state information with varying values and sometimes
> varying order.  As a result, wget is downloading many multiple copies of
> identical pages.  On one overnight test run, I ended up with over 10,000
> identical copies of one page with filenames like these:

We ran into exactly the same problem, and for exactly the same reason.


> Now, obviously wget cannot be expected to know that all these different
> URLs produce the same exact file.  That sort of site-specific information
> is beyond its scope.  However, a custom external script could be written
> that contained that knowledge, if only wget was able to refer to it.

One more point here: it is already possible to use a regular expression to filter the filename, but we definitely need some kind of filter for the address string as well. Most sites these days are written with non-static technologies (forums and the like in particular), so the additional query parameters have to be filtered out; a sketch of what we mean follows below.
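
For illustration only, a single regular expression applied to the address string could already strip the kind of state parameters we mean. This is just a sketch in Python, and the parameter names (PHPSESSID, sid, s) are assumptions; every site uses its own:

    import re

    # Assumed "noise" parameters that carry only session state.
    NOISE = re.compile(r'(?:^|&)(?:PHPSESSID|sid|s)=[^&]*', re.IGNORECASE)

    def strip_noise(url):
        """Remove state-only query parameters so that identical pages
        end up with identical address strings."""
        if '?' not in url:
            return url
        base, query = url.split('?', 1)
        query = NOISE.sub('', query).lstrip('&')
        return base + '?' + query if query else base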


> What I'm thinking of is a command-line option giving wget the name of a
> URL validation script.  Each time wget gets a URL from a page, it passes
> it to the script.  After making whatever changes it deems necessary, the
> script returns the URL (or nothing) to wget.  Then wget examines the
> returned URL to decide whether to add it to the queue.
>
> Does this sound like a useful addition to wget?

Yes. This is exactly what we mean.
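
To make the idea concrete, here is a rough sketch of what such a validation script could look like. Everything about the interface is an assumption on our part (one URL per line on stdin, the rewritten URL or nothing on stdout, PHPSESSID/sid chosen only as example parameter names), since no such option exists in wget today:

    #!/usr/bin/env python3
    # Hypothetical URL filter: reads candidate URLs on stdin and prints
    # a canonical form on stdout; printing nothing means "skip this URL".
    import sys
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    DROP = {'phpsessid', 'sid', 's'}   # assumed state-only parameters

    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        parts = urlsplit(url)
        # Drop the noise parameters and sort the rest, so the same page
        # reached with a different parameter order maps to one URL.
        params = sorted((k, v) for k, v in parse_qsl(parts.query)
                        if k.lower() not in DROP)
        print(urlunsplit(parts._replace(query=urlencode(params))))

wget would then only need to compare the returned URL against the addresses it has already seen before deciding whether to add it to the queue.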

> Is there anyone familiar
> with wget internals who would be willing to implement this?

Unfortunately we can't; otherwise we would be happy to do it. We suspect only the original author could take this on, since there is too much code to examine before making changes.

Best regards.
Olga Lav
http://www.dogsempire.com
