recursive mirror of new children w/old parent broken (wget-1.8)

2002-01-28 Thread Vassilii Khachaturov

Hi,
I am using the following script to mirror pages from
http://newton.uam.mx/xgeorge/
in the local directory (served as http://www.tarunz.org/~xgeorge/):

wget -q -t10 -c -T120 -nH --cut-dirs=1 -r -m -np http://newton.uam.mx/xgeorge/

Now, if the remote directory index document (mapped to the local
index.html) is not newer than its mirrored local copy, the mirror
process stops; whereas I am expecting it to also check if the
child pages have to be mirrored. This is exhibited by the attached
'stale.log.gz'. If I forcibly remove the local index.html,
it does mirror index.html, and then proceeds on to properly mirroring
the changed child documents (as seen in the second attachment,
'forced.log.gz').

I hope this is enough to describe the bug;
but I will be happy to answer any additional questions
or test any patch for you.

Kind regards,
Vassilii



stale.log.gz
Description: stale.log (GZIP-ped)


forced.log.gz
Description: forced.log (GZIP-ped)


Re: recursive mirror of new children w/old parent broken (wget-1.8)

2002-01-29 Thread Vassilii Khachaturov

This workaround won't solve the generic case, IMHO.
As far as I understand (unless I'm missing something obvious),
right now, when timestamping is on, the children of any document that is
judged to be already mirrored are never mirrored themselves. So if, say, a
second-generation document is up to date but one of its descendants needs
mirroring, I have to *guess in advance* which branch this will happen on in
order to apply the workaround (and forcibly re-mirror a document that is
already mirrored correctly). This renders the whole mirroring logic
fundamentally flawed, doesn't it? The point of mirroring is precisely to
AVOID re-downloading files that have already been downloaded correctly.
Current wget gets this right only for the "leaf" pages.

The solution IMHO should be simple: when a file is deemed already mirrored,
the download should be replaced by a read of the local copy, with all the
associated parsing and queuing of its children. This is what other mirroring
software does (cf. the mm+mirror suite in Perl). I think this should be the
default logic with -m; I can't imagine a scenario where "leaf mirroring", the
current state of affairs, would be preferred.
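
The proposal above can be sketched as follows. This is only a toy model in
Python, not wget's actual code: the server and the mirror directory are plain
dicts, modification times are integers, and the names mirror, extract_links
and parse_fresh_pages are invented for illustration. With
parse_fresh_pages=False the loop reproduces the behaviour reported in this
thread; with True it reads the up-to-date local copy and still queues its
children.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect href targets from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in `html`."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def mirror(root, remote, local, parse_fresh_pages=True):
    """Breadth-first mirror of the pages reachable from `root`.

    remote: dict url -> (mtime, html), a stand-in for the server
    local:  dict url -> (mtime, html), a stand-in for the mirror directory
    Returns the list of URLs that were actually downloaded.
    """
    downloaded = []
    seen = {root}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if url not in remote:
            continue  # broken link: nothing to fetch
        remote_mtime, remote_html = remote[url]
        if url in local and local[url][0] >= remote_mtime:
            # Local copy is up to date, so the download is skipped.
            if not parse_fresh_pages:
                continue  # reported behaviour: children are never queued
            html = local[url][1]  # proposed fix: parse the local copy instead
        else:
            local[url] = (remote_mtime, remote_html)
            downloaded.append(url)
            html = remote_html
        for child in extract_links(html, url):
            # The prefix test is a rough stand-in for -np (no parent).
            if child.startswith(root) and child not in seen:
                seen.add(child)
                queue.append(child)
    return downloaded


# Toy data: the index page is unchanged, one child page is newer on the server.
ROOT = "http://example.invalid/site/"
REMOTE = {
    ROOT: (100, '<a href="a.html">a</a>'),
    ROOT + "a.html": (200, "<p>updated</p>"),
}
LOCAL = {
    ROOT: (100, '<a href="a.html">a</a>'),
    ROOT + "a.html": (150, "<p>old</p>"),
}

print(mirror(ROOT, REMOTE, dict(LOCAL), parse_fresh_pages=False))  # []
print(mirror(ROOT, REMOTE, dict(LOCAL), parse_fresh_pages=True))
# ['http://example.invalid/site/a.html']
```

In the first call the stale child is missed because the unchanged index page
is never parsed; in the second call only the child is transferred, and the
index page is not downloaded again.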

Vassilii
- Original Message - 
From: "Herold Heiko" <[EMAIL PROTECTED]>
To: "'Vassilii Khachaturov'" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, January 29, 2002 2:52 AM
Subject: RE: recursive mirror of new children w/old parent broken (wget-1.8)


In other words, you'd want to NOT do timestamping (-N or --timestamping,
implicitly turned on by -m) for the first file but do it for later
ones. I don't think you can do that currently with one invocation of
wget.

What you could do is change your script: first download only that first
file, with -N but without recursion (so no -r, and no -m, which implies -r);
then start wget a second time with
-m -r -F -i path_to_index.html -B http://newton.uam.mx/xgeorge/
in order to check all those secondary links (-F makes wget read the -i file
as HTML, which -B needs in order to resolve its relative links). That should
not be a problem since you are using a script anyway.

Heiko

> -Original Message-
> From: Vassilii Khachaturov [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, January 29, 2002 12:11 AM
> To: [EMAIL PROTECTED]
> Subject: recursive mirror of new children w/old parent broken 
> (wget-1.8)
> 
> 
> Hi,
> I am using the following script to mirror pages from
> http://newton.uam.mx/xgeorge/
> in the local directory (served as http://www.tarunz.org/~xgeorge/):
> 
> wget -q -t10 -c -T120 -nH --cut-dirs=1 -r -m -np 
> http://newton.uam.mx/xgeorge/
> 
> Now, if the remote directory index document (mapped to the local
> index.html) is not newer than its mirrored local copy, the mirror
> process stops; whereas I am expecting it to also check if the
> child pages have to be mirrored. This is exhibited by the attached
> 'stale.log.gz'. If I forcibly remove the local index.html,
> it does mirror index.html, and then proceeds on to properly mirroring
> the changed child documents (as seen in the second attachment,
> 'forced.log.gz').
> 
> I hope this is enough to describe the bug;
> but I will be happy to answer any additional questions
> or test any patch for you.
> 
> Kind regards,
> Vassilii