Re: MirrorManager crawler patch

2009-07-20 Thread Bruno Wolff III
On Mon, Jul 20, 2009 at 01:27:11 -0400,
  Ricky Zhou ri...@fedoraproject.org wrote:
 2) MirrorManager currently doesn't check timestamps, and the solution to
this isn't trivial, especially with FTP, which returns directory
listing data as just the text of the output.  This is almost
impossible to parse accurately, especially when time zones are
involved, and when time zone data isn't even returned by FTP.

Maybe you could check a hash of the repomd.xml file? You shouldn't have to
track too many different hashes.

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: MirrorManager crawler patch

2009-07-20 Thread Ricky Zhou
On 2009-07-20 09:34:47 AM, Bruno Wolff III wrote:
  2) MirrorManager currently doesn't check timestamps, and the solution to
 this isn't trivial, especially with FTP, which returns directory
 listing data as just the text of the output.  This is almost
 impossible to parse accurately, especially when time zones are
 involved, and when time zone data isn't even returned by FTP.
 
 Maybe you could check a hash of the repomd.xml file? You shouldn't have to
 track too many different hashes.
For what it's worth, this hash checking already happens on repomd.xml
files for mirrors that are crawled via HTTP, and my patch added that
check to FTP mirrors as well.  
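
To illustrate the kind of check described above, here is a minimal sketch,
not the actual MirrorManager crawler code.  The function names and the
choice of SHA-256 are assumptions made for illustration; fetching the
file over HTTP or FTP is left out, and the functions just take the file
contents as bytes.

```python
import hashlib


def repomd_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a repomd.xml payload.

    SHA-256 is an assumption for this sketch; any stable hash would do.
    """
    return hashlib.sha256(data).hexdigest()


def mirror_is_current(mirror_repomd: bytes, master_digest: str) -> bool:
    """True if the mirror's repomd.xml hashes to the master's digest."""
    return repomd_digest(mirror_repomd) == master_digest


# Hypothetical file contents standing in for real repomd.xml files.
master = b"<repomd><revision>2</revision></repomd>"
stale = b"<repomd><revision>1</revision></repomd>"

master_digest = repomd_digest(master)
print(mirror_is_current(master, master_digest))
print(mirror_is_current(stale, master_digest))
```

Because only the file contents are compared, the check works the same
way no matter which protocol the mirror was crawled over, and it does
not depend on timestamps or time zones at all.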

When talking on IRC with Matt, we realized that the check shouldn't be
necessary at all though, since the other files in the repodata are
successfully getting the repodata directory marked outdated (and we did
confirm that this was happening with the last bu mirror crawl).

Overall, I think the crawling has been working fine even without the
timestamp checking (apart from some issues caused by the timestamp
problem we recently saw).  I just wanted to mention why it is
currently disabled.

As another side note, MirrorManager is already aware of which
directories are repositories:

 <mdomsch> sure
 <mdomsch> so, MM does know that that dir is a repository
 <mdomsch> class Directory:  repository = SingleJoin('Repository')
 <mdomsch> but the crawler doesn't do anything special with that knowledge
 <mdomsch> perhaps it should
 <mdomsch> by definition, a Repository is a Directory that has a child directory named 'repodata'
 <mdomsch> but the whole directory tree starting at that Directory down, is part of the repository

So all of the framework should be in place for marking an entire
repository out of date if the repodata is out of date.  
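
A rough sketch of what that could look like, assuming directories are
tracked as plain path strings.  This does not use MirrorManager's real
model classes; find_repositories and mark_outdated are hypothetical
names, and the paths are made up for the example.

```python
import posixpath


def find_repositories(directories):
    """Return the directories that have a child directory named 'repodata'."""
    dirs = set(directories)
    return {d for d in dirs if posixpath.join(d, "repodata") in dirs}


def mark_outdated(directories, outdated):
    """Extend `outdated` so a stale repodata marks its whole repository.

    If a repository's repodata directory is in `outdated`, every
    directory from the repository root down is added to the result.
    """
    result = set(outdated)
    for repo in find_repositories(directories):
        if posixpath.join(repo, "repodata") in outdated:
            prefix = repo + "/"
            result.update(
                d for d in directories if d == repo or d.startswith(prefix)
            )
    return result


# Hypothetical directory list for one mirror.
dirs = [
    "updates/11/i386",
    "updates/11/i386/repodata",
    "updates/11/i386/debug",
    "updates/11/x86_64",
    "updates/11/x86_64/repodata",
]
stale = mark_outdated(dirs, {"updates/11/i386/repodata"})
print(sorted(stale))
```

Here only the i386 tree ends up marked, since the x86_64 repodata was
not reported as out of date.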

Thanks,
Ricky

