Re: MirrorManager crawler patch

2009-07-20 Thread Bruno Wolff III
On Mon, Jul 20, 2009 at 01:27:11 -0400,
  Ricky Zhou ri...@fedoraproject.org wrote:
> 2) MirrorManager currently doesn't check timestamps, and the solution to
>    this isn't trivial, especially with FTP, which returns directory
>    listings as plain text.  That text is almost impossible to parse
>    accurately, especially when time zones are involved, since FTP
>    doesn't even return time zone data.

Maybe you could check a hash of the repomd.xml file? You shouldn't have to
track too many different hashes.

___
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list


Re: MirrorManager crawler patch

2009-07-20 Thread Ricky Zhou
On 2009-07-20 09:34:47 AM, Bruno Wolff III wrote:
> > 2) MirrorManager currently doesn't check timestamps, and the solution to
> >    this isn't trivial, especially with FTP, which returns directory
> >    listings as plain text.  That text is almost impossible to parse
> >    accurately, especially when time zones are involved, since FTP
> >    doesn't even return time zone data.
>
> Maybe you could check a hash of the repomd.xml file? You shouldn't have to
> track too many different hashes.
For what it's worth, this hash checking already happens on repomd.xml
files for mirrors that are crawled via HTTP, and my patch added that
check to FTP mirrors as well.  
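
A minimal sketch of the kind of comparison being discussed, not the
crawler's actual code: hash the master copy of repomd.xml and the copy
fetched from a mirror, then compare digests.  The function names are
made up for illustration, and how the bytes are fetched (HTTP or FTP)
is left out on purpose.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def repomd_is_current(master_repomd: bytes, mirror_repomd: bytes) -> bool:
    """True when the mirror's repomd.xml hashes to the same value as the master's."""
    return sha256_hex(master_repomd) == sha256_hex(mirror_repomd)
```

Because only the digest is compared, the transport used to fetch the
file does not matter, and no timestamps are involved.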

When talking on IRC with Matt, we realized that the check shouldn't be
necessary at all though, since the other files in the repodata are
successfully getting the repodata directory marked outdated (and we did
confirm that this was happening with the last bu mirror crawl).

Overall, I think the crawling has been working fine even without the
timestamp checking (apart from some issues caused by the timestamp
problem we recently saw); I just wanted to mention why that check is
currently disabled.

As another side note, mirrormanager is currently aware of what
directories are repositories:

 <mdomsch> sure
 <mdomsch> so, MM does know that that dir is a repository
 <mdomsch> class Directory:  repository = SingleJoin('Repository')
 <mdomsch> but the crawler doesn't do anything special with that knowledge
 <mdomsch> perhaps it should
 <mdomsch> by definition, a Repository is a Directory that has a child
           directory named 'repodata'
 <mdomsch> but the whole directory tree starting at that Directory down, is
           part of the repository

So all of the framework should be in place for marking an entire
repository out of date if the repodata is out of date.  
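
As a rough sketch of what that could look like (this is not
MirrorManager code; the dict of path -> up-to-date flag and the
function name are assumptions for illustration): if a repodata
directory is outdated, mark its parent and everything under the parent
as outdated too.

```python
from typing import Dict

REPODATA_SUFFIX = "/repodata"


def mark_repository_outdated(status: Dict[str, bool], repodata_path: str) -> None:
    """If repodata_path is outdated, mark its whole repository tree outdated.

    `status` maps directory paths to an up-to-date flag (hypothetical model).
    """
    if not repodata_path.endswith(REPODATA_SUFFIX):
        raise ValueError("expected a path ending in /repodata")
    if status.get(repodata_path, True):
        return  # repodata is current, nothing to propagate
    repo_root = repodata_path[: -len(REPODATA_SUFFIX)]
    for path in status:
        # Match the repository root itself and anything beneath it.
        if path == repo_root or path.startswith(repo_root + "/"):
            status[path] = False
```

The `repo_root + "/"` prefix test keeps sibling directories that merely
share a name prefix from being marked by accident.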

Thanks,
Ricky




MirrorManager crawler patch

2009-07-19 Thread Ricky Zhou
Hey, I think we might have found a bug in the mirror crawler where it
did not do the repomd sha256sum check if a mirror is checked via FTP.  I
think the crawler might still need a good bit of cleanup apart from
this, but here is an initial attempt at a patch to fix this:

http://ricky.fedorapeople.org/mirrormanager/0001-Check-sha256sum-of-repomd.xml-on-FTP-mirrors.patch

I did some testing of this against fedora.bu.edu on bapp1 and it seemed
to mark the F10/F11 repodata directories outdated as expected.  

Given the mirror issues that we've been having recently, it might be
worth considering live patching this until the next mirrormanager
release.

Thanks,
Ricky




Re: MirrorManager crawler patch

2009-07-19 Thread Ricky Zhou
On 2009-07-20 12:13:41 AM, Ricky Zhou wrote:
> Hey, I think we might have found a bug in the mirror crawler where it
> did not do the repomd sha256sum check if a mirror is checked via FTP.  I
> think the crawler might still need a good bit of cleanup apart from
> this, but here is an initial attempt at a patch to fix this:
>
> http://ricky.fedorapeople.org/mirrormanager/0001-Check-sha256sum-of-repomd.xml-on-FTP-mirrors.patch
I just took a closer look at this with Matt, and it turns out that my
extra code in this patch shouldn't be necessary (and in fact, doesn't
seem to run at all).  I'm going to look at testing this more on another
outdated site.

Thanks,
Ricky




Re: MirrorManager crawler patch

2009-07-19 Thread Ricky Zhou
On 2009-07-20 12:28:34 AM, Ricky Zhou wrote:
> I just took a closer look at this with Matt, and it turns out that my
> extra code in this patch shouldn't be necessary (and in fact, doesn't
> seem to run at all).  I'm going to look at testing this more on another
> outdated site.
Hi, Matt and I just spoke on IRC more, and I think we have a slightly
better idea of the issue now.  I think mirrormanager returning outdated
mirrors might have actually been related to the mounts issue as well.

One issue that we realized was that for F10 updates, yum uses a URL
similar to

http://mirrors.fedoraproject.org/mirrorlist?repo=updates-released-f$releasever&arch=$basearch

to generate the directory to get repodata from.  However, this returns
the path to pub/fedora.redhat/linux/updates/10/x86_64 on mirrors.  That
directory may be considered up to date (mirrormanager only checks the 10
newest files in it, and the recent timestamp issues may have made this
test unreliable) even when
pub/fedora.redhat/linux/updates/10/x86_64/repodata is not.
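
For reference, here is a small sketch of how that mirrorlist URL comes
together once yum has expanded $releasever and $basearch.  The function
name is made up; the host and the repo/arch parameters are the ones
from the URL above.

```python
from urllib.parse import urlencode


def mirrorlist_url(releasever: str, basearch: str) -> str:
    """Build the updates mirrorlist URL for a given release and architecture."""
    query = urlencode({
        "repo": f"updates-released-f{releasever}",
        "arch": basearch,
    })
    return f"http://mirrors.fedoraproject.org/mirrorlist?{query}"
```

For F10 on x86_64 this would be called as mirrorlist_url("10", "x86_64").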

In the case of the bu mirror, we found that
pub/fedora.redhat/linux/updates/10/x86_64/repodata was properly marked
outdated, but pub/fedora.redhat/linux/updates/10/x86_64 was not.

Another issue that Matt mentioned is that report_mirror will tell
mirrormanager to mark any directory that the site claims to have as
up2date, the idea being that mirrors run rsync && report_mirror.  This
does seem to cause problems during mass mirror issues like this though,
and Matt also pointed out that some mirrors may run report_mirror even
if the rsync fails.
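
A hedged sketch of the intended "only report after a successful sync"
behavior, for mirrors that drive this from a script.  The function name
and the injectable `run` parameter are assumptions for illustration;
the actual rsync and report_mirror command lines are site-specific and
not shown here.

```python
import subprocess
from typing import Callable, Sequence


def sync_then_report(
    rsync_cmd: Sequence[str],
    report_cmd: Sequence[str],
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> bool:
    """Run report_cmd only if rsync_cmd exits with status 0.

    Returns True when both commands succeeded.
    """
    if run(rsync_cmd) != 0:
        return False  # sync failed, so do not claim to be up to date
    return run(report_cmd) == 0
```

Passing `run` as a parameter only makes the logic easy to test without
touching the network; by default it calls subprocess.call.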

Some issues/responses we discussed:
1) For the first issue, we need to mark an entire repository outdated if
   the repodata is outdated.  This should start happening properly as
   well now that the timestamps issue is fixed, although we can do this
   explicitly in the code as well.
2) MirrorManager currently doesn't check timestamps, and the solution to
   this isn't trivial, especially with FTP, which returns directory
   listings as plain text.  That text is almost impossible to parse
   accurately, especially when time zones are involved, since FTP
   doesn't even return time zone data.
3) Perhaps it could be good to change some behavior with report_mirror.
   Right now, when public mirrors run it, it gives the benefit of
   starting to send traffic to the mirror as soon as possible after
   syncing, but in situations like the current one, this behavior can
   lead to outdated mirrors being marked up2date in MirrorManager.
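
To illustrate point 2, here is a small sketch that pulls the timestamp
out of a listing line.  It assumes the server emits Unix "ls -l" style
output, which FTP servers are not required to do; the sample lines in
the usage note and the ListStamp/parse_list_stamp names are made up for
illustration.

```python
from typing import NamedTuple, Optional

MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}


class ListStamp(NamedTuple):
    month: int
    day: int
    year: Optional[int]    # missing for recent files
    hour: Optional[int]    # missing for older files
    minute: Optional[int]  # missing for older files


def parse_list_stamp(line: str) -> ListStamp:
    """Extract whatever date fields an ls-style listing line provides."""
    fields = line.split()
    month, day, last = MONTHS[fields[5]], int(fields[6]), fields[7]
    if ":" in last:
        # Recent file: the listing shows HH:MM but no year.
        hour, minute = (int(part) for part in last.split(":"))
        return ListStamp(month, day, None, hour, minute)
    # Older file: the listing shows the year but no time of day.
    return ListStamp(month, day, int(last), None, None)
```

A line such as "-rw-r--r-- 1 ftp ftp 2571 Jul 20 01:27 repomd.xml"
yields a time with no year, and "... Jul 20 2008 repomd.xml" yields a
year with no time.  Neither form carries a time zone, so two mirrors in
different zones can show different text for the same file.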

Thanks,
Ricky

