Re: [Catalog-sig] PyPI mirrors are all up to date

Donald Stufft Tue, 17 Apr 2012 03:55:08 -0700

On Tuesday, April 17, 2012 at 6:50 AM, Tarek Ziadé wrote:
> On 4/17/12 11:57 AM, [email protected] (mailto:[email protected]) wrote:
> > > by calculating the grand hash of each file hash.
> >  
> >  
> > In this case, the checksum would not be a reliable indication that the
> > files are actually up-to-date. For example, a mirror may keep updating
> > files into the wrong location (not the location that is then used to
> > serve the files), so that the files being served are from a stale copy.
> > This is not theoretical - it actually happened in my mirror setup at one
> > time.
> >  
>  
> So you were updating a directory but serving another directory ?
>  
> But then updating the right last-modified page people were seeing ?
>  
> In that case, updating the checksum would have revealed you were on the  
> wrong set of files.
>  
> Unless your script was updating everything on a stale copy that was not  
> published?
>  
>  
> > > > That could take a few hours per change.
> > > Why is that? You don't calculate the checksum of a file you already  
> > > have twice.
> > >  
> > > Even if you do, it's very fast to call md5.
> > >  
> > > try it:
> > >  
> > > $ find mirror | xargs md5
> > >  
> > > this takes a few seconds at most on the whole mirror
> >  
> > I tried it, and on my mirror, it took 27 minutes and 7 seconds.
> > So not exactly hours, but not "a few seconds" either.
> >  
>  
> Oops, sorry, I ran it on the wrong directory. It's true that it takes more  
> time!
>  
> So on my CentOS 5 VM, which is quite slow and doing many other things  
> like running Jenkins jobs, I ran the "md5deep" program like this:  
> http://tarek.pastebin.mozilla.org/1574557
>  
> It took 15 minutes and 1 second. It can be optimized of course, since  
> most directories are done quickly and everything is in /source. That  
> time can be divided by 2 at least with proper load balancing between  
> a few md5 runners.
>  
> But that just has to be run *once*. You would not compute it on every mirror  
> update but keep all md5 values somewhere.
>  
> So, recalculating the grand hash on every mirror update should take a  
> few seconds because it would just consist of calculating the hash for  
> the new files, then calculating the grand hash -- a loop that updates  
> an md5 hash with 20k hashes takes less than a second if I don't count  
> the file reading.
>  
> (see http://tarek.pastebin.mozilla.org/1574574)
>  
> I am not sure why we're having this discussion since it's implementation  
> details, but it's fun :)
>  
> If there's interest I can write a multiprocess-based script that keeps an  
> md5 database up-to-date
>  
>


I'd be interested ;) Although I'd prefer sha256 personally.  
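
For reference, the incremental scheme described above (hash each file once, keep the values, and recompute only the grand hash on each update) could look roughly like the sketch below. This is not the script from the pastebin links; the function names, the in-memory dict used as the "database", and the use of size plus mtime to detect changed files are all assumptions made for illustration. The algorithm defaults to sha256 but any name accepted by hashlib works.

```python
import hashlib
import os


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash one file in chunks so large archives are not read into memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def update_cache(root, cache, algorithm="sha256"):
    """Refresh cache (relative path -> (size, mtime_ns, digest)) in place.

    Assumption: a file whose size and mtime are unchanged has unchanged
    content, so only new or modified files are re-read. Entries for files
    that no longer exist are dropped.
    """
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            st = os.stat(path)
            seen.add(rel)
            entry = cache.get(rel)
            if entry is None or entry[:2] != (st.st_size, st.st_mtime_ns):
                digest = file_digest(path, algorithm)
                cache[rel] = (st.st_size, st.st_mtime_ns, digest)
    for rel in set(cache) - seen:
        del cache[rel]
    return cache


def grand_hash(cache, algorithm="sha256"):
    """Combine the cached per-file digests, in sorted path order."""
    h = hashlib.new(algorithm)
    for rel in sorted(cache):
        h.update(os.fsencode(rel))
        h.update(b"\0")
        h.update(cache[rel][2].encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()
```

Calling `grand_hash(update_cache("mirror", cache))` after each sync would then only read files that are new or changed. Note that this only describes the directory the script walks; as pointed out earlier in the thread, it says nothing about whether that is the directory actually being served.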
>  
> Cheers
> Tarek
>  
> >  
> > Regards,
> > Martin
> >  
>  
>  
> _______________________________________________
> Catalog-SIG mailing list
> [email protected] (mailto:[email protected])
> http://mail.python.org/mailman/listinfo/catalog-sig
>  
>

