On Tuesday, April 17, 2012 at 6:50 AM, Tarek Ziadé wrote:
> On 4/17/12 11:57 AM, [email protected] wrote:
> > > by calculating the grand hash of each file hash.
> >
> > In this case, the checksum would not be a reliable indication that the
> > files are actually up-to-date. For example, a mirror may keep updating
> > files into the wrong location (not the location that is then used to
> > serve the files), so that the files being served are from a stale copy.
> > This is not theoretical - it actually happened in my mirror setup at one
> > time.
>
> So you were updating a directory but serving another directory ?
>
> But then updating the right last-modified page people were seeing ?
>
> In that case, updating the checksum would have revealed you were on the
> wrong set of files.
>
> Unless your script was updating everything on a stale copy that was not
> published ?
>
> > > > That could take a few hours per change.
> > > why that ? you don't calculate the checksum of a file you already
> > > have twice.
> > >
> > > Even if you do, it's very fast to call md5.
> > >
> > > try it:
> > >
> > > $ find mirror | xargs md5
> > >
> > > this takes a few seconds at most on the whole mirror
> >
> > I tried it, and on my mirror, it took 27 minutes and 7 seconds.
> > So not exactly hours, but not "a few seconds" either.
>
> oops sorry I ran it on the wrong directory, it's true that it takes more
> time !
>
> So on my centos 5 VM - which is quite slow and doing many other things
> like running Jenkins jobs, running the "md5deep" program like this :
> http://tarek.pastebin.mozilla.org/1574557
>
> It took 15 minutes and 1 second. It can be optimized of course, since
> most directories are done quickly and everything is in /source. That
> time can be divided by 2 at least with the proper load balancing between
> a few md5 runners.
>
> But that is just to be run *once*. You would not compute it on every mirror
> update but keep all md5 values somewhere.
>
> So, recalculating the grand hash on every mirror update should take a
> few seconds because it would just consist of calculating the hash for
> the new files, then calculating the grand hash -- a loop that updates
> a md5 hash with 20k hashes takes less than a second if I don't count
> the file reading.
>
> (see http://tarek.pastebin.mozilla.org/1574574)
>
> I am not sure why we're having this discussion since it's implementation
> details, but it's fun :)
>
> If there's interest I can write a multiprocess-based script that keeps a
> md5 database up-to-date
>
I'd be interested ;) Although I'd prefer sha256 personally.

> Cheers
> Tarek
>
> > Regards,
> > Martin
>
> _______________________________________________
> Catalog-SIG mailing list
> [email protected]
> http://mail.python.org/mailman/listinfo/catalog-sig
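For what it's worth, here is a rough sketch of the "keep all the per-file hashes somewhere, then hash the hashes" idea from the quoted message, using sha256 by default. It is not the script from the pastebin links. The names `file_digest`, `update_cache` and `grand_hash` and the cache layout are made up for illustration, it is single-process rather than multiprocess, and it assumes that an unchanged size and mtime mean an unchanged file, which can miss an in-place edit that preserves both.

```python
import hashlib
import os


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash one file in chunks so large archives are not read into memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def update_cache(root, cache, algorithm="sha256"):
    """Refresh `cache` in place and return how many files were re-hashed.

    `cache` maps a path relative to `root` to [size, mtime_ns, hex digest].
    Assumption: a file is re-read only when its size or mtime changed.
    """
    seen = set()
    hashed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            st = os.stat(path)
            seen.add(rel)
            entry = cache.get(rel)
            if entry is None or entry[0] != st.st_size or entry[1] != st.st_mtime_ns:
                cache[rel] = [st.st_size, st.st_mtime_ns, file_digest(path, algorithm)]
                hashed += 1
    # Forget files that disappeared from the mirror.
    for rel in set(cache) - seen:
        del cache[rel]
    return hashed


def grand_hash(cache, algorithm="sha256"):
    """Hash the sorted (path, digest) pairs; no file contents are read here."""
    h = hashlib.new(algorithm)
    for rel in sorted(cache):
        h.update(os.fsencode(rel))
        h.update(b"\0")
        h.update(cache[rel][2].encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()
```

The cache is a plain dict of lists, so it could be kept between runs with `json.dump` and `json.load`. After the first full pass, a mirror update would only re-read new or changed files, and `grand_hash` itself only loops over the stored digests. Note that this still hashes whatever directory it is pointed at, so it does not by itself address the stale-copy case raised above.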
