On 17/12/10 16:44, Trevor Vaughan wrote:
> I've been looking at the usage of MD5 checksums by Puppet and I think
> that there may be room for quite a bit of optimization.
I do agree.

> The clients seem to compute the MD5 checksum of all files and in
> catalog content every time they compare two files. What if:
>
> 1) The size of any known content is used as a first level comparison.
> Obviously, if the sizes differ, the files differ. I don't see this in
> 0.24.X, but I haven't checked 2.6.X.

That's more or less what rsync does. For sourced files we could even use
the HTTP If-Modified-Since and/or If-None-Match headers to perform the
check (and thus the check would be done server-side).

> 2) The *server* pre-computes checksums for all content items in File
> resources and passes those in the catalog, then only one MD5 sum needs
> to be calculated.

That's something I already noticed when I worked on the file streaming.
We're constantly checksumming files. For instance, we perform a full
checksum when writing a file, then once it is written we checksum it
again to make sure we wrote the file fully. At the time I left the code
as it was, but I think this might not be necessary.

> 3) When using the puppet server in a 'source' element, the server
> passes the checksum of the file on the server. If they differ, then
> the file is passed across to the client.

As I said earlier, we really could leverage the HTTP/1.1
If-Modified-Since/If-None-Match mechanism for this.

> 4) For ultimate speed, a direct comparison should be an option as a
> checksum type. Directly comparing the content of the in-memory file
> and the target file appears to be twice as fast as an MD5 checksum.
> This would not be feasible for a 'source'.

That might be faster, but please don't re-introduce the
slurp-the-whole-file-in-memory syndrome.

> These techniques will place more burden on the server, but may cut the
> CPU resources needed on the client by as much as half, based on some
> preliminary testing.
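[As an aside, the slurp-free direct comparison discussed above can be sketched in a few lines of Ruby. This is purely illustrative; the `same_content?` name and the 8 KiB block size are my own choices, not anything in Puppet:]

```ruby
# Illustrative sketch only, not Puppet code: compare two files block by
# block so that at most two small buffers are held in memory, instead of
# slurping both files whole.
BLOCK_SIZE = 8192 # 8 KiB, an arbitrary but common buffer size

def same_content?(path_a, path_b)
  # First-level check: if the sizes differ, the files differ.
  return false unless File.size(path_a) == File.size(path_b)

  File.open(path_a, 'rb') do |a|
    File.open(path_b, 'rb') do |b|
      loop do
        block_a = a.read(BLOCK_SIZE)
        block_b = b.read(BLOCK_SIZE)
        return true if block_a.nil? && block_b.nil? # both hit EOF together
        return false unless block_a == block_b
      end
    end
  end
end
```

[Because of the size check up front, the worst case for this comparison is two identical files, which is exactly the case Brice asks Trevor to benchmark.]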
>           user      system    total     real
> MD5:      0.810000  0.230000  1.040000 (  1.050886)
> MD52:     0.400000  0.120000  0.520000 (  0.525936)
> Hash:     0.550000  0.270000  0.820000 (  0.821033)
> Comp:     0.290000  0.120000  0.410000 (  0.407351)
>
> MD5  -> MD5 comparison of two 100M files
> MD52 -> MD5 comparison where one file has been pre-computed
> Hash -> Using String.hash to do the comparison
> Comp -> Direct comparison of the files

For Comp: did you read the files fully into RAM, or did you do it block
by block? If you read them fully, can you repeat the experiment reading
block by block (say, 8k at a time) with identical files (so that's your
worst case) and compare that to the in-memory solution?

For file-change comparison we might introduce some new checksums that
are far less CPU-hungry than full message digests. I'm really not an
expert in this, so maybe I'm completely wrong, but combining
size-change, mtime-change and a Fletcher/Adler or other CRC checksum
might give us what we want.

> If anyone can provide a quick and dirty hack to get these into Puppet,
> I'll be happy to test them.

That's really something I'd like to work on. Unfortunately this is
really complex stuff. The file type is one of the biggest types, and
even though I have already worked on it, I'm not sure I grasped enough
of it to be able to fully refactor it around a different inner working.
--
Brice Figureau
My Blog: http://www.masterzen.fr/

--
You received this message because you are subscribed to the Google
Groups "Puppet Developers" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/puppet-dev?hl=en.
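[As a footnote to the thread: the size/mtime/Adler-32 combination Brice floats above can be sketched with Ruby's standard zlib binding. All names here are hypothetical; nothing below is or was Puppet code:]

```ruby
require 'zlib'

# Hypothetical sketch of the cheap change-detection idea from the thread:
# consider a file unchanged when its size, mtime and a streaming Adler-32
# checksum all match a previously recorded fingerprint.
def light_fingerprint(path)
  adler = Zlib.adler32 # starting value (1)
  File.open(path, 'rb') do |f|
    while (chunk = f.read(8192))
      adler = Zlib.adler32(chunk, adler) # far cheaper per byte than MD5
    end
  end
  { size: File.size(path), mtime: File.mtime(path), adler32: adler }
end

def changed?(path, recorded)
  light_fingerprint(path) != recorded
end
```

[Adler-32 is useless as a cryptographic digest, but combined with size and mtime it only has to catch same-size, same-mtime edits, which is a much easier job than what MD5 is being asked to do on every run.]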
