On Dec 25, 2012, at 2:47 PM, Urs Thuermann wrote:

> Paul Sander <[email protected]> writes:
>
>> The specific reason for this is because CVS assumes that it was the
>> last to modify a file if its mod time matches the one recorded in
>> its Entries file. If it's quickly modified by something else, then
>> CVS may still think it's up to date and both "cvs update" and "cvs
>> commit" will produce incorrect results.
>>
>> There has been much discussion on this topic, and you can see
>> discussion of the rationale in the info-cvs archives.
>
> OK, I've looked up the topic in the archives. I assume it has already
> been suggested to change the "Entries" file format to use a hash
> instead of a time stamp. But I haven't seen this in the info-cvs
> archive. So wouldn't this be an option? Otherwise, I'd like a
> command line option to disable the sleep, probably with a BIG warning
> that it should only be used if you know what you do.
I think that using hashes might have been discussed, but I don't recall specific conversations. The reliability of hashes, even cryptographic ones, isn't foolproof either. Random file content is a simplifying assumption when designing applications around hashes, and source code isn't random content. The MD5 and SHA-1 hashes have been broken in ways that I believe match use cases describing the natural evolution of source code. This weakens the reliability claims of hashes to some degree, but truthfully I don't know to what extent. (Perhaps the effect is negligible in the real world, at least in projects for which CVS is used. Reducing the theoretical probability of collision by several orders of magnitude still leaves a huge number of files processed without incident.)

Anyway, if the use of hashes is limited strictly to replacing the timestamps in the Entries file (i.e., compute a hash when CVS writes the file to the sandbox and record it in the Entries file, then later recompute the hash and compare it to the recorded value), then the effect of a collision is the same as what we have observed with timestamps: incorrect behavior of subsequent operations because files are believed to be up to date when they are not. The difference is that the breakage will be deterministic and there will be no simple workaround, and the overhead of computing the hashes may become a factor. (Note that it is useless to store hashes in the RCS files due to keyword expansion, so you can't amortize part of the cost by storing them at commit time.)

> I have a script that calls cvs checkout hundreds to thousands of times
> and that causes the script to run for half an hour or so instead of a
> few seconds. The info-cvs archive also suggests using RCS tools
> instead of CVS. Is it guaranteed that the CVS repository files will
> always have RCS format and RCS tools will work on them?

What is your use case that requires you to invoke "cvs checkout" so many times?
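To make the hash idea concrete, here is a minimal sketch (not CVS code) of the timestamp-for-hash substitution, using sha256sum and a hypothetical side file named Entries.hash in place of the real Entries format:

```shell
# Sketch: record a content hash at write time, recompute it later to
# decide whether the file is up to date. Entries.hash is a hypothetical
# stand-in for the hash field this thread proposes adding to Entries.
mkdir -p /tmp/hashdemo && cd /tmp/hashdemo
printf 'int main(void) { return 0; }\n' > main.c

# At "checkout" time: record hash and file name.
sha256sum main.c > Entries.hash

# Later: --check --status exits 0 only if the content still matches.
if sha256sum --check --status Entries.hash; then
    echo "up to date"      # prints "up to date": file is unchanged
else
    echo "modified"
fi
```

Unlike a timestamp comparison, this check cannot be fooled by a quick edit within the filesystem's timestamp resolution; the cost is one full read and hash of each file per check.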
Over a 30-minute interval, roughly 900 invocations is the point where the sleeps begin to dominate execution time. At that rate, CVS' locking mechanisms are also significantly impacting performance.

Perhaps you are checking out each source file individually? If so, you should consider reducing the number of invocations. You can do this by checking out directory trees, or by specifying multiple paths on a single "cvs checkout" command line. Tags, or branch/timestamp pairs, are useful here for pinning the versions you want. The use of xargs might also be helpful.

If you have path/version pairs (or path/tag pairs, or even path/branch/timestamp triples) then you can use RCS directly and conjure the CVS meta-data yourself. As you have discovered, there is discussion of this method in the archives detailing why it's fast and reliable. I have used this method successfully myself.

To my knowledge, CVS uses the standard RCS file format. RCS produces warnings if newphrase extensions are used in certain contexts, e.g. in the initial admin section of the RCS file. My experience in that area is dated, so I don't know whether this is still an issue with current versions of either tool.
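For example, assuming your script has a list of paths to fetch (paths.txt and the module names below are hypothetical), xargs can pack many paths into one "cvs checkout" invocation, so you pay the per-invocation sleep and lock overhead once per batch instead of once per file. The last line substitutes echo for cvs purely to show the batching effect:

```shell
# Hypothetical list of paths that were being checked out one at a time.
printf 'module/a.c\nmodule/b.c\nmodule/c.c\n' > /tmp/paths.txt

# Slow pattern: one cvs invocation (and one sleep) per path.
#   while read p; do cvs checkout "$p"; done < /tmp/paths.txt
# Batched pattern: xargs appends as many paths as fit per invocation.
#   xargs cvs checkout < /tmp/paths.txt

# Demonstrate the batching with echo standing in for cvs:
xargs echo cvs checkout < /tmp/paths.txt
# prints: cvs checkout module/a.c module/b.c module/c.c
```

With thousands of paths, xargs would still split the list into a handful of invocations (limited by the system's argument-length limit), which is a large improvement over one invocation per file.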
