Hello,

Following is a rather lengthy discussion of problems that we at Jungo LTD
have faced using CVS, and the solutions we used. We go into particular
detail on a new concept we call 'soft-tagging'. This concept was such a
success with our users that we feel obliged to spread the idea.

This first mail describes, in general terms, the issues we encountered
and the solutions we chose.

Jungo uses CVS for its SCM, handling ~40 GB of data in ~70K directories,
~300K files, and ~3000 tags/branches. We have ~100 developers using CVS,
automated build-and-test hosts using CVS, and all our infrastructure
servers using CVS as well to automatically fetch their configuration.
Jungo also uses an automated merging tool to merge bug fixes and features
between product branches.

We have encountered the following problems with CVS:
- Slow checkout times: over 1.5 hours.
- Slow update times: over 40 minutes.
- Slow tagging and branching: over 12 hours!!!
- Automated merging failed on some delicate cases due to subtle bugs in
  CVS logic:
  - Binary files
  - Adding a file to the MAIN trunk, then to a branch.
- A full checkout of the repository takes 5 GB, while a typical engineer
  only needs a 1 GB subset of the repository.

We reviewed the options of moving to another open-source SCM system (such
as Subversion), or to a commercial tool (such as ClearCase). In the end
we decided that the internal design of CVS is actually the best design,
so we preferred to just improve the (good) basic design of CVS, rather
than move to a completely new tool which has a lot of design problems.

All the solutions implemented by Jungo are 100% server side, thus
transparent to CVS clients, so there is no need to update CVS clients
working with the server to make use of the new features. The only feature
which requires a patched client is the Sticky Modules feature (described
later on), but for this feature too we have now come up with a new design
that, if implemented, will make Sticky Modules transparent to CVS
clients. A volunteer from the CVS community would be most welcome to
receive the design document from us and implement it.

Also, all the features have configuration flags to enable/disable them,
to keep behavior compatible with previous versions of CVS.
In addition, many sanity tests were added to validate the behavior of
these features.

After we implemented all the improvements, we ended up with a CVS server
system that performs very fast and in a highly reliable manner.

Following are the solutions used by Jungo to handle these problems:

-----------------------------------------------------------------
Problem: slow tagging and branching
Solution: immediate tagging and branching

Looking at the RCS files on the server, we found that on average the
tag/branch symbols info accounted for over 70% of the contents of the
RCS files! 3000 tags take a lot of room compared to a small typical C
source file...

The soft-tagging/soft-branching feature allows one to define tags/
branches using a BRANCH+DATE setting. This is a tremendous step forward
for the CVS server, allowing it to outperform any other SCM system in
tagging/branching time. It also allows CVS, for the first time, to keep
accurate track of when a TAG was logically made. Before soft-tagging, one
could only try to guess the tag's date by looking at all the files where
the tag was included.

This method works because, in a typical project, over 95% of the
tags/branches correspond to the same BRANCH+DATE for all their files.

Soft-tagging/soft-branching leaves intact the regular manual tagging
(now referred to as 'hard-tags' and 'hard-branches'...). Soft-tagging is
designed to live side by side with hard-tagging. It even allows
mixed-mode tags: if you added a soft-tag for the whole repository (300K
files) and now you want to manually move the tag on 2 of the files, you
can simply do a hard-tag on these two files to override the soft-tag!

The implementation originates in code from the newtags2 patch available
on the Savannah.gnu.org CVS server, extending the concept of built-in
(calculated) tags.

This is also related to the slow checkout problem, discussed below.

The design outline is at the end of this email.

TESTS: sanity.sh test name: soft-tags.
CONFIG: CVSROOT/stags: define here the soft-tags and soft-branches.
COMPATIBILITY: all clients

-----------------------------------------------------------------
Problem: binary file merging complains of a conflict even when the files
  are identical
Solution: corrected behavior when merging identical binary files

We understand that binary files cannot be merged when there are REAL
conflicts. The problem is that CVS does not handle binary file merges
even when the files are 100% identical and there is no real conflict!

In text files, if you do 'cvs update' and the server has a new revision
which is identical to your locally modified revision, cvs will say:
"Changes already in file".

So we also made the binary file merge check whether the locally modified
file is 100% identical to the newly merged file. If so, it allows the
merge.

TESTS: sanity.sh test name: binarymerge
CONFIG: CVSROOT/config:
          MergeBinaryIdentical=y
          MergeAddToNonDead=y
COMPATIBILITY: all clients

-----------------------------------------------------------------
Problem: 'cvs up -j -j' behaves incorrectly when a file is added to the
  trunk, and later on to a branch.
Solution: fixed a logical bug related to the way CVS creates revisions
  for new files on trunk and branches.

Jungo's automated merging system requires 'cvs up -j -j' to always work
100% correctly. With plain CVS it worked correctly 98% of the time,
leaving a few small issues handled incorrectly.

This issue is fixed by always starting with a dead revision - possibly
having revision 1.1 dead.

CVS 1.12.13 added a dead revision on the branch, which partly solves this
issue, but still leaves some cases uncovered.

'cvs update -rBRANCH:DATE' and 'cvs up -j -j' do not work correctly in
relation to branches.

Example:
Jan-07: Before this file was ever added, we tag the whole tree:
          cvs tag -b Jan-branch
Feb-07: We add a new file (cvs add) to trunk:
          1.1      Feb-07 alive
        Now we do 'cvs tag -b Feb-branch':
          1.1.0.2  alive (Feb-branch)
Mar-07: Then continue work on trunk:
          1.2      Mar-07 alive
Apr-07: modify on Feb-branch:
          1.1.2.1  Apr-07 alive (Feb-branch)
        and on Jan-branch (we add the file: cvs add):
          1.2.0.2  Jan-branch
          1.2.2.1  Apr-07 alive (Jan-branch)

Problem: revision 1.2 is not really the parent of 1.2.2.1. The real
parent should be 1.0 (not that such a revision exists...).
A simple way to see this problem: 'cvs up -rJan-branch:1-Mar-07' will
retrieve revision 1.2 (live), instead of a dead revision!
Since 'cvs up' gives incorrect results, 'cvs up -j -j' will give
incorrect results as well.

Solution: Always branch out pre-existing branches from BEFORE the first
live revision on the trunk: always make sure 1.1 is dead
(CreateTrunkDeadRev) and always branch out pre-existing branches from
1.1 (BranchFromBaseRev).

And what about cases where the RCS file already has a live 1.1, because
the RCS ,v file was created by an older version of the CVS server, before
this issue was fixed?
The solution is to branch from 1.1, create a DEAD revision on the branch
(1.1.2.1 dead and 1.1.2.2 live), and give the 1.1.2.1 revision the same
date as the 1.1 revision.
This 'backwards compatibility' hack solves the problem for existing
pre-patch CVS repositories.

TESTS: sanity.sh test name: createdeadrev
CONFIG: CVSROOT/config:
          CreateTrunkDeadRev=y
          BranchFromBaseRev=y
          MergeAddToNonDead=y
COMPATIBILITY: all clients

-----------------------------------------------------------------
Problem: doing 'cvs ci' after 'cvs rm' does not print out the revision
  being added to the RCS file
Solution: print the newly added revision for the remove operation too,
  for consistency with the other operations.

It is hard to guess the version for a file when it is removed,
especially when removing a file that had no real revision on the branch.

The 'cvs ci' message for a typical commit is:
  cvs_repository/dir/f1,v  <--  f1
  new revision: 1.3.2.1; previous revision: 1.3

The 'cvs ci' message for 'cvs rm' is:
  cvs_repository/dir/f1,v  <--  f1
  new revision: delete; previous revision: 1.3.2.2

Jungo's automated merging system relies on the output of 'cvs ci' to
learn the revision numbers added to the RCS file, so it is critical to
have 'cvs ci' print out the revision for ALL operations.

The modified format:
  cvs_repository/dir/f1,v  <--  f1
  new revision: delete 1.3.2.3; previous revision: 1.3.2.2

CONFIG: CommitShowDeletedVer=y
COMPATIBILITY: all clients

-----------------------------------------------------------------
Problem: modules only work in checkout: cannot work effectively and
  simply with repository subsets
Solution: keep sticky modules setting

The CVS implementation of modules works for checkout. However, once
checked out, the module setting does not persist. Developers want to get
any new directories added to the same module with 'cvs up -d', but doing
this with plain CVS will also bring in directories excluded from the
module.

This feature is very important for very large projects, such as BSD or
Mozilla, where one would typically want to check out only a subset of
the project.

At Jungo, a full checkout is around 5 GB. A typical partial checkout is
less than 1 GB. This means that this feature makes 'cvs checkout' (and
subsequently 'cvs update') about five times faster!
It also saves disk space, since many engineers have multiple checked-out
trees on their PCs....

Since "sticky modules" is such an important feature for huge projects,
the Mozilla project wrote a whole wrapper ABOVE cvs that does exactly
this:
http://www.csie.ntu.edu.tw/~piaip/docs/CreateMozApp/mozilla-app-a.html
There is a Makefile script called "client.mk" that is a wrapper ABOVE
cvs that checks out and updates subsets of the Mozilla CVS trees.

The required behavior for this feature is similar to CVSNT's modules2
feature. The problem with CVSNT's implementation is that modules2 was
not designed just to enable subset checkouts and updates; it also has a
"smart" capability of virtually changing the names of sub-directories.
This is NOT a good thing. It is similar to the "smart" feature of many
SCM systems that allows renaming and moving files and subdirectories.
All these types of "smart" features have a bad impact on the logical
correctness of an SCM.

The current implementation is not client-transparent. It requires
patched CVS clients built with:
  #define STICKY_MODULES 1
Older CVS clients will still be able to work with the CVS server, but
the modules behavior will be as with vanilla CVS: non-sticky.

The sticky-modules feature keeps the module name in a file named
'Module', in the CVS directory together with Root, Repository, Entries,
and friends. Like "Tag", it is there only when a sticky module is used.

We already published the code for this specific feature as a patch to
CVS-1.11.9, in:
http://ximbiot.com/cvs/wiki/index.php?title=CVS_FAQ#keeping_.27cvs_up_-d.27_with_modules_persistant

We have come up with a new design, which we have not yet implemented,
for making the sticky modules feature client-transparent (thus not
changing the Client/Server protocol, and not requiring patched CVS
clients to use this feature). We believe the implementation would be no
more than 100-200 lines of patch to the CVS source code.
If anyone would like to volunteer to implement this, please contact us
and we will provide the design and guidelines.

TESTS: no tests were added.
CONFIG: src/cvs.h:
          #define STICKY_MODULES 1
        CVSROOT/config:
          StatusShowModule=y  --> makes 'cvs status' also print out
                                  the sticky module name.
COMPATIBILITY: patched CVS clients compiled with STICKY_MODULES defined.
PROTOCOL: a Module command was added to the CVS Client/Server protocol.

-----------------------------------------------------------------
Problem: 'cvs import' command does not work correctly
Solution: alternative application...

We added (see later on in the email) an external application that
implements "cvs import" differently (we call it "jcvs import").
Since we wanted people NOT to use the built-in "cvs import" and only use
"jcvs import" (since the logic of the built-in import is totally wrong),
we added to cvs an "import disable" feature, to prevent users from
mistakenly using the built-in "cvs import".

CONFIG: CVSROOT/config:
          DisableImport=y

=======================================================================
Beyond the improvements inside the CVS source code, we also implemented
some improvements as an external application above CVS (a wrapper). We
called it jcvs (Jungo CVS), and it calls CVS when needed (for low-level
CVS operations).
The source code for this application is completely separate from CVS
and is not part of this patch. We did not "productize" it (sanity
tests, documentation etc...), since for the time being we use it only
internally at Jungo.
It is similar in concept to the "cvsu" application, which is a kind of
off-line cvs client: "jcvs" complements CVS with commands that do not
exist in CVS, or that can be improved.
If anyone is interested in productizing this, please contact us, and we
will provide the source code under the GPL.

-----------------------------------------------------------------
Problem: 'cvs import' works incorrectly
Solution: alternative application: 'jcvs import'

The 'cvs import' implementation has many logical bugs.
It should really behave exactly like "cvs commit": you can first review
what is going to get into the repository (which files will be added,
removed, or changed, and the diff of the changes), then you "approve"
this by doing a "checkin" operation (i.e. "cvs commit"), and then the
server must pass it through all the regular validations that regular
commits go through. There is no reason that imported code should not
pass the same validations as regularly committed code.

Then comes the question of which branch this code will be imported to:
Why should imported code not have a symbolic branch name, which the user
can select? Why does it have to be this cryptic 1.1.1.x branch?!
And what if the user wants to import directly into the main branch
(HEAD) or into a different branch of their choosing?

All RCS revision numbers follow a very straightforward logic: the
revisions are built like a tree, 1.x being the trunk, with 1.1.2.x,
1.1.4.x, 1.1.6.x etc., for example, being branched out of 1.1.
This logic means that if we want the newest revision on a branch, we
need to find out the branch's number (1.x for HEAD, or 1.1.2.x for
example for a certain branch) and find the highest "x" value of an
available revision. Simple? Yes. And NO...: "cvs import" breaks all
this: for the newest revision of HEAD you need the latest 1.x, but if
1.2 does not exist and 1.1.1.x does exist, then you have an exception
where you need to take the newest revision of 1.1.1.x! STRANGE!

There are many documented bugs relating to "cvs import". Another
example, from the official CVS manual:
http://ximbiot.com/cvs/manual/cvs-1.11.22/cvs_13.html
  "WARNING: If you use a release tag that already exists in one of the
  repository archives, files removed by an import may not be detected."

Why do all these problems exist? Because the "cvs import" concept is
wrong.

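To make the revision-numbering rule and its "cvs import" exception
concrete, here is a minimal Python sketch. It encodes only the simplified
rule as stated in this mail, not how CVS decides internally; the function
names (parse_rev, newest_on_branch, newest_on_head) are hypothetical and
exist only for this illustration.

```python
def parse_rev(rev):
    """Turn a revision string such as '1.1.2.3' into a tuple (1, 1, 2, 3)."""
    return tuple(int(part) for part in rev.split("."))


def newest_on_branch(revisions, branch):
    """Return the revision <branch>.x with the highest x, or None if none exists."""
    prefix = parse_rev(branch)
    candidates = []
    for rev in revisions:
        parsed = parse_rev(rev)
        # A revision on the branch has exactly one more component than the branch.
        if len(parsed) == len(prefix) + 1 and parsed[:len(prefix)] == prefix:
            candidates.append(parsed)
    if not candidates:
        return None
    # Tuples compare numerically, so 1.10 correctly sorts after 1.9.
    return ".".join(str(n) for n in max(candidates))


def newest_on_head(revisions):
    """HEAD lookup with the exception described above (simplified assumption):
    if 1.2 does not exist but 1.1.1.x does, the newest 1.1.1.x wins."""
    has_1_2 = any(parse_rev(rev) == (1, 2) for rev in revisions)
    vendor = newest_on_branch(revisions, "1.1.1")
    if not has_1_2 and vendor is not None:
        return vendor
    return newest_on_branch(revisions, "1")
```

For example, with revisions 1.1, 1.1.1.1 and 1.1.1.2 the HEAD lookup
returns 1.1.1.2, while adding a 1.2 revision switches it back to the
plain "highest 1.x" rule.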
What is needed is for "cvs import" to behave as a tool that adds local
CVS control files to a given subdirectory that is not yet version
controlled. The procedure for importing a package (let's assume
linux-2.6.18.tgz) would be something like this:
  $ tar xvzf linux-2.6.18.tgz
  $ cd linux
  $ jcvs import -b linux-original project/os/linux
  $ cvs commit -m "import Linux 2.6.18 from kernel.org"
  $ cvs tag linux-2_6_18
and if we want this to be merged to HEAD, we will then also do:
  $ cvs up -A
  $ cvs up -j HEAD -j linux-2_6_18
    (or may do "cvs up -j HEAD -j linux-original")
  $ cvs commit -m "merge Linux 2.6.18"

This is simple logic that fixes all the problems with the regular "cvs
import".
So: what behavior is required from this new "jcvs import" to make the
above sequence work?
It must add CVS control files (CVS/Root, CVS/Entries, CVS/Repository) in
a given subtree that DOES NOT have any CVS control directory. It needs
to mark files as "A" (add) if they do not yet exist on the branch (the
-b option supplied to the "jcvs import" command), as "M" (modified) if
they already exist on the branch but their contents do not match, or as
"R" (remove) if they do not exist in the subtree but DO exist in the
repository on that branch.
In order to prepare the files with "A", "M" and "R" for "cvs commit",
note that 'jcvs import' may also need to create empty directories on the
CVS server (by doing "cvs add" on the required missing directories).
This means that the combination of "jcvs import" followed by "cvs
commit" brings the CVS repository in sync with a tarball!

To sum up this feature: if you have a subtree of files and you want a
subdirectory in the repository on a certain branch to be 100% in sync
with this existing subtree, you do "jcvs import".

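The A/M/R classification described above can be sketched in Python as
follows. This is only an illustration, not the jcvs source: the function
names and the branch_digests input (a mapping from repository-relative
path to a SHA-256 digest of the file on the target branch) are
assumptions made for this example, and creating the CVS control files or
the missing server directories is left out.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify_import(local_root, branch_digests):
    """Compare a local subtree with the files already on the branch.

    branch_digests: {relative path: sha256 hex digest} (hypothetical input).
    Returns {relative path: 'A' | 'M' | 'R'}; unchanged files are omitted.
    """
    states = {}
    seen = set()
    for dirpath, dirnames, filenames in os.walk(local_root):
        # Skip CVS control directories, in case any are present.
        dirnames[:] = [d for d in dirnames if d != "CVS"]
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, local_root).replace(os.sep, "/")
            seen.add(rel)
            if rel not in branch_digests:
                states[rel] = "A"      # not on the branch yet: add
            elif file_digest(full) != branch_digests[rel]:
                states[rel] = "M"      # on the branch, contents differ
    for rel in branch_digests:
        if rel not in seen:
            states[rel] = "R"          # on the branch, missing locally: remove
    return states
```

The resulting mapping is what a following "cvs commit" would act on:
additions, modifications, and removals that together bring the branch in
sync with the local subtree.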
We also believe that in the long term, the original "cvs import" code
should be removed and replaced with code that behaves exactly like
Jungo's "jcvs import", since this is the logically correct behavior for
"cvs import"!

-----------------------------------------------------------------
Problem: slow 'cvs update'
Solution: 'jcvs update': DB-based delta fetching using an external
  application...

'cvs update' of Jungo's main CVS tree involves going over ~250K files.
The average number of commit sessions (commitid) per branch per day is
less than 10. The number of whole-tree updates on this tree is in the
hundreds.

At Jungo, as in most CVS server setups, we have ViewCVS installed, so
each commit is recorded in the ViewCVS DB (ViewCVS was lately renamed to
ViewVC for political correctness...).
We have created a wrapper application that keeps a timestamp of the last
update and tries to get the delta from the DB when feasible.
So when you do 'jcvs update' it sends an SQL query to ViewCVS to get the
list of files modified on the required branch since the last 'jcvs
update' (or since the 'cvs checkout' of the tree, if 'jcvs update' has
not yet been run on this local tree).
But what about locally modified files? Well, it also runs the "cvsu"
application to get the list of local changes.
It merges the list of server changes (ViewCVS) and local changes
("cvsu"), and then it calls 'cvs update' with the SPECIFIC list of
files.

The result: 'jcvs update' for 250K files takes under 1 minute, instead
of the over 40 minutes that a regular 'cvs update' takes.
The 1 minute is mainly the time it takes 'cvsu' to scan all the local
files to detect modifications. This means the load on the CVS server and
the network traffic are very low!

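The merge step of 'jcvs update' can be sketched in Python as follows.
This is an illustration under assumptions, not the jcvs source: the
function name plan_update is hypothetical, the two input lists stand in
for the results of the ViewCVS query and of "cvsu", and the splitting
into batches is an addition of this sketch (to keep command lines short),
not something described above.

```python
def plan_update(server_changed, locally_modified, batch_size=500):
    """Merge both change lists and return the 'cvs update' commands to run.

    server_changed:   paths changed on the branch since the last update
                      (assumed to come from the commit database query).
    locally_modified: paths changed in the working tree
                      (assumed to come from the local scan).
    Returns a list of argv lists; an empty list means nothing to update.
    """
    # A file changed on both sides must appear only once.
    targets = sorted(set(server_changed) | set(locally_modified))
    return [
        ["cvs", "update"] + targets[start:start + batch_size]
        for start in range(0, len(targets), batch_size)
    ]
```

Each returned argv list could then be handed to subprocess.run(); with
fewer than 10 commit sessions per branch per day, the list is usually
tiny compared with the ~250K files a plain 'cvs update' has to visit.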
=======================================================================
Soft Tagging design details
---------------------------------------

The problems
------------
The tagging operation in CVS uses each file's RCS file to store the
tags and branches defined for the file.
- Currently we have ~250,000 files in our main CVS repository. Each tag
  operation needs to write to all these files. This takes hours to
  complete. During this time CVS performance is seriously degraded.
- Having all the tags in the files bloats the files by an order of
  magnitude. We found that a typical file has 50KB worth of tags in it,
  and on average 5KB of real content.

Solution summary
----------------
- We use 'soft tagging'
  - We define soft tags and branches by writing the tag name with
    branch & date in the file CVSROOT/stags.
  - We avoid writing the tag to RCS files, so tag time is reduced to
    seconds.
- CVS is updated to look for symbolic tag names in the soft tags list as
  well as in the RCS list of symbols.
- Legacy hard tags can later be converted to soft tags at our leisure.
  This requires analyzing the tags on the files to make sure that all
  the revisions tagged with the tag share a single point in time on a
  specific branch (or, for that matter, a contiguous series of points,
  i.e. a specific period of the branch lifetime).

The solution
------------
- Keep in file CVSROOT/stags the list of soft tags:
  - format of line:
    for tags
      T <tag name> <branch>:<date UTC in RCS format> <repository>
    for branches
      B <tag name> <branch>:<date UTC in RCS format> <repository>
  - example:
      tag-4_0_5 branch-4_0 2007.05.28.11.04.19 # version 4.0.5
  - This defines the tag in terms of a date on the branch. This allows
    us to replace the writing to all the files with a single update in
    this file. Soft tagging takes seconds, instead of the hours that
    hard tagging takes.
  - Instead of branch+date, the soft-tag can use the name of another
    tag; this serves as an aliasing mechanism.
  - The repository field is used to make sure that the tag is used only
    where it is supposed to be used. It can contain either the name of a
    top-level directory, e.g. 'seamonkey', or a path within it, like
    'seamonkey/browser/components/bookmarks/'.
  - Blocking users from doing 'cvs tag' outside the designated
    repository can be done by adding a filter in CVSROOT/taginfo.
- Users can define tags as soft tags by adding a line for the tag in
  CVSROOT/stags. Optionally, one can choose (as done at Jungo) to
  auto-create this file based on the contents of an external system
  defining the tags and versions used (Jungo uses the same system to
  define the names of tags and branches shown in our Bugzilla-based
  issue-tracking system).
- CVS update is modified to handle selection by soft tags - see
  "Selecting file version by tag" below.
- CVS commit is changed to do hard-branching only where necessary. See
  "Doing soft branching".

Selecting file version by tag
-----------------------------
For each file, we select the version during update using the logic
described below. Since checkout is based on update, this covers it as
well. This is implemented in the CVS code in file src/rcs.c, function
translate_symtag().
- if the parameter to "-r" matches the 'dot-separated numbers' format:
  - return the version it specifies
- if the symbolic tag is in RCS
  - use the definition from RCS
- if the tag is in CVSROOT/stags, look up the matching version on the
  branch
  - if a matching version is found, return that version
- return NULL

Doing soft branching
--------------------
- On commit:
  - if the file has a sticky tag set which is a soft tag of type branch,
    and it is not found in the file's RCS data:
    - do hard tagging of the file, using the version specified in the
      stags, i.e. add a branch for that soft-branch to the RCS file.
  - continue as usual
- This ensures that branch tags are added to RCS files only when
  necessary. This way not only soft tagging is immediate but also
  soft-branching.

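The lookup order listed under "Selecting file version by tag" can be
sketched in Python as follows. This is an illustration under stated
assumptions, not the code in src/rcs.c: the function names, the
dictionary layouts, and the choice of "latest revision on the branch not
newer than the tag date" as the matching rule are assumptions of this
sketch. It parses the T/B line format given above, and it does not model
tag aliasing, the repository field check, or dead revisions.

```python
import re
from datetime import datetime

# A numeric revision such as 1.4 or 1.3.2.1 ("dot-separated numbers").
NUMERIC_REV = re.compile(r"^\d+(\.\d+)+$")


def parse_stags(text):
    """Parse lines of the form: T|B <tag> <branch>:<date> <repository>."""
    tags = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        kind, name, target, repository = line.split()
        branch, date = target.split(":", 1)
        tags[name] = {
            "kind": kind,
            "branch": branch,
            "date": datetime.strptime(date, "%Y.%m.%d.%H.%M.%S"),
            "repository": repository,
        }
    return tags


def translate_tag(tag, rcs_symbols, branch_history, soft_tags):
    """Resolve a "-r" argument to a revision for one file, or None.

    rcs_symbols:    {tag: revision} hard tags stored in the RCS file.
    branch_history: {branch: [(datetime, revision), ...]} for this file.
    soft_tags:      result of parse_stags().
    """
    if NUMERIC_REV.match(tag):
        return tag                              # explicit revision number
    if tag in rcs_symbols:
        return rcs_symbols[tag]                 # hard tag overrides soft tag
    soft = soft_tags.get(tag)
    if soft is not None:
        older = [(when, rev)
                 for when, rev in branch_history.get(soft["branch"], [])
                 if when <= soft["date"]]
        if older:
            return max(older, key=lambda item: item[0])[1]
    return None
```

Because the hard-tag lookup comes before the soft-tag lookup, a hard tag
placed on a couple of files overrides the repository-wide soft tag, which
is the mixed-mode behavior described earlier in this mail.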
The RCS file needs to be written to only when a commit to a file
requires a real revision on the branch to be created.

=======================================================================
Any comments and feedback are welcome.

Final notes:
------------------
- I will post a patch with all our CVS changes (diff vs. GNU CVS
  1.12.13) to the CVS project on the GNU Savannah site
  (https://savannah.nongnu.org/patch/?group=cvs).
- Work on CVS improvements at Jungo began in 2003. People involved,
  apart from me, included Derry Shribman & Or Tal.

Yaron Yogev <[EMAIL PROTECTED]>
Jungo LTD
_______________________________________________
Bug-cvs mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/bug-cvs
