CVS soft-tagging + other problems & solutions

yarony Wed, 15 Aug 2007 04:40:32 -0700

Hello,

Following is a rather lengthy discussion of problems that we at Jungo LTD
have faced with using CVS, and the solutions we used. We go into particular
detail on a new concept we call 'soft-tagging'. This concept was such a
success with our users that we feel obliged to spread the idea.


This first mail generally describes the issues encountered and the solutions
we chose.

Jungo is using CVS for its SCM, handling ~40 GB of data, in ~70K directories,
~300K files, and ~3000 tags/branches. We have ~100 developers using CVS,
automated build-and-test hosts using CVS, and all our infrastructure servers
using CVS as well to automatically fetch their configuration.

Jungo also uses an automated merging tool to merge bug fixes and features
between product branches.

We have encountered the following problems with CVS:
- Slow checkout times: over 1.5 hours.
- Slow update times: over 40 minutes.
- Slow tagging and branching: over 12 hours!!!
- Automated merging failed on some delicate cases due to subtle bugs in CVS
  logic:
  - Binary files
  - Adding a file to MAIN trunk, then to a branch.
- A full checkout of the repository takes 5 GB, while a typical engineer only
  needs a 1 GB subset of the repository.

We reviewed the options of moving to another open-source SCM system (such as
Subversion), or to a commercial tool (such as ClearCase).  In the end we
decided that the internal design of CVS is actually the best design, so we
preferred to just improve the (good) basic design of CVS, rather than move to
a completely new tool which has a lot of design problems.

All the solutions implemented by Jungo are 100% server side, and thus
transparent to CVS clients, so there is no need to update the CVS clients
working with the server in order to make use of the new features.

The only feature which requires a patched client in order to work is the
Sticky Modules feature (which we will describe later on), but for this
feature too we have now come up with a new design that, if implemented, will
make the Sticky Modules feature transparent to CVS clients. A volunteer from
the CVS community would be most welcome to receive the design document from
us and implement it.

Also, all the features have configuration flags to enable/disable them, to
keep behavior compatible with previous versions of CVS.

In addition, many sanity tests were added to validate the behavior of these
features.

After we implemented all the improvements, we ended up with a CVS server
system that performs very fast and in a highly reliable manner.

Following are the solutions used by Jungo to handle these problems:

-----------------------------------------------------------------
Problem: slow tagging and branching
Solution: immediate tagging and branching

Looking at the RCS files on the server, we found that on average the
tag/branch symbols info accounted for over 70% of the contents of the RCS
files! 3000 tags take a lot of room compared to a small typical C source
file...

The soft-tagging/soft-branching feature allows one to define tags/branches
using a BRANCH+DATE setting. This is a tremendous step forward for the CVS
server, allowing it to outperform any other SCM system in tagging/branching
time.  It also makes it possible, for the first time in CVS, to keep accurate
track of when a tag was logically made. Before soft-tagging, one could only
try to guess the tag's date by looking at all the files where the tag was
included.

This method works because, for over 95% of the tags/branches in a typical
project, the tag/branch corresponds to the same BRANCH+DATE for all its
files.
The soft-tagging/soft-branching feature still leaves intact the regular
manual tagging (now referred to as 'hard-tags' and 'hard-branches'...).

Soft-tagging is designed to live well side-by-side with hard-tagging. It even
allows using mixed-mode tags: if you added a soft-tag for the whole
repository (300K files) and now you want to manually move the tag on 2 of the
files, you can simply do a hard-tag on these two files to override the
soft-tag!

The implementation originates in code from the newtags2 patch available on
the Savannah.gnu.org CVS server, extending the concept of built-in
(calculated) tags.

This is also related to the slow check-out problem, discussed below.

The design outline is at the end of this email.

TESTS: sanity.sh test name: soft-tags.
CONFIG: CVSROOT/stags: define here the soft-tags and soft-branches.
COMPATIBILITY: all clients

-----------------------------------------------------------------
Problem: merging binary files reports a conflict even when they are identical
Solution: corrected behavior in merging binary identical files

We understand that binary files cannot be merged when there are REAL
conflicts. The problem is that CVS does not handle binary file merges even
when the files are 100% identical and there is no real conflict!  For text
files, if you do 'cvs update' and the server has a new revision which is
identical to your locally modified revision, cvs will say: "Changes already
in file".

So we also made the binary file merge check whether the locally modified file
is 100% identical to the newly merged file. If so - it will allow the merge.
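
As an illustration only (this is not the code in the patch), the identity
check described above can be sketched in Python. The function names
binaries_identical and merge_decision and the returned strings are
hypothetical; the only point is that the comparison is byte-for-byte.

```python
import filecmp


def binaries_identical(local_path, merged_path):
    """Return True when both files contain exactly the same bytes."""
    # shallow=False forces a content comparison instead of trusting
    # the os.stat() signature (file type, size, mtime).
    return filecmp.cmp(local_path, merged_path, shallow=False)


def merge_decision(local_path, merged_path):
    """Hypothetical helper: allow a binary merge only for identical files."""
    if binaries_identical(local_path, merged_path):
        return "allow"      # nothing to merge, the change is already there
    return "conflict"       # a real conflict, binary data cannot be merged
```

A real server would compare the working file against the revision being
merged in; here both are assumed to be available as plain files.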

TESTS: sanity.sh test name: binarymerge
CONFIG: CVSROOT/config: MergeBinaryIdentical=y MergeAddToNonDead=y
COMPATIBILITY: all clients

-----------------------------------------------------------------
Problem: 'cvs up -j -j' incorrect behavior when adding a file to trunk,
  and later on to branch.
Solution: fixed logical bug related to the way CVS creates revisions
  for new files on trunk and branches.

Jungo's automated merging system requires 'cvs up -j -j' to always work 100%
correctly. With plain CVS it worked correctly 98% of the time, leaving a few
small issues handled incorrectly.
This issue is fixed by always starting with a dead revision - possibly having
revision 1.1 dead. CVS 1.12.13 added a dead revision on the branch, which
partly solves this issue, but still left some cases uncovered.

'cvs update -rBRANCH:DATE' and 'cvs up -j -j' do not work correctly in
relation to branches.

Example:
Jan-07: Before this file was ever added, we tag the whole tree:
cvs tag -b Jan-branch

Feb-07: We add a new file (cvs add) to trunk:
1.1 Feb-07 alive

Now we do 'cvs tag -b Feb-branch':
1.1.0.2 alive (Feb-branch)

Mar-07: Then continue work on trunk:
1.2 Mar-07 alive

Apr-07: modify on Feb-branch:
1.1.2.1 Apr-07 alive (Feb-branch)
and on Jan-branch (we add the file: cvs add):
1.2.0.2 Jan-branch
1.2.2.1 Apr-07 alive (Jan-branch)

Problem:
Revision 1.2 is not really the parent of 1.2.2.1. The real parent should be
1.0 (not that such a revision exists...). A simple way to see this problem:
'cvs up -rJan-branch:1-Mar-07' will retrieve revision 1.2 (live), instead of
a dead revision!
Solution: Always branch out pre-existing branches from BEFORE the first live
revision on the trunk: always make sure 1.1 is dead (CreateTrunkDeadRev) and
always branch out pre-existing branches from 1.1 (BranchFromBaseRev).

Since 'cvs up' gives incorrect results, 'cvs up -j -j' will give incorrect
results as well.

And what about cases where the RCS file already has a live 1.1, because the
RCS ,v file was created by an older version of the CVS server, before this
issue was fixed?
The solution is to branch from 1.1, to create a DEAD revision on the branch
(1.1.2.1 dead and 1.1.2.2 live), and to give the 1.1.2.1 revision the same
date as the 1.1 revision. This 'backwards compatibility' hack solves the
problem for existing, pre-patch CVS repositories.

TESTS: sanity.sh test name: createdeadrev
CONFIG: CVSROOT/config:
  CreateTrunkDeadRev=y
  BranchFromBaseRev=y
  MergeAddToNonDead=y
COMPATIBILITY: all clients

-----------------------------------------------------------------
Problem: doing 'cvs ci' after 'cvs rm' does not print out the revision being
  added to the RCS file
Solution: allow printing out the newly added revision for the remove
  operation, to be compatible with the rest of the operations.

It is hard to guess the version for a file when it is removed, especially in
the case when removing a file that had no real revision on the branch.

The 'cvs ci' message for a typical commit is:
cvs_repository/dir/f1,v  <--  f1
new revision: 1.3.2.1; previous revision: 1.3

The 'cvs ci' message for 'cvs rm' is:
cvs_repository/dir/f1,v  <--  f1
new revision: delete; previous revision: 1.3.2.2

Jungo's automated merging system relies on the output of 'cvs ci' to learn
the revision numbers added to the RCS file, so it is critical to have
'cvs ci' print out the revision for ALL operations. The modified format:
cvs_repository/dir/f1,v  <--  f1
new revision: delete 1.3.2.3; previous revision: 1.3.2.2
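
To show why the extra revision number matters to a tool that scrapes this
output, here is a small Python sketch of such a parser. It is an assumption
about how a merge tool might read the lines quoted above, not Jungo's actual
parser; parse_commit_line is a hypothetical name.

```python
import re

# A revision is two or more dot-separated numbers, e.g. 1.3 or 1.3.2.1.
_REV = r"\d+(?:\.\d+)+"

# Accepts the three "new revision" lines quoted above:
#   new revision: 1.3.2.1; previous revision: 1.3
#   new revision: delete; previous revision: 1.3.2.2
#   new revision: delete 1.3.2.3; previous revision: 1.3.2.2
_LINE = re.compile(
    r"^new revision: (?P<delete>delete)?\s*(?P<new>" + _REV + r")?; "
    r"previous revision: (?P<prev>" + _REV + r")$"
)


def parse_commit_line(line):
    """Return a dict describing the commit line, or None if it does not match."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    return {
        "deleted": m.group("delete") is not None,
        "new": m.group("new"),        # None for the stock 'delete' format
        "previous": m.group("prev"),
    }
```

With the stock format the "new" field is None for removals, which is exactly
the gap that CommitShowDeletedVer closes.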

CONFIG: CommitShowDeletedVer=y
COMPATIBILITY: all clients

-----------------------------------------------------------------
Problem: modules only work in checkout: cannot work effectively and simply
  with repository subsets
Solution: keep sticky modules setting

The CVS implementation of modules works for the checkout. However, once
checked out, the module setting does not persist. Developers want to get any
new directories added to the same module with 'cvs up -d', but doing this
with plain CVS will also bring in directories excluded from the module.

This feature is very important for very large projects, such as BSD or
Mozilla, where one would typically want to check out only a subset of the
project. In Jungo, a full checkout is around 5GB. A typical partial checkout
is less than 1GB. This means that this feature improves the 'cvs checkout'
speed (and subsequently the 'cvs update' speed) roughly fivefold!

It also saves disk space! - Many engineers have multiple checked out trees on
their PCs....  Since "sticky modules" is such an important feature for huge
projects, the Mozilla project wrote a whole wrapper ABOVE cvs that does
exactly this:
http://www.csie.ntu.edu.tw/~piaip/docs/CreateMozApp/mozilla-app-a.html
There is a Makefile script called "client.mk" that is a wrapper ABOVE cvs
that checks out and updates subsets of Mozilla CVS trees.

The required behavior for this feature needs to be similar to CVSNT's
modules2 feature. The problem with CVSNT's implementation is that modules2
was not just designed to enable subset checkouts and updates, but it also has
a "smart" capability of virtually changing names of sub-directories.
This is NOT a good thing. It is similar to the "smart" feature of many SCM
systems that allows renaming and moving files and subdirectories. All these
types of "smart" features have a bad impact on the logical correctness of an
SCM.

The current implementation is not client-transparent. It requires patched CVS
clients with:
#define STICKY_MODULES 1
Older CVS clients will still be able to work with the CVS server, but the
modules behavior will be as with vanilla CVS: non sticky.

The sticky-modules feature keeps the module name in a file named 'Module', in
the CVS directory together with Root, Repository, Entries, and friends.
Like "Tag", it is there only when a sticky module is used.

We already published the code for this specific feature as a patch to
CVS-1.11.9, in:
http://ximbiot.com/cvs/wiki/index.php?title=CVS_FAQ#keeping_.27cvs_up_-d.27_with_modules_persistant

We have come up with a new design, which we have not yet implemented, for
making the sticky modules feature client transparent (thus not changing the
Client/Server protocol, and not requiring patched CVS clients to utilize this
feature). We believe the implementation of this would be no more than a
100-200 line patch to the CVS source code.  If anyone would like to volunteer
to implement this - please contact us and we will provide the design and
guidelines.

TESTS: no tests were added.
CONFIG:
  src/cvs.h: #define STICKY_MODULES 1
  CVSROOT/config: StatusShowModule=y  --> makes 'cvs status' also print out
  the sticky module name.
COMPATIBILITY: patched CVS clients compiled with STICKY_MODULES defined.
PROTOCOL: a Module command was added to the CVS Client/Server protocol.

-----------------------------------------------------------------
Problem: 'cvs import' command does not work correctly:
Solution: alternative application...

We added (see later on in the email) an external application that implements
"cvs import" differently (we call it "jcvs import").

Since we wanted people NOT to use the built-in "cvs import" and only use
"jcvs import" (since the logic of the built-in import is totally wrong), we
added to cvs an "import disable" feature, to prevent users from mistakenly
using the built-in "cvs import".

CONFIG: CVSROOT/config:
  DisableImport=y

=======================================================================

Beyond the improvements inside the CVS source code, we also implemented some
improvements as an external application above CVS (a wrapper).  We called it
jcvs (Jungo CVS), and it calls CVS when needed (for low level CVS
operations).

The source code for this application is completely separate from CVS, and is
not part of this patch. We did not "productise" it (sanity tests,
documentation etc...), since for the time being we use it only internally in
Jungo.

It is similar in concept to the "cvsu" application, which is a kind of
off-line cvs client: "jcvs" complements CVS with commands that do not exist
in CVS, or that can be improved.

If anyone is interested in productizing this, please contact us, and we will
provide the source code under GPL.

-----------------------------------------------------------------
Problem: 'cvs import' works incorrectly
Solution: alternative application: 'jcvs import'

The 'cvs import' implementation has many logical bugs. It should really
behave exactly like "cvs commit": you can first review what is going to get
into the repository (which files will be added, removed or changed, with the
ability to see the diff of the changes), then you "approve" this by doing a
"checkin" operation (i.e. "cvs commit"), and then the server must pass it
through all the regular validations that regular commits go through.

There is no reason why imported code should not pass the same validations as
regular committed code.  Then comes the question of which branch this code
will be imported to: why should imported code not have a symbolic branch
name, which the user can select?  Why does it have to be this cryptic 1.1.1.x
branch?!

And what if the user wants to import directly into the main branch (HEAD) or
into a different branch of his selection?
All RCS revision numbers have a very straightforward, simple logic: the
revisions are built like a tree, 1.x being the trunk, then for example
1.1.2.x, 1.1.4.x, 1.1.6.x etc. being branched out of 1.1. This logic means
that if we want the newest revision on a branch, we need to find out the
branch's number (1.x for HEAD, or for example 1.1.2.x for a certain branch)
and find the highest "x" value of an available revision. Simple? Yes. And
NO...: "cvs import" breaks all this: for the newest revision of HEAD you need
the latest 1.x, but if 1.2 does not exist and 1.1.1.x does exist, then you
have an exception where you need to take the newest revision of 1.1.1.x!
STRANGE!
There are many bugs documented relating to "cvs import". Another example,
from the official CVS manual:
http://ximbiot.com/cvs/manual/cvs-1.11.22/cvs_13.html
"WARNING: If you use a release tag that already exists in one of the
repository archives, files removed by an import may not be detected. "
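
The revision-number rule and its "cvs import" exception, as described above,
can be sketched in Python as follows. This follows the description in this
email only, not the actual CVS source (which we have not reproduced here);
the function names are hypothetical.

```python
def parse_rev(rev):
    """Turn '1.1.2.10' into (1, 1, 2, 10) so revisions compare numerically."""
    return tuple(int(part) for part in rev.split("."))


def newest_on_branch(revisions, branch):
    """Newest revision directly on 'branch' ('1' for trunk, '1.1.2', ...)."""
    prefix = parse_rev(branch)
    on_branch = [
        r for r in map(parse_rev, revisions)
        if len(r) == len(prefix) + 1 and r[:-1] == prefix
    ]
    if not on_branch:
        return None
    return ".".join(str(n) for n in max(on_branch))


def head_revision(revisions):
    """Newest HEAD revision, including the import exception described above."""
    trunk = newest_on_branch(revisions, "1")
    vendor = newest_on_branch(revisions, "1.1.1")
    # The exception: 1.2 does not exist but 1.1.1.x does.
    if trunk == "1.1" and vendor is not None:
        return vendor
    return trunk
```

Without the special case in head_revision, a single call to newest_on_branch
would be enough for every branch, which is the simplicity argued for above.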

Why do all these problems exist? Because the "cvs import" concept is wrong.
What is needed is for the "cvs import" feature to behave as a tool that adds
local CVS control files to a given subdirectory that is not version
controlled.

The procedure for importing a package (let's assume linux-2.6.18.tgz) would
be something like this:
$ tar xvzf linux-2.6.18.tgz
$ cd linux
$ jcvs import -b linux-original project/os/linux
$ cvs commit -m "import Linux 2.6.18 from kernel.org"
$ cvs tag linux-2_6_18

and if we want this to be merged to HEAD, we will then also do:
$ cvs up -A
$ cvs up -j HEAD -j linux-2_6_18
  (or may do "cvs up -j HEAD -j linux-original")
$ cvs commit -m "merge Linux 2.6.18"

This is simple logic, that fixes all the problems with the regular "cvs
import". So: what is the behavior required from this new "jcvs import" to
make the above sequence work?

It is required to add CVS control files (CVS/Root, CVS/Entries,
CVS/Repository) in a given sub tree that DOES NOT have any CVS control
directory.  It needs to mark files as "A" (add) if they do not yet exist on
the branch (the -b option supplied to the "jcvs import" command), as "M"
(modified) if they already exist on the branch but their contents do not
match, or as "R" (remove) if they do not exist in the sub-tree but DO exist
in the repository on that branch.  In order to be able to prepare the files
with "A", "M" and "R" for "cvs commit", notice that 'jcvs import' may also
need to create empty directories on the CVS server (by doing "cvs add" for
the required missing directories).
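
The classification step can be illustrated with a short Python sketch. This
is not the jcvs source (which is not part of this mail); classify_import is a
hypothetical name, and both trees are assumed to be already available as
dictionaries mapping a relative path to the file contents.

```python
def classify_import(local_files, branch_files):
    """Return {path: 'A' | 'M' | 'R'}; unchanged files are left out.

    local_files:  path -> bytes, the unpacked tarball (no CVS directories)
    branch_files: path -> bytes, the live files on the target branch
    """
    status = {}
    for path, content in local_files.items():
        if path not in branch_files:
            status[path] = "A"          # new file, needs 'cvs add'
        elif content != branch_files[path]:
            status[path] = "M"          # exists on branch, contents differ
    for path in branch_files:
        if path not in local_files:
            status[path] = "R"          # gone from the tarball, 'cvs rm'
    return status
```

Committing exactly these three sets is what leaves the branch in sync with
the imported tree.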

This means that doing the combination of "jcvs import" and then "cvs commit"
brings the CVS repository in-sync with a tarball!

To sum up this feature: if you have a subtree of files and you want a subdir
in the repository on a certain branch to be 100% in sync with this existing
subtree, you do "jcvs import". We also believe that in the long term, the
original "cvs import" code should be removed, and replaced with code that
behaves exactly like Jungo's "jcvs import" - since this is the logically
correct way for "cvs import" to behave!

-----------------------------------------------------------------
Problem: slow 'cvs update'
Solution: 'jcvs update': DB-based delta fetching using external application...

'cvs update' of the Jungo main CVS tree involves going over ~250K files.
The average number of commit sessions (commitid) per branch per day is less
than 10.  The number of whole tree updates per day on this tree is in the
hundreds.

In Jungo, like in most CVS server setups, we have ViewCVS installed.
So each commit is recorded in the ViewCVS DB (lately renamed to ViewVC for
political correctness...).

We have created a wrapper application that keeps a timestamp of the last
update and tries to get the delta from the DB when feasible.

So when you do 'jcvs update' it sends an SQL query to ViewCVS to get the list
of files modified on the required branch since the last 'jcvs update' (or
since the 'cvs checkout' of the tree, if 'jcvs update' was not run yet on
this local tree).

But what about locally modified files? Well, it also runs the "cvsu"
application to get the list of local changes.

It merges the list of server changes (ViewCVS) and local changes ("cvsu"),
and then it calls 'cvs update' with the SPECIFIC list of files.

The result: 'jcvs update' for 250K files takes under 1 minute, instead of the
over 40 minutes that regular 'cvs update' calls take.  The 1 minute is mainly
due to the time it takes 'cvsu' to scan all the local files to detect
modifications. This means the load on the CVS server and the network traffic
are very low!
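
The merge step of 'jcvs update' described above can be sketched in Python.
This is a sketch under assumptions, not the jcvs source: the two input lists
are taken as given (the SQL query against the ViewCVS DB and the parsing of
'cvsu' output are not shown), and build_update_command is a hypothetical
name.

```python
def build_update_command(server_changed, locally_modified):
    """Return the 'cvs update' argv for just the files that need attention.

    server_changed:   paths committed on the branch since the last update
    locally_modified: paths reported as changed in the working tree
    Returns None when there is nothing to update.
    """
    # A file may appear in both lists; update it only once.
    files = sorted(set(server_changed) | set(locally_modified))
    if not files:
        return None
    return ["cvs", "update"] + files
```

The returned list could be handed to subprocess.run() from the top of the
working tree. With fewer than 10 commit sessions per branch per day, the file
list stays tiny compared to a full-tree walk.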

=======================================================================

Soft Tagging design details
---------------------------------------

The problems
------------
The operation of tagging in CVS uses the file's RCS file to store the tags
and branches defined for the file.
- Currently we have ~250,000 files in our main CVS repository. Each tag
  operation needs to write to all these files. This takes hours to
  complete. During this time CVS performance is seriously degraded.
- Having all the tags in the files bloats the files by orders of
  magnitude. We found out that a typical file has 50KB
  worth of tags in it, and on average 5KB of real content.

Solution summary
----------------
- We use 'soft tagging'
- We define soft tags and branches by writing the tag name with branch & date
  in the file CVSROOT/stags.
- We avoid writing the tag in RCS files, so tag time is reduced to seconds.
- CVS is updated to look for symbolic tag names in the soft tags list as well
  as in the RCS list of symbols.
- Legacy hard tags can later be converted to soft tags at our leisure. This
  requires analyzing the tags on the files to make sure that all the
  revisions tagged with the tag share a single point in time on a specific
  branch (or, for that matter, a contiguous series of points, i.e. a specific
  period of the branch life time).

The solution
------------
- Keep in file CVSROOT/stags a list of soft tags:
  - Format of a line:
    for tags
    T <tag name> <branch>:<date UTC in RCS format> <repository>
    for branches
    B <tag name> <branch>:<date UTC in RCS format> <repository>
  - example: tag-4_0_5 branch-4_0 2007.05.28.11.04.19 # version 4.0.5
  - This defines the tag in terms of a date on the branch. This allows us to
    replace the writing to all the files with a single update in this file.
    Soft tagging takes seconds instead of hours for hard tagging.
  - Instead of branch+date the soft-tag can use the name of another tag - this
    serves as an aliasing mechanism.
  - The repository field is used to make sure that the tag is used only where
    it's supposed to be used. It can contain either the name of a top-level
    directory, e.g. 'seamonkey', or a path within it, like
    'seamonkey/browser/components/bookmarks/'.
    - Blocking users from doing 'cvs tag' outside the destined repository can
      be done by adding a filter in CVSROOT/taginfo.
- Users can define the tags as soft tags by adding a line for the tag in
  CVSROOT/stags. Optionally, one can choose (as done in Jungo) to auto-create
  this file based on the contents of an external system defining the tags and
  versions used (Jungo uses the same system to define names of tags and
  branches shown in our Bugzilla-based issue-tracking system).
- CVS update is modified to handle selection by soft tags - see "Selecting
  file version by tag" below.
- CVS commit is changed to do hard-branch only where necessary.
  See "Doing soft branching".
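
To make the stags line format concrete, here is a Python sketch of a parser
for it. It is an illustration, not the patch's C code. It follows the
"T/B <tag name> <branch>:<date> <repository>" format line above (the example
line above omits the type letter, the colon and the repository, so it would
not parse here). Treating '#' as a comment marker and recognising an alias by
the absence of a colon are assumptions, and SoftTag and parse_stags_line are
hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class SoftTag:
    kind: str                    # "T" for a tag, "B" for a branch
    name: str
    branch: Optional[str]        # None when the entry is an alias
    date: Optional[datetime]     # UTC, None when the entry is an alias
    alias_of: Optional[str]      # name of another tag, or None
    repository: str


def parse_stags_line(line):
    """Parse one CVSROOT/stags line; return None for blank/comment lines."""
    line = line.split("#", 1)[0].strip()   # assumption: '#' starts a comment
    if not line:
        return None
    fields = line.split()
    if len(fields) != 4 or fields[0] not in ("T", "B"):
        raise ValueError("malformed stags line: %r" % line)
    kind, name, target, repository = fields
    if ":" in target:
        branch, stamp = target.split(":", 1)
        # RCS date format: YYYY.MM.DD.hh.mm.ss, always UTC
        date = datetime.strptime(stamp, "%Y.%m.%d.%H.%M.%S")
        date = date.replace(tzinfo=timezone.utc)
        return SoftTag(kind, name, branch, date, None, repository)
    # No colon: the target names another tag (aliasing mechanism).
    return SoftTag(kind, name, None, None, target, repository)
```

Because the whole definition is one line in one file, creating a tag costs a
single small write regardless of how many files the repository holds.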

Selecting file version by tag
-----------------------------
For each file, we select the version during update using the logic described
below. Since checkout is based on update, this covers it as well.
This is implemented in CVS code in file src/rcs.c, function
translate_symtag().
- if parameter to "-r" matches 'dots separated numbers' format:
  - return the version it specifies
- if symbolic tag is in RCS - use the definition from RCS
- if the tag is in CVSROOT/stags, look up matching version on branch
  - if found matching version, return that version
- return NULL
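
The lookup order listed above can be sketched in Python. This mirrors the
order of the steps only; it is not the C code of translate_symtag(). The data
structures are assumptions made for the sketch: rcs_symbols maps a tag name
to a revision, soft_tags maps a tag name to a (branch, date) pair, and
branch_history maps a branch name to its revisions as (date, revision) pairs
sorted by date. Dates are RCS-format strings (YYYY.MM.DD.hh.mm.ss), which
compare correctly as plain strings.

```python
import re

# "dots separated numbers", e.g. 1.3 or 1.3.2.1
NUMERIC = re.compile(r"^\d+(\.\d+)*$")


def resolve_tag(tag, rcs_symbols, soft_tags, branch_history):
    """Return the revision selected by 'tag' for one file, or None."""
    # 1. A numeric revision is returned as-is.
    if NUMERIC.match(tag):
        return tag
    # 2. A hard tag stored in the RCS file wins over a soft tag.
    if tag in rcs_symbols:
        return rcs_symbols[tag]
    # 3. A soft tag selects the newest revision on its branch
    #    that is not later than the soft tag's date.
    if tag in soft_tags:
        branch, when = soft_tags[tag]
        candidates = [
            rev for date, rev in branch_history.get(branch, [])
            if date <= when
        ]
        if candidates:
            return candidates[-1]
    # 4. Nothing matched.
    return None
```

Checking the RCS symbols before the soft tags is what makes the mixed-mode
override work: a hard tag on two files overrides the repository-wide soft tag
for those files only.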

Doing soft branching
--------------------
- On commit:
  - if the file has a sticky tag set which is a soft tag of type branch, and
    it is not found in the file's RCS data:
    - do hard tagging of the file, using the version specified in the stags,
      i.e. add a branch for that soft-branch to the RCS file.
  - continue as usual
- This assures that branch tags are added to RCS files only when necessary.
  This way not only soft tagging is immediate but also soft-branching.
  The RCS file needs to be written to only when a commit to a file requires a
  real revision on the branch to be created.
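
The commit-time decision above can be sketched in Python. This is only an
illustration of the rule, not the patch's code; soft_tags is assumed to map a
tag name to a dict with a "kind" key ("T" or "B"), rcs_symbols is the set of
symbol names already present in the file's RCS data, and the step names in
the returned list are hypothetical.

```python
def hard_branch_needed(sticky_tag, soft_tags, rcs_symbols):
    """True when the sticky tag is a soft branch not yet in the RCS file."""
    entry = soft_tags.get(sticky_tag)
    if entry is None or entry["kind"] != "B":
        return False
    return sticky_tag not in rcs_symbols


def commit_steps(filename, sticky_tag, soft_tags, rcs_symbols):
    """List the steps a commit of 'filename' would perform, in order."""
    steps = []
    if sticky_tag is not None and hard_branch_needed(
        sticky_tag, soft_tags, rcs_symbols
    ):
        # Materialize the soft branch in this one RCS file only.
        steps.append(("add-branch-symbol", filename, sticky_tag))
    steps.append(("commit", filename))      # continue as usual
    return steps
```

Files that are never committed to on the branch never get the branch symbol
written, which is why creating a soft branch stays immediate.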

=======================================================================

Any comments and feedback are welcome.

Final notes:
------------------
- I will post a patch with all our CVS changes (diff vs. GNU CVS 1.12.13) to
  the CVS project in the GNU Savannah site
  (https://savannah.nongnu.org/patch/?group=cvs).

- Work on CVS improvements in Jungo began in 2003.
  People involved, apart from me, included Derry Shribman & Or Tal.

Yaron Yogev <[EMAIL PROTECTED]>
Jungo LTD

_______________________________________________
Bug-cvs mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/bug-cvs
