Re: Idea for reducing disk IO on tagging operations
* Doug Lee ([EMAIL PROTECTED]) wrote:
> I followed this discussion only loosely and kept silent because I
> suspect someone will shoot me to pieces for the complaint I'm about to
> make, but now that we're to the stage of actual implementation, I
> guess I'd like to say this anyway...

Hey that's OK.

> I have reservations about any system that makes whitespace significant
> in a text file. I can make an exception for indent levels, as used by
> Python, because these are visible and errors are obvious without
> resorting to odd tactics like hex editors, vi's :list command, etc.

Let me make it clear that this patch *in no way* makes whitespace
significant; in actual fact it only works because it isn't significant.

What it does is put a glob of whitespace in when it is convenient;
nothing changes in the parsing or anything - so just like before that
whitespace is completely ignored. The trick is that when it comes to add
a tag it checks to see if there is spare white space and if so overwrites
it; if you removed the white space or otherwise fettled with the file
that is fine; it won't perform the optimisation.

Indeed this means that an existing cvs client can quite happily read a
repository which has had my patch inflicted on it. (The existing cvs code
that rewrites the file will remove any excess white space you added up
there anyway.)

Dave
 -Open up your eyes, open up your mind, open up your code ---
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K | Happy  \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA  | In Hex /
 \ _|_ http://www.treblig.org |___/

___
Info-cvs mailing list
Info-cvs@gnu.org
http://lists.gnu.org/mailman/listinfo/info-cvs
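[Illustration] The check-then-overwrite step described in this message can be sketched in C roughly as follows. This is not taken from the patch: the function name `overwrite_padding`, the `pad_start` parameter and the in-memory buffer are all assumptions made to keep the example small. The point it shows is that the parser is untouched; the only decision is whether enough ignorable whitespace is present to hold the new tag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the patch: look at the run of spaces and
 * newlines starting at pad_start in buf. If the run is long enough,
 * overwrite its beginning with tag_text and return 1. Otherwise leave
 * buf untouched and return 0, so the caller falls back to the normal
 * full rewrite of the file. */
static int overwrite_padding(char *buf, size_t buf_len, size_t pad_start,
                             const char *tag_text)
{
    size_t need = strlen(tag_text);
    size_t avail = 0;

    assert(pad_start <= buf_len);

    /* Count the contiguous ignorable whitespace at pad_start. */
    while (pad_start + avail < buf_len &&
           (buf[pad_start + avail] == ' ' || buf[pad_start + avail] == '\n'))
        avail++;

    /* Keep at least one whitespace byte after the tag as a separator. */
    if (avail < need + 1)
        return 0;

    memcpy(buf + pad_start, tag_text, need);
    return 1;
}
```

If someone has stripped the padding by hand, the function simply returns 0 and nothing is lost except the speed-up, which matches the behaviour described in the message.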
Re: Idea for reducing disk IO on tagging operations
Hi,
  Well, I've had a crack at implementing the optimisation, and attached
is a patch which seems to work - but there is at least one nasty hack in
it; more about that in a sec.

To enable it you need to add:

  TagOverwriteEnable=yes

to the config file in the CVSROOT; without that it should not change
behaviour in any way (except adding that as a commented-out option, with
a warning, to a newly created repository).

It won't give you any performance benefit on the first tag, but should
give something on subsequent tags. I see some improvement (~15%), but it
is variable, on a large repository that doesn't fit in memory on my home
machine.

It is my first dig into the CVS code base, so I would appreciate (gentle)
comments.

Now some details:

  1) The real nasty hack; this is something that I hadn't thought of
     (and I don't think anyone else noticed?) in my original description;
     the permissions on the rcs files are read-only, so when I need to
     open them to overwrite I can't - this is a pain; this patch has a
     gratuitous (and obviously WRONG) hack of chmod'ing the file before
     the open - I'm open to any suggestions *if* there is a right way of
     doing this. (This was a pain because it was at the very last stage
     of the patch that I noticed this!)

  2) I don't currently create the dummy ,foo, locking file.

  3) I haven't written any docs yet.

  4) I needed to get a couple of values out of rcsbuf_getkey and have
     shoved them in globals for the moment; I was looking for a neater
     way that wouldn't mean changing all the callers.

  5) I'm worried about the right types to use for file offsets in a
     portable way. (Has anyone tried cvs with rcs files over 2GB?)

The patch is against 1.12.9, which is the version my Debian happened to
have.

As I say, suggestions and experiences welcome.

Dave

diff -ur orig/cvs-1.12.9/ChangeLog cvs-1.12.9/ChangeLog
--- orig/cvs-1.12.9/ChangeLog 2004-06-09 15:52:32.0 +0100
+++ cvs-1.12.9/ChangeLog 2005-03-24 23:43:48.0 +
@@ -1,3 +1,6 @@
+2005-03-24 Dave Gilbert <[EMAIL PROTECTED]>
+ * Added fast tagging mechanism; rcs.h/c, parseinfo.c,mkmodules.c
+
 2004-06-09 Derek Price <[EMAIL PROTECTED]>
  * NEWS: Note Stefan & Sebastian's security fixes.
diff -ur orig/cvs-1.12.9/src/admin.c cvs-1.12.9/src/admin.c
--- orig/cvs-1.12.9/src/admin.c 2004-03-22 15:37:34.0 +
+++ cvs-1.12.9/src/admin.c 2005-03-27 20:39:38.0 +0100
@@ -792,7 +792,7 @@
      || (rev = RCS_tag2rev (rcs, p))) /* tag2rev may exit */
  {
      RCS_check_tag (tag); /* exit if not a valid tag */
-     RCS_settag (rcs, tag, rev);
+     RCS_settag (rcs, tag, rev, NULL);
      free (rev);
  }
  else
diff -ur orig/cvs-1.12.9/src/commit.c cvs-1.12.9/src/commit.c
--- orig/cvs-1.12.9/src/commit.c 2004-06-09 15:52:37.0 +0100
+++ cvs-1.12.9/src/commit.c 2005-03-27 20:39:45.0 +0100
@@ -2144,7 +2144,7 @@
  head = RCS_getversion (rcs, NULL, NULL, 0, (int *) NULL);
  magicrev = RCS_magicrev (rcs, head);
- retcode = RCS_settag (rcs, tag, magicrev);
+ retcode = RCS_settag (rcs, tag, magicrev, NULL);
  RCS_rewrite (rcs, NULL, NULL);
  free (head);
diff -ur orig/cvs-1.12.9/src/import.c cvs-1.12.9/src/import.c
--- orig/cvs-1.12.9/src/import.c 2004-04-27 22:08:40.0 +0100
+++ cvs-1.12.9/src/import.c 2005-03-27 20:39:59.0 +0100
@@ -770,7 +770,7 @@
  if (noexec)
      return (0);
- if ((retcode = RCS_settag(rcs, vtag, vbranch)) != 0)
+ if ((retcode = RCS_settag(rcs, vtag, vbranch, NULL)) != 0)
  {
      ierrno = errno;
      fperrmsg (logfp, 0, retcode == -1 ? ierrno : 0,
@@ -792,7 +792,7 @@
  vers = Version_TS (&finfo, NULL, vtag, NULL, 1, 0);
  for (i = 0; i < targc; i++)
  {
-     if ((retcode = RCS_settag (rcs, targv[i], vers->vn_rcs)) == 0)
+     if ((retcode = RCS_settag (rcs, targv[i], vers->vn_rcs, NULL)) == 0)
          RCS_rewrite (rcs, NULL, NULL);
      else
      {
diff -ur orig/cvs-1.12.9/src/mkmodules.c cvs-1.12.9/src/mkmodules.c
--- orig/cvs-1.12.9/src/mkmodules.c 2004-05-29 05:48:52.0 +0100
+++ cvs-1.12.9/src/mkmodules.c 2005-03-24 23:43:38.0 +
@@ -349,6 +349,23 @@
 "# Be warned that these strings could be disabled in any new version of CVS.\n",
 "UseNewInfoFmtStrings=yes\n",
 #endif /* SUPPORT_OLD_INFO_FMT_STRINGS */
+"# Options relating to the Tag overwrit
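[Illustration] On point 5 of the message above (portable file offset types): on POSIX systems one common answer is `off_t` together with `fseeko`/`ftello` rather than `long` with `fseek`/`ftell`. The sketch below is only that; the helper name `stream_size` is made up for the example, it is not part of the patch, and non-POSIX platforms would need their own equivalent.

```c
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64   /* asks glibc for a 64-bit off_t on 32-bit hosts */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: return the size of an open stream as an off_t,
 * so that sizes above 2GB remain representable where off_t is 64 bits.
 * Leaves the stream positioned at end of file. Returns -1 on error. */
static off_t stream_size(FILE *fp)
{
    assert(fp != NULL);

    if (fseeko(fp, 0, SEEK_END) != 0)
        return -1;
    return ftello(fp);
}
```

Offsets recorded while parsing (such as where the spare whitespace starts) could be kept in the same type and passed back to `fseeko` later.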
Re: Idea for reducing disk IO on tagging operations
* Jim Hyslop ([EMAIL PROTECTED]) wrote:
> Dr. David Alan Gilbert wrote:
> > 2) I could do with a better understanding of the directory locks;
> > pointers? I've read the top of lock.c but it still doesn't tell me
> > enough; for example there seem to be multiple lock files used - but
> > then surely the creation of them isn't atomic? Or is there one lock
> > file used for both reading and writing?
>
> The locking process is explained in the manual, at
> https://www.cvshome.org/docs/manual/cvs-1.11.19/cvs_2.html#SEC17

Thanks Jim for pointing me at that (I'd had a good search through the
FAQ rather than the manual). (And to Paul - apologies if I misquoted in
that last email.)

OK; this convinces me that I don't need to worry about cvs reading my
file while it is being modified. Together with the restriction that I
only perform my trick if the write is entirely within a block, I feel
reasonably safe.

I'm going to have a crack at making this optimisation and will forward a
copy here for discussion when I've done it.

Dave
Re: Idea for reducing disk IO on tagging operations
* Mark D. Baushke ([EMAIL PROTECTED]) wrote:
> Paul Sander <[EMAIL PROTECTED]> writes:
>
> Actually, if you look closely, I believe that CVS will not do read-only
> RCS operations if a CVS write-lock exists for the directory. Of course,
> ViewCVS and CVSweb do it all the time as do many of the other add-ons.

I'm getting more worried about this one for 2 separate reasons:

  1) There is talk of cvs -n for diff and the like which seems to
     suggest it ignores locks.

  2) I could do with a better understanding of the directory locks;
     pointers? I've read the top of lock.c but it still doesn't tell me
     enough; for example there seem to be multiple lock files used - but
     then surely the creation of them isn't atomic? Or is there one lock
     file used for both reading and writing?

> > There's also the interrupt issue: Killing an update before it
> > completes leaves the RCS file corrupt. You'd have to build in some
> > kind of crash recovery. But RCS already has that by way of the comma
> > file, which can simply be deleted. Other crash recovery algorithms
> > usually involve transaction logs that can be reversed and replayed, or
> > the creation of backup copies. None of these are more efficient than
> > the existing RCS update protocol.
>
> Agreed. This is a very big deal.

Actually I'm becoming less worried by this; I'm failing to see any way
that a single write() system call, to a region that doesn't cross a block
boundary, could partially fail; but I'm up for suggestions.

Dave
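[Illustration] A minimal sketch of "one system call for the whole overwrite", assuming a POSIX system and using `pwrite(2)` so that no separate seek is needed. The helper name `overwrite_in_place` is hypothetical and not from any patch in this thread. Note that issuing a single call says nothing by itself about what a given filesystem does if the machine dies mid-write; that question is exactly what is being debated here.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: overwrite strlen(text) bytes at offset using one
 * pwrite(2) call. Returns 0 if every byte was written, -1 if the call
 * failed or was short (in which case the caller needs a recovery path,
 * for example falling back to a full rewrite of the file). */
static int overwrite_in_place(int fd, off_t offset, const char *text)
{
    size_t len = strlen(text);
    ssize_t written;

    assert(fd >= 0);

    written = pwrite(fd, text, len, offset);
    if (written < 0) {
        perror("pwrite");
        return -1;
    }
    return ((size_t) written == len) ? 0 : -1;
}
```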
Re: Idea for reducing disk IO on tagging operations
* Mark D. Baushke ([EMAIL PROTECTED]) wrote:
> Dr. David Alan Gilbert <[EMAIL PROTECTED]> writes:
>
> > OK, if I create a dummy ",foo.c," before
> > modifying (or create a hardlink with that name
> > to foo.c,v ?) would that be sufficient?
>
> I would say that it is likely necessary, but may
> not be sufficient.

Hmm ok.

> > Or perhaps create the ,foo.c, as I normally
> > would - but if I can use this overwrite trick on
> > the original then I just delete the ,foo.c,
> > file.
>
> I am unclear how this lets you perform a speedup.

I only create the ,foo.c, file - I don't write anything into it; the
existence of the file is enough to act as the RCS lock. If I can do my
in-place modification then I delete this file after doing it; if not,
then I proceed as normal and just write the ,foo.c, file and do the
rename as you normally would.

> > Is the problem that things are allowed to read
> > the original foo.c,v while you are creating the
> > new version?
>
> I am given to understand that many of the
> ancillary tools that surround CVS make use of
> being able to read a consistent ,v file at all
> times.

This is very tricky; I don't think in our case we use any such tools (we
might have a cvs/web thing for browsing it, but this is probably not
critical); and as long as I can guarantee that what I do is safe as far
as CVS itself is concerned, I think I'd be prepared to go for it as a
configurable mechanism.

> > So the issue is what happens if the interrupt
> > occurs as I'm overwriting the white space to add
> > a tag; hmm yes;
>
> Correct. Depending on the filesystem kind and the
> level of I/O, your rewrite could impact up to three
> fileblocks and the directory data.
>
> > is it possible to guard against this by using a
> > single call to write(2) for that?
>
> Not for all possible filesystem types.
>
> > Is that the problem you are thinking of?
>
> Yes. Even worse things can happen in this regard
> if the filesystem is a 'stateless' one such as an
> NFS mounted directory (we keep advising folks
> against using them, but I know for a fact that
> they are still used).

OK, my conscience will let me carefully ignore NFS issues given the pain
it causes me elsewhere (and I make my mechanism switchable).

What happens if I only used the overwrite mechanism when none of the
characters being modified crossed a 512 (e.g.) byte boundary offset in
the file? Since the spaces were actually written in a previous operation
we can assume that the space is allocated and no allocation operation is
going to happen at this point (mumble filesystem journalling mumble!).

> > Sure, separating the tagging data out is much
> > neater; but what I was looking for here was a
> > simple speed up which didn't require anything
> > extra and would be fully compatible with
> > existing tools.
>
> And you are finding that existing tools torture
> the assumptions you are able to make about the CVS
> repository.

Nod; it is quite painful!

> FWIW: (In my personal experience) using a SAN
> solution for your repository storage allows you
> much better throughput for all write operations in
> the general case as the SAN can guarantee the
> writes are okay before the disk actually does it.

But when you throw a GB of writes at them in a short time from a tag
across our whole repository they aren't going to be happy - they are
going to want to get rid of that backlog of write data ASAP.

> Optimizing for tagging does not seem very useful
> to me as we typically do not drop that many tags
> on our repository.

In the company I work for we are very tag heavy, but more importantly it
is the tagging that gets in people's way and places the strain on the
write bandwidth of the discs/RAID.

Dave
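[Illustration] The boundary test proposed in this message is simple arithmetic. A sketch, with the function name `within_one_block` and its parameter types chosen for the example only; the 512-byte figure is the "e.g." value from the message, not a property of any particular filesystem.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check: does the byte range [offset, offset + len) lie
 * entirely inside a single block of block_size bytes? The first and the
 * last byte of the range must fall in the same block. */
static int within_one_block(unsigned long offset, size_t len,
                            size_t block_size)
{
    assert(block_size > 0);

    if (len == 0)
        return 1;               /* nothing to write, nothing to cross */
    if (len > block_size)
        return 0;               /* cannot possibly fit in one block */
    return offset / block_size == (offset + len - 1) / block_size;
}
```

With 512-byte blocks, a 12-byte tag at offset 500 passes (bytes 500 to 511), while a 13-byte tag at the same offset touches byte 512 and would have to take the normal rewrite path.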
Re: Idea for reducing disk IO on tagging operations
* Paul Sander ([EMAIL PROTECTED]) wrote:

Hi Paul,
  Thanks for the reply,

> Everything that Mark says is true. I'll add that some shops optimize
> their read operations under certain conditions, and such optimizations
> would break if the RCS files are updated in-place.
>
> What happens is that, if the version of every file can be identified in
> advance (using version number, tag, or branch/timestamp pair) then they
> can invoke RCS directly to fetch file versions, read metadata, and so
> on. This sidesteps CVS' overhead and can increase performance by as

So are these tricks *never* performed by cvs itself? i.e. would my trick
(if I can solve the interrupted write case) be completely safe with any
use of cvs as long as you didn't access the files externally?

Dave
Re: Idea for reducing disk IO on tagging operations
[Resend: I sent it with the wrong 'from' address - apologies if you get
both]

* Mark D. Baushke ([EMAIL PROTECTED]) wrote:

Hi Mark,
  Thanks for your reply.

> Dr. David Alan Gilbert <[EMAIL PROTECTED]> writes:
>
> > So - here are my questions/ideas - I'd appreciate comments to tell
> > me whether I'm on the right lines:
> > 1) As I understand it the tag data is the
> > first of the 3 main data structures in the RCS
> > file (tag, comments, diffs) and that when I do
> > pretty much any CVS operation I rewrite the
> > whole file - is this correct?
>
> CVS write operations on a foo.c,v repository file
> will write ,foo.c, and then when the write
> operation is successful and without any errors, it
> does a rename (",foo.c,", "foo.c,v"); to make the
> new version the official version. While the
> ,foo.c, file exists, RCS commands will consider
> the file locked.
>
> It is desirable to use RCS write semantics as many
> other tools out there (cf, ViewCVS) use RCS on the
> repository and want to obey RCS locking.

OK, if I create a dummy ",foo.c," before modifying (or create a hardlink
with that name to foo.c,v ?) would that be sufficient?

Or perhaps create the ,foo.c, as I normally would - but if I can use this
overwrite trick on the original then I just delete the ,foo.c, file.

Is the problem that things are allowed to read the original foo.c,v
while you are creating the new version?

> be configured). So, yes, whitespace is mostly
> irrelevant between sections.

Great.

> > 3) So the idea is that when I add a tag I add
> > a bunch of white space after the tag (let's say
> > 1KB of spaces split into 64 byte lines or
> > similar); when I come to add the next tag I
> > check if there is plenty of white space, if
> > there is then instead of rewriting the file I
> > just overwrite the white space with my new tag
> > data; if there is no space then as I rewrite
> > the file I add another lump of white space.
>
> This has the potential to more easily corrupt the
> RCS file if the operation is interrupted for any
> reason.

The rewrite that adds the extra space would be performed using the
existing mechanism (with just some extra space added in RCS_rewrite), so
that can't be a problem.

So the issue is what happens if the interrupt occurs as I'm overwriting
the white space to add a tag; hmm yes; is it possible to guard against
this by using a single call to write(2) for that?

Is that the problem you are thinking of?

> It would be more robust to enhance CVS to use an
> external database for tagging information instead
> of putting the tagging information into the RCS
> files directly than to rewrite parts of the RCS
> file and hope that the operation didn't corrupt
> the file along the way.

Sure, separating the tagging data out is much neater; but what I was
looking for here was a simple speed up which didn't require anything
extra and would be fully compatible with existing tools.

> You may wish to consider looking at Meta-CVS as I
> believe that Kaz keeps a lot of the branching
> information outside of the RCS files already.
>
> See http://users.footprints.net/~kaz/mcvs.html
> for more details on Meta-CVS.

If I was changing to another tool then I'd have a much larger set of
tools to consider (e.g. subversion) but I'd rather stick with plain CVS
if I can - I've got clients on lots of (weird) OSs that work via pserver
and an infinite number of scripts built around CVS.

Thanks for the reply,

Dave
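[Illustration] The write-then-rename protocol Mark describes in this message can be sketched in C as below. This is not the CVS or RCS source; the function name `replace_via_comma_file` and its arguments are assumptions for the example, and real code would also deal with ownership, permissions of the original file and error reporting.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch of the comma-file protocol: write the complete new
 * contents to lock_path (e.g. ",foo.c,"), created exclusively so that
 * its existence acts as the lock, then rename it over rcs_path
 * (e.g. "foo.c,v"). Returns 0 on success, -1 on failure. */
static int replace_via_comma_file(const char *rcs_path, const char *lock_path,
                                  const char *contents)
{
    size_t len = strlen(contents);
    ssize_t written;
    int fd, close_rc;

    assert(rcs_path != NULL && lock_path != NULL);

    /* O_EXCL makes creation fail if the comma file already exists. */
    fd = open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0444);
    if (fd < 0)
        return -1;              /* locked by someone else, or an error */

    written = write(fd, contents, len);
    close_rc = close(fd);
    if (written < 0 || (size_t) written != len || close_rc != 0) {
        unlink(lock_path);      /* the original ,v file is still intact */
        return -1;
    }

    /* Readers see either the old file or the new one, never a mix. */
    if (rename(lock_path, rcs_path) != 0) {
        unlink(lock_path);
        return -1;
    }
    return 0;
}
```

The dummy-lock variant discussed in this message would use the same exclusive create, skip the write, modify foo.c,v in place, and then `unlink` the comma file instead of renaming it.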
Idea for reducing disk IO on tagging operations
Hi,
  I maintain a system that is used to hold a rather large CVS repository
(~1GB give or take) which could do with being faster. Tagging in
particular is slow and I don't think cpu or ram is the issue (it is a
dual xeon with 3GB of RAM).

My suspicion is that at least one of the problems is that when a tag is
added most of the rcs files are rewritten, giving a sudden large amount
of data that must be written to disc.

So - here are my questions/ideas - I'd appreciate comments to tell me
whether I'm on the right lines:

  1) As I understand it the tag data is the first of the 3 main data
     structures in the RCS file (tag, comments, diffs) and that when I do
     pretty much any CVS operation I rewrite the whole file - is this
     correct?

  2) White space appears to be irrelevant in RCS files; so adding
     arbitrary amounts in between sections should leave files still
     fully compatible with existing RCS/cvs tools.

  3) So the idea is that when I add a tag I add a bunch of white space
     after the tag (let's say 1KB of spaces split into 64 byte lines or
     similar); when I come to add the next tag I check if there is plenty
     of white space; if there is, then instead of rewriting the file I
     just overwrite the white space with my new tag data; if there is no
     space, then as I rewrite the file I add another lump of white space.

  4) Whether dummy white space is added, and how much, is controlled by
     the existing size of the RCS file; so an RCS file that is only a few
     KB won't have any space added; that way this mechanism doesn't slow
     down/bloat small repositories. The amount of white space might be
     chosen to align data structures with disk block boundaries.

  5) My main concern is to do with concurrency/consistency requirements;
     is the file rewrite essential to ensure consistency, or is the
     locking that is carried out sufficient?

Does this make sense?

Dave
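[Illustration] Points 3 and 4 of this message can be sketched as a small padding generator. Everything here is an assumption made for the example: the function name `make_padding`, the 16KB threshold below which no padding is added (the message only says "a few KB"), and the fixed 1KB total. The 64-byte line length is the figure suggested in point 3.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All three constants are assumptions for illustration only. */
#define PAD_MIN_FILE_SIZE (16 * 1024) /* smaller files get no padding */
#define PAD_TOTAL_BYTES   1024        /* "1KB of spaces" from point 3 */
#define PAD_LINE_BYTES    64          /* 63 spaces plus '\n' per line */

/* Hypothetical helper: fill out with padding suitable for an RCS file of
 * file_size bytes and return the number of bytes produced (0 for small
 * files). out must have room for PAD_TOTAL_BYTES bytes. */
static size_t make_padding(size_t file_size, char *out)
{
    size_t i;

    assert(out != NULL);

    if (file_size < PAD_MIN_FILE_SIZE)
        return 0;

    memset(out, ' ', PAD_TOTAL_BYTES);
    /* End every 64-byte line with a newline so no line grows too long. */
    for (i = PAD_LINE_BYTES - 1; i < PAD_TOTAL_BYTES; i += PAD_LINE_BYTES)
        out[i] = '\n';
    return PAD_TOTAL_BYTES;
}
```

A size-dependent policy (more padding for larger files, or rounding up to a block boundary as point 4 suggests) would replace the fixed `PAD_TOTAL_BYTES` with a computed value.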