Re: Idea for reducing disk IO on tagging operations

2005-03-28 Thread Dr. David Alan Gilbert
* Doug Lee ([EMAIL PROTECTED]) wrote:
> I followed this discussion only loosely and kept silent because I
> suspect someone will shoot me to pieces for the complaint I'm about to
> make, but now that we're to the stage of actual implementation, I
> guess I'd like to say this anyway...

Hey that's OK.

> I have reservations about any system that makes whitespace significant
> in a text file.  I can make an exception for indent levels, as used by
> Python, because these are visible and errors are obvious without
> resorting to odd tactics like hex editors, vi's :list command, etc.

Let me make it clear that this patch *in no way* makes whitespace
significant; in actual fact it only works because it isn't
significant.

What it does is put a glob of whitespace in when it is convenient;
nothing changes in the parsing or anything - so just like before
that whitespace is completely ignored.

The trick is that when it comes to add a tag it checks to see if there
is spare white space and if so overwrites it; if you removed
the white space or otherwise fettled with the file that is fine;
it won't perform the optimisation.

Indeed this means that an existing cvs client can quite happily
read a repository which has had my patch inflicted on it.

(The existing cvs code that rewrites the file will remove any
excess white space you added up there anyway.)
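
To make the mechanism concrete, here is a minimal in-memory sketch of the
check described above. It is not the code from the patch: the names
padding_run and try_overwrite_padding are hypothetical, and the sketch
simply assumes the caller already knows the position where the padding
is expected to start.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the run of spaces, tabs and newlines starting at buf[pos]. */
static size_t padding_run(const char *buf, size_t buflen, size_t pos)
{
    size_t n = 0;

    while (pos + n < buflen &&
           (buf[pos + n] == ' ' || buf[pos + n] == '\t' ||
            buf[pos + n] == '\n'))
        n++;
    return n;
}

/*
 * Try to place `text` over the padding that starts at buf[pos].
 * Returns 1 if the text fitted and was written in place, 0 if the
 * padding was missing or too small.  In the 0 case nothing is changed
 * and the caller would fall back to the normal full rewrite.
 */
static int try_overwrite_padding(char *buf, size_t buflen, size_t pos,
                                 const char *text)
{
    size_t need = strlen(text);

    assert(pos <= buflen);
    if (padding_run(buf, buflen, pos) < need)
        return 0;
    memcpy(buf + pos, text, need);
    return 1;
}
```

Because the parser never looks at the padding, a file that has had its
padding stripped by hand just takes the 0 path and is rewritten as before.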

Dave
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert| Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
 \ _|_ http://www.treblig.org   |___/


___
Info-cvs mailing list
Info-cvs@gnu.org
http://lists.gnu.org/mailman/listinfo/info-cvs


Re: Idea for reducing disk IO on tagging operations

2005-03-28 Thread Dr. David Alan Gilbert
Hi,
  Well, I've had a crack at implementing the optimisation; and attached
is a patch which seems to work - but there is at least one nasty
hack in it; more about that in a sec.

To enable it you need to add:
  TagOverwriteEnable=yes
to the config file in the CVSROOT; without that it should not
change behaviour in any way (except adding that as a commented
out option with warning to a newly created repository).

It won't give you any performance benefit on the first tag, but should
give something on subsequent tags.  I see some improvement (~15%)
but it is variable, on a large repository that doesn't fit in
memory on my home machine.

It is my first dig into the CVS code base, so I would appreciate
(gentle) comments.

Now some details;
  1) The real nasty hack; this is something that I hadn't thought
  of (and I don't think anyone else noticed?) in my original
  description; the permissions on the rcs files are read-only,
  so when I need to open them to overwrite I can't - this is a pain;
  this patch has a gratuitous (and obviously WRONG) hack of
  chmod'ing the file before the open - I'm open to any suggestions *if*
  there is a right way of doing this. (This was a pain because
  it was at the very last stage of the patch that I noticed this!).

  2) I don't currently create the dummy ,foo, locking file.

  3) I haven't written any docs yet.

  4) I needed to get a couple of values out of rcsbuf_getkey and
  have shoved them in globals for the moment; I was looking for a
  neater way that wouldn't mean changing all the callers.

  5) I'm worried about the right types to use for file offsets
  in a portable way. (Has anyone tried cvs with rcs files over 2GB?)
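
On point 5, one commonly used POSIX approach is to keep offsets in off_t
and use fseeko()/ftello() instead of fseek()/ftell(). The sketch below
only illustrates that idea; the helper names current_offset and seek_to
are hypothetical, and whether this fits CVS's own portability layer
(for example on non-POSIX platforms) is not something this sketch answers.

```c
/* Both macros must come before any system header is included. */
#define _FILE_OFFSET_BITS 64      /* ask for a 64-bit off_t where optional */
#define _POSIX_C_SOURCE 200809L   /* make fseeko()/ftello() visible */

#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Return the current position of `fp` as an off_t, or (off_t)-1 on error.
 * ftello() returns off_t rather than long, so it is not limited to 2GB
 * on systems where long is 32 bits.
 */
static off_t current_offset(FILE *fp)
{
    assert(fp != NULL);
    return ftello(fp);
}

/* Seek to an absolute off_t position; returns 0 on success, -1 on error. */
static int seek_to(FILE *fp, off_t pos)
{
    assert(fp != NULL);
    return fseeko(fp, pos, SEEK_SET);
}
```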

The patch is against 1.12.9 which is the version my debian happened to
have.

As I say, suggestions - and experiences welcome.

Dave
diff -ur orig/cvs-1.12.9/ChangeLog cvs-1.12.9/ChangeLog
--- orig/cvs-1.12.9/ChangeLog   2004-06-09 15:52:32.0 +0100
+++ cvs-1.12.9/ChangeLog        2005-03-24 23:43:48.0 +0000
@@ -1,3 +1,6 @@
+2005-03-24  Dave Gilbert <[EMAIL PROTECTED]>
+  * Added fast tagging mechanism; rcs.h/c, parseinfo.c,mkmodules.c
+
 2004-06-09  Derek Price  <[EMAIL PROTECTED]>
 
* NEWS: Note Stefan & Sebastian's security fixes.
diff -ur orig/cvs-1.12.9/src/admin.c cvs-1.12.9/src/admin.c
--- orig/cvs-1.12.9/src/admin.c 2004-03-22 15:37:34.0 +0000
+++ cvs-1.12.9/src/admin.c  2005-03-27 20:39:38.0 +0100
@@ -792,7 +792,7 @@
 || (rev = RCS_tag2rev (rcs, p))) /* tag2rev may exit */
{
RCS_check_tag (tag); /* exit if not a valid tag */
-   RCS_settag (rcs, tag, rev);
+   RCS_settag (rcs, tag, rev, NULL);
free (rev);
}
 else
diff -ur orig/cvs-1.12.9/src/commit.c cvs-1.12.9/src/commit.c
--- orig/cvs-1.12.9/src/commit.c        2004-06-09 15:52:37.0 +0100
+++ cvs-1.12.9/src/commit.c 2005-03-27 20:39:45.0 +0100
@@ -2144,7 +2144,7 @@
head = RCS_getversion (rcs, NULL, NULL, 0, (int *) NULL);
magicrev = RCS_magicrev (rcs, head);
 
-   retcode = RCS_settag (rcs, tag, magicrev);
+   retcode = RCS_settag (rcs, tag, magicrev, NULL);
RCS_rewrite (rcs, NULL, NULL);
 
free (head);
diff -ur orig/cvs-1.12.9/src/import.c cvs-1.12.9/src/import.c
--- orig/cvs-1.12.9/src/import.c        2004-04-27 22:08:40.0 +0100
+++ cvs-1.12.9/src/import.c 2005-03-27 20:39:59.0 +0100
@@ -770,7 +770,7 @@
 if (noexec)
return (0);
 
-if ((retcode = RCS_settag(rcs, vtag, vbranch)) != 0)
+if ((retcode = RCS_settag(rcs, vtag, vbranch, NULL)) != 0)
 {
ierrno = errno;
fperrmsg (logfp, 0, retcode == -1 ? ierrno : 0,
@@ -792,7 +792,7 @@
 vers = Version_TS (&finfo, NULL, vtag, NULL, 1, 0);
 for (i = 0; i < targc; i++)
 {
-   if ((retcode = RCS_settag (rcs, targv[i], vers->vn_rcs)) == 0)
+   if ((retcode = RCS_settag (rcs, targv[i], vers->vn_rcs, NULL)) == 0)
RCS_rewrite (rcs, NULL, NULL);
else
{
diff -ur orig/cvs-1.12.9/src/mkmodules.c cvs-1.12.9/src/mkmodules.c
--- orig/cvs-1.12.9/src/mkmodules.c 2004-05-29 05:48:52.0 +0100
+++ cvs-1.12.9/src/mkmodules.c  2005-03-24 23:43:38.0 +0000
@@ -349,6 +349,23 @@
 "# Be warned that these strings could be disabled in any new version of CVS.\n",
 "UseNewInfoFmtStrings=yes\n",
 #endif /* SUPPORT_OLD_INFO_FMT_STRINGS */
+"# Options relating to the Tag overwrit

Re: Idea for reducing disk IO on tagging operations

2005-03-23 Thread Dr. David Alan Gilbert
* Jim Hyslop ([EMAIL PROTECTED]) wrote:
> Dr. David Alan Gilbert wrote:
> >  2) I could do with a better understanding of the directory locks;
> >  pointers? I've read the top of lock.c but it still doesn't tell me
> >  enough; for example there seem to be multiple lock files used - but
> >  then surely the creation of them isn't atomic? Or is there one lock
> >  file used for both reading and writing?
> The locking process is explained in the manual, at 
> https://www.cvshome.org/docs/manual/cvs-1.11.19/cvs_2.html#SEC17

Thanks Jim for pointing me at that (I'd had a good search through
the FAQ rather than the manual).

(and to Paul - apologies if I misquoted in that last email)

OK; this convinces me that I don't need to worry about cvs reading
my file while it is being modified.  Together with the restriction
that I only perform my trick if the write is entirely within
a block, I feel reasonably safe.

I'm going to have a crack at making this optimisation and will
forward a copy here for discussion when I've done it.

Dave


Re: Idea for reducing disk IO on tagging operations

2005-03-22 Thread Dr. David Alan Gilbert
* Mark D. Baushke ([EMAIL PROTECTED]) wrote:
> Paul Sander <[EMAIL PROTECTED]> writes:
> 
> Actually, if you look closely, I believe that CVS will not do read-only
> RCS operations if a CVS write-lock exists for the directory. Of course,
> ViewCVS and CVSweb do it all the time as do many of the other add-ons.

I'm getting more worried about this one for 2 separate reasons:
  1) There is talk of cvs -n for diff and the like which seems to
  suggest it ignores locks.
  2) I could do with a better understanding of the directory locks;
  pointers? I've read the top of lock.c but it still doesn't tell me
  enough; for example there seem to be multiple lock files used - but
  then surely the creation of them isn't atomic? Or is there one lock
  file used for both reading and writing?


> > There's also the interrupt issue:  Killing an update before it
> > completes leaves the RCS file corrupt.  You'd have to build in some
> > kind of crash recovery.  But RCS already has that by way of the comma
> > file, which can simply be deleted.  Other crash recovery algorithms
> > usually involve transaction logs that can be reversed and replayed, or
> > the creation of backup copies.  None of these are more efficient than
> > the existing RCS update protocol.
> 
> Agreed. This is a very big deal.

Actually I'm becoming less worried by this; I'm failing to see any way
that a single write() system call that doesn't cross a block
boundary could partially fail; but I'm up for suggestions.
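
As a rough illustration of "a single system call", the sketch below
overwrites a byte range with exactly one pwrite() and treats a short
write as failure. The function name overwrite_in_place is hypothetical
and this is not the patch's code. It also does not settle the question
raised here: whether such a write can be torn by a crash depends on the
kernel and filesystem.

```c
#define _POSIX_C_SOURCE 200809L   /* make pwrite() visible */

#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Overwrite `len` bytes of the file at `path`, starting at absolute
 * offset `off`, using one pwrite() call.  Returns 0 only if the whole
 * buffer was accepted in that one call; an open error, a write error or
 * a short write all return -1.  There is deliberately no retry loop.
 */
static int overwrite_in_place(const char *path, off_t off,
                              const char *data, size_t len)
{
    int fd;
    ssize_t n;

    assert(path != NULL && data != NULL);
    fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;
    n = pwrite(fd, data, len, off);   /* the single write */
    if (close(fd) == -1)
        return -1;
    return (n == (ssize_t)len) ? 0 : -1;
}
```

pwrite() takes the offset as an argument, so no separate seek is needed
and the file is never extended as long as the range lies inside it.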

Dave



Re: Idea for reducing disk IO on tagging operations

2005-03-20 Thread Dr. David Alan Gilbert
* Mark D. Baushke ([EMAIL PROTECTED]) wrote:
> Dr. David Alan Gilbert <[EMAIL PROTECTED]> writes:
> > 
> > OK, if I create a dummy ",foo.c," before
> > modifying (or create a hardlink with that name
> > to foo.c,v ?) would that be sufficient?
> 
> I would say that it is likely necessary, but may
> not be sufficient.

Hmm ok.

> > Or perhaps create the ,foo.c, as I normally
> > would - but if I can use this overwrite trick on
> > the original then I just delete the ,foo.c,
> > file.
> 
> I am unclear how this lets you perform a speedup.

I only create the ,foo.c, file - I don't write anything into it; the
existence of the file is enough to act as the RCS lock; if I can do my
inplace modification then I delete this file after doing it, if not then
I proceed as normal and just write the ,foo.c, file and do the rename
as you normally would.
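
A minimal sketch of the empty lock file idea, assuming plain POSIX calls
rather than whatever helpers CVS itself uses. The function names
take_comma_lock and drop_comma_lock are hypothetical. Note that O_EXCL
has historically been unreliable on some NFS setups, which matches the
NFS caveats elsewhere in this thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create an empty lock file such as ",foo.c,".  O_CREAT | O_EXCL makes
 * open() fail if the file already exists, so only one process can
 * create it.  Returns 0 if we now hold the lock, -1 otherwise.
 */
static int take_comma_lock(const char *lockpath)
{
    int fd;

    assert(lockpath != NULL);
    fd = open(lockpath, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd == -1)
        return -1;      /* typically EEXIST: someone else holds it */
    close(fd);          /* nothing is written into the file */
    return 0;
}

/* Remove the lock file after a successful in-place update. */
static int drop_comma_lock(const char *lockpath)
{
    assert(lockpath != NULL);
    return unlink(lockpath);
}
```

If the in-place update is not possible, the caller would keep the lock
file, write the new contents into it and rename it over foo.c,v as usual.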

> > Is the problem that things are allowed to read
> > the original foo.c,v while you are creating the
> > new version?
> 
> I am given to understand that many of the
> ancillary tools that surround CVS make use of
> being able to read a consistent ,v file at all
> times.

This is very tricky; I don't think in our case we use any such tools
(we might have a cvs/web thing for browsing it, but this is probably
not critical); and as long as I can guarantee what I do is safe as far
as CVS itself is concerned I think I'd be prepared to go for it as a
configurable mechanism.

> > So the issue is what happens if the interrupt
> > occurs as I'm overwriting the white space to add
> > a tag; hmm yes; 
> 
> Correct. Depending on the filesystem kind and the
> level of I/O, your rewrite could impact up to three
> fileblocks and the directory data.
> 
> > is it possible to guard against this by using a
> > single call to write(2) for that? 
> 
> Not for all possible filesystem types.
> 
> > Is that the problem you are thinking of?
> 
> Yes. Even worse things can happen in this regard
> if the filesystem is a 'stateless' one such as an
> NFS mounted directory (we keep advising folks
> against using them, but I know for a fact that
> they are still used).

OK, my conscience will let me carefully ignore NFS issues given the
pain it causes me elsewhere (and I make my mechanism switchable).
What happens if I only used the overwrite mechanism if
none of the characters being modified crossed a 512 (e.g.) byte
boundary offset in the file?  Since the spaces were actually
written in a previous operation we can assume that the space
is allocated and no allocation operation is going to happen
at this point (mumble filesystem journalling mumble!).
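
The boundary test itself is simple arithmetic. A small sketch follows;
the function name fits_in_one_block is hypothetical, and 512 is only the
example figure used above, since the real block or sector size depends
on the filesystem and device.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if the byte range [offset, offset + len) lies entirely inside
 * one block of `block` bytes, 0 if it touches more than one block.
 * An empty range is treated as fitting.
 */
static int fits_in_one_block(unsigned long long offset, size_t len,
                             unsigned long long block)
{
    assert(block > 0);
    if (len == 0)
        return 1;
    /* Compare the block index of the first and the last byte written. */
    return offset / block == (offset + len - 1) / block;
}
```

For example, 12 bytes at offset 500 cover bytes 500..511 and stay in the
first 512-byte block, while 13 bytes at the same offset spill into the
second block and would force the normal rewrite.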

> > Sure, separating the tagging data out is much
> > neater; but what I was looking for here was a
> > simple speed up which didn't require anything
> > extra and would be fully compatible with
> > existing tools.
> 
> And you are finding that existing tools torture
> the assumptions you are able to make about the CVS
> repository.

Nod; it is quite painful!

> FWIW: (In my personal experience) using a SAN
> solution for your repository storage allows you
> much better throughput for all write operations in
> the general case as the SAN can guarantee the
> writes are okay before the disk actually does it.

But when you throw a GB of writes at them in a short time from a tag
across our whole repository they aren't going to be happy - they are
going to want to get rid of that backlog of write data ASAP.

> Optimizing for tagging does not seem very useful
> to me as we typically do not drop that many tags
> on our repository.

In the company I work for we are very tag heavy, but more importantly
it is the tagging that gets in people's way and places the strain
on the write bandwidth of the discs/RAID.

Dave


Re: Idea for reducing disk IO on tagging operations

2005-03-20 Thread Dr. David Alan Gilbert
* Paul Sander ([EMAIL PROTECTED]) wrote:

Hi Paul,
  Thanks for the reply,

> Everything that Mark says is true.  I'll add that some shops optimize 
> their read operations under certain conditions, and such optimizations 
> would break if the RCS files are updated in-place.
> 
> What happens is that, if the version of every file can be identified in 
> advance (using version number, tag, or branch/timestamp pair) then they 
> can invoke RCS directly to fetch file versions, read metadata, and so 
> on.  This sidesteps CVS' overhead and can increase performance by as 

So are these tricks *never* performed by cvs itself? i.e. would my
trick (if I can solve the interrupted write case) be completely
safe with any use of cvs as long as you didn't access the files
externally?

Dave


Re: Idea for reducing disk IO on tagging operations

2005-03-20 Thread Dr. David Alan Gilbert
[Resend: I sent it with the wrong 'from' address - apologies
if you get both]

* Mark D. Baushke ([EMAIL PROTECTED]) wrote:

Hi Mark,
  Thanks for your reply.

> Dr. David Alan Gilbert <[EMAIL PROTECTED]> writes:
> 
> > So - here are my questions/ideas - I'd appreciate comments to tell
> > me whether I'm on the right lines:
> >   1) As I understand it the tag data is the
> >   first of the 3 main data structures in the RCS
> >   file (tag, comments, diffs) and that when I do
> >   pretty much any CVS operation I rewrite the
> >   whole file - is this correct?
> 
> CVS write operations on a foo.c,v repository file
> will write ,foo.c, and then when the write
> operation is successful and without any errors, it
> does a rename (",foo.c,", "foo.c,v"); to make the
> new version the official version. While the
> ,foo.c, file exists, RCS commands will consider
> the file locked.
> 
> It is desirable to use RCS write semantics as many
> other tools out there (cf, ViewCVS) use RCS on the
> repository and want to obey RCS locking.

OK, if I create a dummy ",foo.c," before modifying (or create a hardlink
with that name to foo.c,v ?)  would that be sufficient?  Or perhaps create
the ,foo.c, as I normally would - but if I can use this overwrite trick
on the original then I just delete the ,foo.c, file.  Is the problem that
things are allowed to read the original foo.c,v while you are creating
the new version?
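
For reference, the write-then-rename protocol Mark describes can be
sketched roughly as below. This is a simplified illustration and not
CVS's RCS_rewrite: the function name rewrite_via_rename is hypothetical,
and a real implementation would create the comma file exclusively so that
it also acts as the lock, which this sketch leaves out.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write `len` bytes of `data` to tmp_path (e.g. ",foo.c,") and, only if
 * every step succeeded, rename it to final_path (e.g. "foo.c,v").
 * Readers see either the old file or the complete new one.
 * Returns 0 on success, -1 on any error.
 */
static int rewrite_via_rename(const char *tmp_path, const char *final_path,
                              const char *data, size_t len)
{
    FILE *fp;

    assert(tmp_path != NULL && final_path != NULL && data != NULL);
    fp = fopen(tmp_path, "wb");
    if (fp == NULL)
        return -1;
    if (fwrite(data, 1, len, fp) != len) {
        fclose(fp);
        remove(tmp_path);
        return -1;
    }
    if (fclose(fp) != 0) {      /* buffered write errors show up here */
        remove(tmp_path);
        return -1;
    }
    /* On POSIX systems rename() replaces final_path in a single step. */
    return rename(tmp_path, final_path) == 0 ? 0 : -1;
}
```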

> be configured). So, yes, whitespace is mostly
> irrelevant between sections.

Great.

> >   3) So the idea is that when I add a tag I add
> >   a bunch of white space after the tag (let's say
> >   1KB of spaces split into 64 byte lines or
> >   similar); when I come to add the next tag I
> >   check if there is plenty of white space, if
> >   there is then instead of rewriting the file I
> >   just overwrite the white space with my new tag
> >   data; if there is no space then as I rewrite
> >   the file I add another lump of white space.
> 
> This has the potential to more easily corrupt the
> RCS file if the operation is interrupted for any
> reason.

The rewrite that adds the extra space would be performed using the existing
mechanism (with just some extra space added in RCS_rewrite);
so that can't be a problem.

So the issue is what happens if the interrupt occurs as I'm overwriting
the white space to add a tag; hmm yes; is it possible to guard against
this by using a single call to write(2) for that?  Is that the problem
you are thinking of?

> It would be more robust to enhance CVS to use an
> external database for tagging information instead
> of putting the tagging information into the RCS
> files directly than to rewrite parts of the RCS
> file and hope that the operation didn't corrupt
> the file along the way.

Sure, separating the tagging data out is much neater; but what I was
looking for here was a simple speed up which didn't require anything
extra and would be fully compatible with existing tools.

> You may wish to consider looking at Meta-CVS as I
> believe that Kaz keeps a lot of the branching
> information outside of the RCS files already.
> 
> See http://users.footprints.net/~kaz/mcvs.html
> for more details on Meta-CVS.

If I was changing to another tool then I'd have a much larger set of
tools to consider (e.g.  subversion) but I'd rather stick with plain CVS
if I can - I've got clients on lots of (weird) OSs that work via pserver
and an infinite number of scripts built around CVS.

Thanks for the reply,

Dave


Idea for reducing disk IO on tagging operations

2005-03-20 Thread Dr. David Alan Gilbert
Hi,
  I maintain a system that is used to hold a rather large
CVS repository (~1GB give or take) which could do with being faster.
Tagging in particular is slow and I don't think cpu or ram is the
issue (it is a dual xeon with 3GB of RAM).

My suspicion is that at least one of the problems is that when
a tag is added most of the rcs files are rewritten giving a sudden
large amount of data that must be written to disc.

So - here are my questions/ideas - I'd appreciate comments to tell
me whether I'm on the right lines:
  1) As I understand it the tag data is the first of the 3 main
  data structures in the RCS file (tag, comments, diffs) and that
  when I do pretty much any CVS operation I rewrite the whole file -
  is this correct?

  2) White space appears to be irrelevant in RCS files; so adding
  arbitrary amounts in between sections should leave files still
  fully compatible with existing RCS/cvs tools.

  3) So the idea is that when I add a tag I add a bunch of white
  space after the tag (let's say 1KB of spaces split into 64 byte
  lines or similar); when I come to add the next tag I check if
  there is plenty of white space, if there is then instead of
  rewriting the file I just overwrite the white space with my
  new tag data; if there is no space then as I rewrite the
  file I add another lump of white space.

  4) Whether dummy white space is added and how much is controlled
  by the existing size of the RCS file; so an RCS file that is only
  a few KB won't have any space added; that way this mechanism doesn't
  slow down/bloat small repositories.  The amount of white space might
  be chosen to align data structures with disk block boundaries.

  5) My main concern is to do with concurrency/consistency requirements;
  is the file rewrite essential to ensure consistency, or is the
  locking that is carried out sufficient?
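
Points 3 and 4 can be sketched as two small helpers. Everything specific
here is an assumption made for illustration: the names padding_for and
write_padding are hypothetical, and the 16KB threshold is made up. Only
the "1KB of spaces split into 64 byte lines" figure comes from the text
above.

```c
#include <assert.h>
#include <stdio.h>

#define PAD_LINE 64   /* bytes per padding line, newline included */

/*
 * Decide how much padding to add when an RCS file is rewritten.
 * The threshold is invented for this sketch: files under 16KB get no
 * padding, larger files get 1KB.
 */
static size_t padding_for(size_t rcs_size)
{
    return rcs_size < 16 * 1024 ? 0 : 1024;
}

/*
 * Write `total` bytes of padding to `out` as lines of PAD_LINE bytes:
 * 63 spaces followed by a newline.  A final partial line, if any, is
 * spaces ending in a newline.  Returns 0 on success, -1 on write error.
 */
static int write_padding(FILE *out, size_t total)
{
    size_t i;

    assert(out != NULL);
    for (i = 1; i <= total; i++) {
        int c = (i % PAD_LINE == 0 || i == total) ? '\n' : ' ';

        if (fputc(c, out) == EOF)
            return -1;
    }
    return 0;
}
```

Splitting the padding into short lines keeps the file readable in an
ordinary editor while still being plain whitespace to the RCS parser.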
  
Does this make sense?

Dave
