On Sun, 3 Dec 2006, Nick Guenther wrote:

> > the speed problem is going to be in decrypting large
> > files from the vnd to calculate diffs
> Well that's what timestamps are for! You only need to diff files that
> differ in their timestamps, and decompressing the parts of the vnd
> that contain the timestamps is much less taxing than decompressing the
> entire files. How do you think cvs works? By comparing every file
> every time?

Sometimes I get the feeling that cvs issues requests for proposals
to write the damn file I want.  I have no idea how cvs works with
entire directories of data.  When I do a cvs up on /usr/src,
a great deal of information is passed to the server.  Presumably
these are cksums or modification dates.

Yes, I know what you're talking about.

Yes, I do think that cvs compares every file every time a changed
file is checked back in.  And I do believe it applies a series of diffs
against some sort of root version every time a file is checked out.

> >  Also, we don't know the internal structure
> > of these files, and how they change.  Rewritten? 10 random bytes
> > changing in a 1G file?  SQL stuff?  Better not to know or guess.
> 
> Why is this relevant?

Because the design of a solution will depend on the nature of the
data, instead of trying to accommodate all kinds of files, with all
kinds of data.  Perhaps an optimization is possible?  We are trying
to be "optimum" in some sense.  Sometimes special cases generate
marvelously clever solutions.  (Suppose there are two datasets,
A and B and B is created by running an arbitrary C program on A.
All that need be stored is the C program, or a set of switches to
an unchanging C program.  Depends on how expensive running the 
program is. etc etc -- typical example: B is the fast Fourier
transform of A.  Another: A is a time history of say many parameters
over a long time.  B is an extract of some parameters over a part of
the time history.  There is never a need to archive B, only the parameters
selected, the time interval and sampling frequency and the name of A 
need to be archived.)
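
A minimal sketch of the second example, under loud assumptions: A is
taken to be whitespace-separated text with the time in column 1, and
the name extract_b and its argument layout are made up for
illustration.  Nothing in this thread says the real files look like
this.  The point is only that B can be regenerated from A plus four
small arguments.

```sh
#!/bin/sh
# Sketch: regenerate extract B from time history A instead of archiving B.
# ASSUMPTION: A is whitespace-separated text with the time in column 1.
# extract_b is a hypothetical name, not an existing tool.
# usage: extract_b A_FILE T_START T_END COLUMN
extract_b() {
    # Print the time and the chosen column for rows inside [T_START, T_END]
    awk -v t0="$2" -v t1="$3" -v col="$4" \
        '$1 >= t0 && $1 <= t1 { print $1, $col }' "$1"
}
```

With this, the only thing worth archiving for B is the line
"extract_b A 2 3 3" (or whatever the arguments are), not B itself.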

> > The advantage
> > I see to rdist is that it results in a mirror of various directory
> > trees on the USB disk, making any manual inspection or restoration
> > "self-documenting".
> 
> Doesn't cvs do this too?

No, not as I see this developing.  Cvs doesn't mirror the working
data sets.

I cannot imagine a method using cvs in which the cvs repository
does not lie on the "USB" (archive) disk.  I feel uneasy with that.
The physical security of the disk is not guaranteed.  It could be
corrupted by random enemies or negligent cow-orkers. ("I will borrow
Vim's disk for my pr0n. Hmmm, what is this big junk file?" or more
likely, "I will borrow Vim's disk.  <tries to mount it on Windoze
box, it fails because it's a righteous BSD disk> I will fix this
error by clicking on 'reformat'.")

We are not interested, either, as I understand the problem, in
keeping versions of the file.  Thus cvs/svn is overkill.  Vim may
wish, however, to keep versions in future.  I think the tentative
solution I have (and will exhibit for criticism in another post)
will be "modular" enough to allow for a changeover to some sort of
versioning software.  (rdist will do a sort of versioning, too).

At present the job is being done by (approximately):

        mount usbdisk /mnt
        tar czpf - list-of-directories | crypto program >/mnt/name.tgz.cr

The motivation for a new method is the execution time of that pipe.

Size of name.tgz.cr as presently reported is about 4 GB.

We have plenty of disk space on the archive disk and the working disk.
Disk space can be spent freely.

> > Yet the idea of storing diffs of some kind is
> > very enticing, especially if they could be computed on the fly,
> > stored on the "master" (the laptop's main disk[s]) and saved to
> > "slave" (the USB) batch-wise.
> 
> This sounds like there's lots of room for breakage. So to restore you
> have to apply each and every diff to each file one after the other?

Well, no.  cvs does essentially this, right?

> And how would they be computed on the fly?

Computed during the day, before "backup".   One might check
everything in to a local repository, and mirror the repository.
I am not looking into that method, though.

>                                            Some sort of kernel hook
> that notices whenever a file is accessed?

I think you haven't paid much attention to what I'm proposing, or I've
obscured what I want with a bunch of rambling drivel. 

I'm proposing the use of standard tools, not some ginned up weird
script using "diff -u" or something.  When I say "store diffs" I'm
referring to the internal action of cvs/svn.

The kernel hook you mention is called the "atime". (man 2 stat).

mtime (modification time) is better, though.

Rdist can work with a list of mtimes, BTW.  Such a list could be
updated by a cron job.  I'm not proposing this at this time, though.
(This might be version 1.1).
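
For the record, a rough sketch of how such a list of mtimes could be
produced with nothing but find(1).  The function name changed_since
and the stamp-file convention are my own assumptions, not anything
rdist requires.

```sh
#!/bin/sh
# Sketch: list regular files under DIR modified after STAMPFILE was
# last touched.  changed_since and the stamp file are hypothetical.
# usage: changed_since STAMPFILE DIR
changed_since() {
    stamp=$1
    dir=$2
    # -newer compares modification times (mtime), not access times
    find "$dir" -type f -newer "$stamp" -print
}
```

A cron job could run this against the working directories and touch
the stamp file after each successful backup, so the next run only
sees files changed since then.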

> > My long-winded posts are serving to educate first me and then Vim
> > (and anybody else who is interested, including Mr Google and Mr
> > Archives) about "what is a filesystem", "what is a [s]vnd" and "what
> > is rdist".
> 
> A very good mindset!

No one has ever said that about my warped brain before. Usually I
am told to Seek Professional Help or to Bugger Off.

> Wait, Openbsd-newbies has archives?

Presumably, I presume.  It could.  Mailman does archives. No archives?

> > Maybe I've missed something here.  Maybe all the diff calculating
> > would be done on plaintext files on the "master"?  There's a design
> > point here: the whole data collection has to be reconstructable if
> > the laptop is crushed by a bus.  So "version 0.0.0" of the data has
> > to be on the USB.  But that's no penalty, really, just a one-time
> > event. [thinking out loud here].  I need to look at subversion.
> > Can you "advocate" it over rcs/cvs?  In a way it's all same-same.
> 
> No I can't. Subversion is the successor to CVS though (I can say that

Let's say "subversion is a program proposed to replace CVS by
subversion's authors."

Other men (http://www.openbsd.org/opencvs) claim that rumors of
CVS's demise are exaggerated.

> with impunity because several of the original cvs developers are on
> svn) so without looking at anything else, presumably it's better.

Somehow, I can't imagine OpenBSD ever including python etc etc into
the base distribution.  (Perl is being considered for removal because
of its stunning cruft/bloat/hole count.)  So I can't conceive of
svn as a "successor" to cvs, until svn is turned into a Real Unix
Program by rewriting it all in C and using standard libraries.  Svn
will have a place in the Linux world, where there are no standards,
and in windoze hell where there are no programs.  But *BSD already
has cvs.  The place for svn is in "enterprise" or "institute"
situations with a heterogeneous environment.  svn looks very suited
for such.  I assume that it is, like most "new age" programs, not
backward compatible with either cvs scripts or repositories.

cvs is Good Enough.  No programmer ever believes that his pet project
is over.  svn appears to be a creation of such programmers.  Yes,
I'm aware that this is how progress occurs!  But there should be a
constant struggle to progress, with a "dialectic" pitting old-farts
like me against the young Turks.  It keeps the youngsters sharp.
It exposes and weeds out errors that would otherwise gain a life
of their own (like windoze -- "Real Programmers" ignored microsoft
and the PeeCee until it was Too Late.  But that's another thread).

(Putting on my Chief Engineer/Project Manager hat (a shiny helmet
made from a pasta strainer and ostrich feathers)): Svn is not
suitable for this project because svn does not have man pages.  That
*alone* is sufficient to disqualify.  That disqualifies because it
proves (beyond further argument) that svn is not a Real Unix Utility
(all of which have man pages).  This means that svn cannot be used
by a casual day-to-day user, and has a learning curve that includes
getting a BOOK and STUDYING it.  This means that svn is an *opaque*,
*sealed* program, whose source code might as well be kept in a
vault, because no single person can read it and understand it.  I
am not looking for a program to play chess with, but one for
*sensitive* use on mission-critical data.  CVS is probably unacceptable,
too, at this point.  At the end of the "meeting" I would say that
it is a Bad Idea to use a tiny feature of a large/massive program
designed for another purpose.  The tiny feature of svn that would
be used is "cmp" and "cp". (In essence).

        for file in name1 name2 name3 ....
        do
                # cmp -s prints nothing; a non-zero exit means the copies differ
                if ! cmp -s "/here/$file" "/there/$file" ; then
                        cp "/here/$file" "/there/$file"
                        echo "Updated /here/$file"
                fi
        done 2>&1 | mail -s "Backup stuffs" vim

That's the whole thing (except for crypto.  But svnd will be doing
the crypto transparently).  If files that change change in their first
few bytes (for loose meanings of "few"), this scriptlet might be
perfectly adequate and quite snappy.  (Another reason to know the
internal nature of the datafiles.  Suppose they had a cksum or
datestamp in a header?  Then that script would be all that could
be asked for.)  Note, cmp exits at the first diff.
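
If the files really did carry a cksum or datestamp in a header, the
comparison could stop after the header even for files that have NOT
changed (the case where cmp has to read everything).  A sketch,
assuming a 512-byte header by default; the header size, the function
name same_header, and the existence of such a header at all are
guesses on my part.

```sh
#!/bin/sh
# Sketch: succeed (exit 0) when the first BYTES bytes of two files match.
# ASSUMPTION: the data files keep a cksum or datestamp in a leading
# header; this has not been confirmed.  same_header is a hypothetical name.
# usage: same_header FILE1 FILE2 [BYTES]
same_header() {
    bytes=${3:-512}
    # Exit 2 if either file is unreadable, mirroring cmp's "trouble" status
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # dd reads one block of $bytes from each file; od turns it into hex text
    h1=$(dd if="$1" bs="$bytes" count=1 2>/dev/null | od -An -tx1)
    h2=$(dd if="$2" bs="$bytes" count=1 2>/dev/null | od -An -tx1)
    [ "$h1" = "$h2" ]
}
```

In the scriptlet above, "if ! same_header /here/$file /there/$file"
could then stand in for the cmp -s test, at the price of trusting
the header.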

We are encrypting and archiving sensitive, valuable data.  Third-party
software (any port) is probably out of the question for that reason alone.
A software that has a mailing list for its bugs is not suitable. Period.
We need bugless software.  (Said the woodchuck, spurring a reluctant
Rozinante to an uneven trot and turning to the next windmill.)

> > Another design-point -- the data collection has to be present and
> > usable regardless of the USB, i.e. the "repository" has to reside
> > on the laptop; the repository would have to be mirrored to the USB.
> > Maybe subversion has an efficient repository mirroring feature.
> 
> Right. I don't have a solution for this problem yet. I also don't know
> enough about these tools to say.

I have a tentative solution now.  Will post tonight or tomorrow.

Rdist is the efficient repository mirroring tool.  I'll "advocate" it
in a separate post.  (With benchmarks).  The benchmarks are puzzling
in certain ways (to be disclosed later).

> > make'ing fetch of subversion now .... yikes, it's huge and it smells
> > like it wants python, and it might install a half-dozen libraries
> > of its own.  My idea of a system utility is something that lives
> > in /usr/bin and takes up about 55KB of disk.  Like rdist ;-)
> > (Source code of rdist and rdistd tar to a .tgz of 76KB -- subversion
> > is around 6000KB for the source.tgz).

A tgz of /usr/src/gnu/usr.bin/cvs comes to 2.5MB.  It has horrifying
"info" documentation. (info2html is my friend).  I say this in *defense*
of svn.

> Ick! It probably wants python for SCons (to avoid fighting with make).
> But I don't know about the rest.

Most of the cruft is needed during its compilation.  But 39 separate
libraries for a "copy" program is simply indefensible.  That's just
goofy.

Dave
-- 
  "Confound these wretched rodents! For every one I fling away,
               a dozen more vex me!" -- Doctor Doom
_______________________________________________
Openbsd-newbies mailing list
[email protected]
http://mailman.theapt.org/listinfo/openbsd-newbies
